Xinglab / rmats-turbo

Other
228 stars 55 forks source link

Get NA in Inclevel1 column when omit b2 files #100

Open NancyZhang97 opened 3 years ago

NancyZhang97 commented 3 years ago

Hi! Thank you for your tool! I'm new to rMATS and now working on the version 4.1.1. I hope to process one single bam file and get the "Inclevel1" for further analysis, so I use --statoff and my code is:

python /home/usr/.conda/envs/envs-name/rMATS/rmats.py --b1 /path/to/b1.txt --gtf /path/to/Homo_sapiens.GRCh38.99.gtf -t paired --readLength 51 --variable-read-length --nthread 4 --od /path/to/result_try --tmp /path/to/temp --statoff

In the final outputs, such as "A3SS.MATS.JC.txt", most of values of IncLevel1 are "NA", just like below:

image

Is it caused by only using single group or the issue of my data?

EricKutschera commented 3 years ago

rMATS detects events that are annotated in the gtf file and it also detects events based on the input bam file. I think that the rows that you are seeing with IncLevel1 as "NA" are for events that were only detected from the gtf file and there are no reads to support those event from your bam file. You could check that by looking at the count columns (IJC_SAMPLE_1, SJC_SAMPLE_1).

When --statoff is used rMATS does not filter the output file. In your case you may want to filter out rows that have 0 for both IJC_SAMPLE_1 and SJC_SAMPLE_1

NancyZhang97 commented 3 years ago

rMATS detects events that are annotated in the gtf file and it also detects events based on the input bam file. I think that the rows that you are seeing with IncLevel1 as "NA" are for events that were only detected from the gtf file and there are no reads to support those event from your bam file. You could check that by looking at the count columns (IJC_SAMPLE_1, SJC_SAMPLE_1).

When --statoff is used rMATS does not filter the output file. In your case you may want to filter out rows that have 0 for both IJC_SAMPLE_1 and SJC_SAMPLE_1

rMATS detects events that are annotated in the gtf file and it also detects events based on the input bam file. I think that the rows that you are seeing with IncLevel1 as "NA" are for events that were only detected from the gtf file and there are no reads to support those event from your bam file. You could check that by looking at the count columns (IJC_SAMPLE_1, SJC_SAMPLE_1).

When --statoff is used rMATS does not filter the output file. In your case you may want to filter out rows that have 0 for both IJC_SAMPLE_1 and SJC_SAMPLE_1

Oh so I need to delete rows that both those 2 columns are 0. Thank you so much!

And I have another question. I just find a closed issue that talked about using rMATs with one group(the last closed issue: #1). They said it was possible to list all of the BAM files of one group in one --b1 files to get IncLevel. I have a few bam files of one group, but I read the manual and it says --b1 contains different replicates. Is it OK to process bam files from different samples in one b1 files? Will it influence the number of detected AS events?

EricKutschera commented 3 years ago

rMATS will use the input gtf and all of the input bam files to detect events. If you provide additional/different input files then rMATS may detect different events. If you want to know which events can be detected for each bam then you would have to run them separately

You can run all of your bam files together if you want. The IncLevel and count columns are comma separated so you can still get the information for each individual bam file

NancyZhang97 commented 3 years ago

rMATS will use the input gtf and all of the input bam files to detect events. If you provide additional/different input files then rMATS may detect different events. If you want to know which events can be detected for each bam then you would have to run them separately

You can run all of your bam files together if you want. The IncLevel and count columns are comma separated so you can still get the information for each individual bam file

So if I want to know the common AS events in a group of samples, I can run them together, and running seperately may have different events combination for different sample.

Thank you so much for your reply!

EricKutschera commented 3 years ago

So if I want to know the common AS events in a group of samples, I can run them together

Finding the common set of AS events for a group of samples from rMATS output can be complicated. Running all the samples together will give you all the AS events detectable by any individual sample and also potentially some events that are only detectable by combining reads from multiple samples

If you want to know the common AS events in a group of samples then you would have to run each sample individually and check the output for each sample to see which events are common to all samples. Some of the events are detected just using the gtf (without using the info in the bam). You can find those events by looking at the different fromGTF.* files described here: https://github.com/Xinglab/rmats-turbo/tree/v4.1.1#output

Events that are detected just based on the gtf are the events that are in fromGTF.[AS_Event].txt, but not in either fromGTF.novelJunction.[AS_Event].txt or fromGTF.novelSpliceSite.[AS_Event].txt. Those events will be common to all samples, but also some of the "novel" events may be common to all samples

A simpler approach would be to just run all of the samples together and then find the events where all of the samples have a non-zero count in both the IJC_SAMPLE_1 and SJC_SAMPLE_1 columns. This simpler approach does have an issue though, which is that it is possible that rMATS would not detect an event when running for a single sample even if that sample has a supporting read for both the inclusion and skipping isoform. For example, the diagram for the SE event (https://github.com/Xinglab/rmats-turbo/tree/v4.1.1#output) has two junctions in the inclusion isoform (the junctions with the red lines). If the gtf does not annotate either of those junctions then in order to detect the event rMATS would need the sample to have supporting reads for both inclusion junctions. However, the output column IJC_SAMPLE_1 does not say which of the inclusion junctions contributed to the counts. Also the output does not say which of the junctions where annotated in the gtf.

So in summary:

NancyZhang97 commented 3 years ago

Many useful tips! I will double check my results. Thank you so much!

beaferbl commented 3 years ago

Hi! this thread is related to a problem I have. I want to get all the AS events in several samples without comparing conditions, so I run rMATS parsing all the bam paths in --b1 and running with --statoff; prep and post steps separately. However, there are events that are not detected but I know they exist because I have seen them in IGV and there are reads suppporting both isoforms. This happens even if I do the prep and post steps for a single sample. Is there an explanation for this issue? Thanks.

EricKutschera commented 3 years ago

Running with all bams in --b1 and with --statoff will let rMATS detect as many events as it can, but there are some events that rMATS will not detect. rMATS requires at least part of a splicing event to be annotated in the gtf file in order to detect it. If all of the exons and exon-exon junctions are annotated, then the event will only be in the fromGTF.[AS_Event].txt file. If the event has annotated exons but an unannotated exon-exon junction then it will be in the fromGTF.novelJunction.[AS_Event].txt file. If --novelSS is used then an event can have an annotated exon, but with one splice endpoint changed from the annotation, and the event will be in the fromGTF.novelSpliceSite.[AS_Event].txt file. The "novel" events are only detected if there are reads to support those events

Do the events that you see in IGV that are not in the rMATS output have coordinates that match up with the gtf file you are giving to rMATS? If you can post the lines from the gtf and the alignments from the bam that you think should be sufficient for rMATS to detect the event, then I can check to see exactly why rMATS does not detect the event

beaferbl commented 3 years ago

Thank you for answering so quickly. I looked at the GTF and I think the problem is the two transcripts have different start coordinates and the event I'm looking for is complex, maybe I can do some changes in the gtf. I also found this thread clarifying https://github.com/Xinglab/rmats-turbo/issues/17

Yuzhang0102 commented 1 year ago

So in summary:

  • When rMATS is run with multiple samples, the output doesn't include enough information to say which samples had the necessary supporting reads to detect which novel isoforms
  • You can run each sample individually to see which events are detected in that sample
  • You can see which events are annotated by looking at the fromGTF.* files

Hi, I'm new to run rMATS, and I also want to get PSI value of all samples, similarly I run rMATS parsing all the bam paths in --b1 and running with --statoff and prep and post steps separately. I have two questions about what you said:

  1. I met a question in *.MATS.JC.txt, there are same exon events in the outputs but they have different IncLevel1, I can't figure out it. Like this: 340 and 341 are the same SE events with same exon_coords, but have different IncLevels1. image

  2. Can I know which samples detect which novel isoforms according to the IJC_SAMPLE_1 and SJC_SAMPLE_1 in *.MATS.JC.txt? Because one sample have one number in IJC_SAMPLE_1 and SJC_SAMPLE_1, if the number is not 0 in IJC_SAMPLE_1, can I say the sample detect this isofom?

EricKutschera commented 1 year ago

Those two events have different coordinates. There are 8 coordinates for each event and both events share the first 6 (78446655, 78446769, 78473117, 78473246, 78421080, 78421262). The last 2 coordinates differ: (78501196, 78501226), (78533593, 78533722)

An SE event would only have 6 coordinates so this is not an SE event. It looks like an MXE event where the downstream exon is different in the two events. From the diagram in the README, changing the downstream exon for an MXE event would change some of the inclusion and skipping junctions so it makes sense that the inclusion levels are different: https://github.com/Xinglab/rmats-turbo/tree/v4.1.2#output

The IJC_SAMPLE_1 and SJC_SAMPLE_1 columns have a separate count for each sample. The counts represent the reads which support each isoform according to the diagram in the README. A non-zero count for a sample means that there is at least 1 read to support the isoform in that sample, but it doesn't mean that rmats would have detected the isoform if run with only that sample. For example, this could happen in a case where there are multiple novel junctions in an isoform (like the 2 inclusion junctions in an SE event) and only 1 of those junctions has supporting reads in a sample. The IJC_SAMPLE_1 column does not say which of the inclusion junctions contributed to the counts so you can't tell if rmats would have detected both novel junctions with just that sample

Yuzhang0102 commented 1 year ago

Thanks for your nice reply. These two are different MXE events exactly, thanks for your answer. But the alternative exon is same in the two MXE event, I want to compare the PSI value between two species according to the orthologous exons, and this alternative exon have 2 PSI values, which confused me which one I should use, can you give me some advice? I understand why IJC_SAMPLE_1 can't tell me which sample detect the isoform. If i use the simple approach to select the event that is non-zero in both IJC_SAMPLE_1 and SJC_SAMPLE_1 of all samples, these events are common AS of all samples, right? At the same time some events will be deleted though they can be detected in some samples. Do I get your points?

EricKutschera commented 1 year ago

The MXE PSI value calculation depends on the downstream exon since that changes one of the inclusion junctions. In this case the first event (ID=340) has a lot of inclusion reads, but the second event (ID=341) has very few inclusion reads. That seems to indicate that there are not many reads for the upstream junction (which is the same in both events) and only 1 downstream junction has reads. Maybe there is another isoform that uses the downstream junction but not the upstream junction. It's not clear what to do with this case. One option is to just ignore these events. Another option is to look at plots of the reads in this region to figure out what is going on (maybe with https://github.com/Xinglab/rmats2sashimiplot). There is also a branch of the rmats code that outputs the counts for each junction in the event. That code has not made it into an rmats release yet but you could try building the code yourself if you want to try it: https://github.com/Xinglab/rmats-turbo/pull/41

For event detection, what you said is mostly correct. There is also a case for events that have novel junctions and rmats would only detect them after seeing the novel junction(s) in some sample. The IJC column gives one count (per sample) for all inclusion junctions in the event. Similarly the SJC column gives one skipping count. You can't tell from the columns which samples had which junctions if there are multiple inclusion junctions or multiple skipping junctions. Just because a sample had some inclusion reads and some skipping reads, it doesn't mean that sample had a supporting read for all the junctions in the event. If the sample didn't have a read for a novel junction then rmats would not have detected that event in that sample. But as long as rmats detects the novel junctions in some samples then it will report the event along with IJC and SJC counts for all samples

Yuzhang0102 commented 1 year ago

Thanks for your thoughtful reply and advice, I learned a lot.

beaferbl commented 1 year ago

Hello, I have a question related to this. I want to get the landscape of splicing events as accurate as possible. So I am comparing previous results of running rMATS with --b1 vs -b2 arguments and the results of running the post step for each sample separately. I have used the same prep files because as I understand these are the same independently of running all the samples together or individually. There are lots of events from the b1 vs b2 results that are not detected when the same samples are run individually. I know that some events are only quantified when all the samples are run together. For example, if an exon skipping event is found in only in sample X, then the event is identified in this sample X and quantified in the others as well. When I run the same samples individually, I would expect to identify this event in sample X and not in the rest; however this is not the case. I can not find an explanation why this can happen. Is there anything that I am missing?

EricKutschera commented 1 year ago

rMATS needs to define all the exons and junctions for the splicing event in order to detect the event. For example the SE event has 3 exons and 3 splice junctions. From the diagram you can see those exons and junctions and what rMATS counts toward each isoform: https://github.com/Xinglab/rmats-turbo/tree/v4.1.2#output

It is possible that some of the information needed to detect the event came from one sample and then some additional information needed to detect the event came from a different sample (for example the upstream inclusion junction in one sample and the downstream inclusion junction in another). If those two samples are run together in the same post step then rMATS could combine that information and detect the event. If either sample is run individually then information for one of the junctions will be missing

Also, if you are trying to match up events from different runs of rMATS then you should only expect certain coordinates to match exactly. For example in the diagram for the SE event the upstreamES and downstreamEE coordinates don't change what rMATS would count as inclusion or skipping reads. If rMATS detects multiple SE events that differ only by upstreamES and downstreamEE it will only report one of those events. If you match only on the other 4 coordinates (upstreamEE, exonStart_0base, exonEnd, downstreamES) then you might find that the event is actually detected in the single sample run

beaferbl commented 1 year ago

Thank you Eric, I did not know that the information could be merged in the post step in that way.