Xinglab / rmats-turbo

Other
223 stars 55 forks source link

Running post step separately #287

Open EfiRahamim opened 1 year ago

EfiRahamim commented 1 year ago

Hi, I have a question regarding the post step. I ran rMATS, using Docker, without separating the steps and I got the results. Now I'm interested in running again rMATS, but with the "--paired-stat" flag, on the same BAM files. So instead of running the whole program again, I was wondering if I could run just the post step, on the ".rmats" files that I got from the previous run (which are not separated by "prep_1"/"prep_2" prefixes). Can the "post" step run on ".rmats" files that were not generated by running the "prep" step separately? Thank you in advance, Efi

EricKutschera commented 1 year ago

Yes, you can use the .rmats files from a previous run of rMATS that used --task both (the default)

EfiRahamim commented 1 year ago

Thank you. I ran --task post on the previous .rmats files but I did not get any significant results. Is that possible? Also, why the exon events count in the summary file does not match the exon events count in the output?

from the summary.txt: EventType TotalEventsJC TotalEventsJCEC SignificantEventsJC SigEventsJCSample1HigherInclusion SigEventsJCSample2HigherInclusion SignificantEventsJCEC SigEventsJCECSample1HigherInclusion SigEventsJCECSample2HigherInclusion SE 202784 203877 0 0 0 0 0 0 A5SS 0 0 0 0 0 0 0 0 A3SS 0 0 0 0 0 0 0 0 MXE 65077 65083 0 0 0 0 0 0 RI 0 0 0 0 0 0 0 0

from the output:

gtf: 140.19701004 There are 82805 distinct gene ID in the gtf file There are 82805 distinct transcript ID in the gtf file There are 82805 one-transcript genes in the gtf file There are 986495 exons in the gtf file There are 4045 one-exon transcripts in the gtf file There are 4045 one-transcript genes with only one exon in the transcript Average number of transcripts per gene is 1.000000 Average number of exons per transcript is 11.913471 Average number of exons per transcript excluding one-exon tx is 12.473972 Average number of gene per geneGroup is 1354.767261 statistic: 0.020406961441 loadsg: 57.4373350143

========== Done processing each gene from dictionary to compile AS events Found 204286 exon skipping events Found 65088 exon MX events Found 0 alt SS events There are 0 alt 3 SS events and 0 alt 5 SS events. Found 14 RI events

ase: 13.0432591438 count: 2406.70595002 Processing count files. Done processing count files.

Thanks again. Efi

EricKutschera commented 1 year ago

The significance depends on the read counts and it is possible that there are no significant events. The read counts and p-values are shown in the SE.MATS.JC.txt output file. Can you post some or all of that file?

The event counts in the summary file are based on the final output files which have a filtered set of events. The event counts printed to stdout are not filtered. The filter requires at least 1 read for each isoform and at least 1 read for each sample group: https://github.com/Xinglab/rmats-turbo/blob/v4.1.2/rmats.py#L303

EfiRahamim commented 1 year ago

I found out what was my problem, I think. The GTF file I used for rMATS was not the same as the GTF file that was used for running STAR on the samples. But I encountered a different problem -
I ran again, prep and post steps separately, with the correct --gtf file. I have 2 groups, normal and tumor, containing 112 samples each. All the P-values and FRD are 1. I read previous posts about that issue and I understood that it might relate to the read counts. In my results, the IJC of both samples seems ok but the SJC's are not (most of them are zero). What can cause this?

Attaching here some of the outputs: log_post.txt 2023-04-24-08_09_36_183785_read_outcomes_by_bam.txt log_prep_tumor.txt log_prep_normal.txt 2023-04-24-08_02_41_003042_read_outcomes_by_bam.txt SE.MATS.JC_example.txt

The [AS].MATS.JC/EJC.txt are pretty big and copy some of them is a challenge since the IJC/SJC_SAMPLE columns are long so I took about 100 rows from the SE.MATS.JC and copy it into a new file.

Thanks again for your help and the quick responses.

EricKutschera commented 1 year ago

I don't see anything obviously wrong with those files. The p-values are all 1 which is strange, but it makes sense based on the read counts. For each event in the example file, almost all of the reads support just 1 of the isoforms. For most events it's the inclusion isoform that has almost all of the counts, but there are some events where the skipping isoform has almost all of the counts

It could be that this dataset just doesn't have evidence for alternative splicing. Maybe there was something about how the reads were generated or the alignment procedure that caused the alignments to be off in some way

EfiRahamim commented 1 year ago

Can it be related to the --gtf file perhaps? In my case, the GTF file is of the "gencode" annotation, v28.