Xinglab / rmats-turbo


rMats Only Gives Results In FromGTF* files #328

Open yr542 opened 11 months ago

yr542 commented 11 months ago

I cannot run rMATS the normal way, so I have run it from a Singularity image I pulled from here. The following variables and command are what I used:

The variables:

The command used in a script is:

singularity exec "${singularity_image}" python "${RMATS_SCRIPT}" --b1 "${control_bams_text_file}" --b2 "${experimental_bams_text_file}" --gtf "${gtf_file_path}" -t paired --readLength "${readLength}" --nthread "${nthread}" --od "${output_directory}" --tmp "${tmp_directory}"

But when I check the files outside the temporary directory I get the following:

A3SS.MATS.JCEC.txt                fromGTF.novelSpliceSite.A5SS.txt  JC.raw.input.MXE.txt
A3SS.MATS.JC.txt                  fromGTF.novelSpliceSite.MXE.txt   JC.raw.input.RI.txt
A5SS.MATS.JCEC.txt                fromGTF.novelSpliceSite.RI.txt    JC.raw.input.SE.txt
A5SS.MATS.JC.txt                  fromGTF.novelSpliceSite.SE.txt    MXE.MATS.JCEC.txt
fromGTF.A3SS.txt                  fromGTF.RI.txt                    MXE.MATS.JC.txt
fromGTF.A5SS.txt                  fromGTF.SE.txt                    RI.MATS.JCEC.txt
fromGTF.MXE.txt                   JCEC.raw.input.A3SS.txt           RI.MATS.JC.txt
fromGTF.novelJunction.A3SS.txt    JCEC.raw.input.A5SS.txt           SE.MATS.JCEC.txt
fromGTF.novelJunction.A5SS.txt    JCEC.raw.input.MXE.txt            SE.MATS.JC.txt
fromGTF.novelJunction.MXE.txt     JCEC.raw.input.RI.txt             summary.txt
fromGTF.novelJunction.RI.txt      JCEC.raw.input.SE.txt            
fromGTF.novelJunction.SE.txt      JC.raw.input.A3SS.txt

However, the fromGTF* files seem to have data, while the other files have column names but no data. Why is this occurring?

EricKutschera commented 11 months ago

The fromGTF files can have splicing events that rMATS found just from the --gtf (without using --b1 or --b2). The other output files are filtered down to splicing events that have supporting reads from the BAM files (unless rMATS was run with --statoff). If only the fromGTF files have output, then rMATS did not find enough supporting reads for any event.
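If it helps, one quick way to confirm which output files actually contain events (a minimal shell sketch; the output directory path is a placeholder) is to count the data rows in each file, since the files keep their header line even when no events pass the filters:

cd /path/to/output/directory
for f in fromGTF.*.txt *.MATS.JC.txt *.MATS.JCEC.txt; do
    # subtract 1 for the header line to get the number of event rows
    echo "$f: $(( $(wc -l < "$f") - 1 )) rows"
done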

rMATS should print to stdout a section saying how many reads were used and how many were filtered out for various reasons. This post has an example: https://github.com/Xinglab/rmats-turbo/issues/298

Can you post the stdout from your rMATS command?

yr542 commented 11 months ago

I got the following in the .out file (paths are removed as I am on an HPC that does not allow me to give the paths):

gtf: 172.01628065109253
There are 47046 distinct gene ID in the gtf file
There are 75520 distinct transcript ID in the gtf file
There are 34535 one-transcript genes in the gtf file
There are 794067 exons in the gtf file
There are 11312 one-exon transcripts in the gtf file
There are 10208 one-transcript genes with only one exon in the transcript
Average number of transcripts per gene is 1.605237
Average number of exons per transcript is 10.514658
Average number of exons per transcript excluding one-exon tx is 12.190926
Average number of gene per geneGroup is 14.973520
statistic: 0.01208353042602539

read outcome totals across all BAMs
USED: 62492
NOT_PAIRED: 113796834
NOT_NH_1: 233212017
NOT_EXPECTED_CIGAR: 14528907
NOT_EXPECTED_READ_LENGTH: 436922200
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 12125
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 1896
CLIPPED: 73091965
total: 871628436
outcomes by BAM written to:/path/to/output/directory/2023-10-10-11_31_46_842338_read_outcomes_by_bam.txt

novel: 3034.5499470233917
The splicing graph and candidate read have been saved into /path/to/2023-10-10-11_31_46_842338_*.rmats

save: 0.050202131271362305
loadsg: 0.008337736129760742

==========
Done processing each gene from dictionary to compile AS events
Found 7767 exon skipping events
Found 650 exon MX events
Found 6157 alt SS events
There are 4057 alt 3 SS events and 2100 alt 5 SS events.
Found 1184 RI events
==========

ase: 0.7214617729187012
count: 0.17874360084533691
Processing count files.
Done processing count files.

That is all from the command mentioned in the previous response. It does not seem to give an error despite the issue of only the fromGTF files having data.

EricKutschera commented 11 months ago

The output shows that very few reads were USED. The main issue seems to be that about 50% of the reads were filtered for NOT_EXPECTED_READ_LENGTH. Does the value you used for --readLength match the actual lengths of the reads? You can use --variable-read-length to disable the read length filter.

USED: 62492 (0.007%)
NOT_PAIRED: 113796834 (13%)
NOT_NH_1: 233212017 (26.7%)
NOT_EXPECTED_CIGAR: 14528907 (1.6%)
NOT_EXPECTED_READ_LENGTH: 436922200 (50.1%)
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 12125 (0.001%)
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 1896 (0.0002%)
CLIPPED: 73091965 (8.3%)
total: 871628436
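One way to check the actual read lengths (a rough sketch assuming samtools is available; the BAM path is a placeholder) is to look at the sequence length distribution of the first reads in one of your BAMs:

samtools view /path/to/one_of_your.bam | head -n 100000 \
    | awk '{print length($10)}' | sort -n | uniq -c | sort -rn | head
# first column: number of reads, second column: read length
# compare the most common length(s) to the value passed to --readLength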

yr542 commented 11 months ago

Would this be an issue of the reads not being aligned to the reference?

EricKutschera commented 11 months ago

The NOT_EXPECTED_READ_LENGTH filter is for the length of the read sequence, not about where the read is aligned. The NOT_NH_1 filter, which removed about 26% of the reads, is for reads that were not uniquely mapped to the reference, but some amount of multimapping is expected. The filters EXON_NOT_MATCHED_TO_ANNOTATION and JUNCTION_NOT_MATCHED_TO_ANNOTATION would happen if a read is not aligned to annotated exons, but almost all of the reads were filtered out for other reasons before rMATS even checked them against the annotation.
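If you want to see how much multimapping there is, here is a rough sketch (assuming the aligner wrote NH tags, which STAR does by default; the BAM path is a placeholder):

samtools view /path/to/one_of_your.bam | head -n 100000 \
    | grep -o 'NH:i:[0-9]*' | sort | uniq -c | sort -rn
# NH:i:1 means uniquely mapped; reads with any other NH value are the ones counted as NOT_NH_1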

yr542 commented 11 months ago

I will try it with the FASTQ files I have using the following command(s):

singularity exec "${singularity_image}" python "${RMATS_SCRIPT}" \
    --s1 "$s1_file" \
    --s2 "$s2_file" \
    --gtf "$gtf_file" \
    --bi "$star_binary_index" \
    -t paired \
    --readLength 50 \
    --variable-read-length \
    --nthread 10 \
    --od "$output_directory" \
    --tmp "$tmp_directory"

Where the s1 file contains a comma-separated list of control FASTQs and the s2 file is the same for the experimental samples.
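For reference, my understanding from the rMATS documentation is that each replicate's paired FASTQs are joined with a colon and replicates are separated by commas, so (with hypothetical file names) the s1 file would contain a single line like:

control_rep1_R1.fastq:control_rep1_R2.fastq,control_rep2_R1.fastq:control_rep2_R2.fastq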

yr542 commented 11 months ago

The .out file gives me the following (skipping the STAR .out section):

gtf: 7.691898584365845
There are 32057 distinct gene ID in the gtf file
There are 59290 distinct transcript ID in the gtf file
There are 17879 one-transcript genes in the gtf file
There are 481079 exons in the gtf file
There are 4965 one-exon transcripts in the gtf file
There are 4441 one-transcript genes with only one exon in the transcript
Average number of transcripts per gene is 1.849518
Average number of exons per transcript is 8.113999
Average number of exons per transcript excluding one-exon tx is 8.764179
Average number of gene per geneGroup is 11.659769
statistic: 0.007262706756591797

read outcome totals across all BAMs
USED: 309050751
NOT_PAIRED: 526
NOT_NH_1: 277379010
NOT_EXPECTED_CIGAR: 10614991
NOT_EXPECTED_READ_LENGTH: 0
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 62032656
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 8610036
CLIPPED: 0
total: 667687970
outcomes by BAM written to: /path/to/tmp/2023-10-12-11_10_13_287627_read_outcomes_by_bam.txt

novel: 2162.8882801532745
The splicing graph and candidate read have been saved into /path/to/tmp/2023-10-12-11_10_13_287627_*.rmats
save: 5.496856689453125
loadsg: 0.4511833190917969

==========
Done processing each gene from dictionary to compile AS events
Found 15527 exon skipping events
Found 1122 exon MX events
Found 2779 alt SS events
There are 1663 alt 3 SS events and 1116 alt 5 SS events.
Found 1759 RI events
==========

ase: 0.7611532211303711
count: 15.663371801376343
Processing count files.
Done processing count files.

And that's all. I think it worked? Is this a normal .out file? Why did it work this time?

EricKutschera commented 11 months ago

Yes, it looks like it worked. Since --variable-read-length was used, there are no reads filtered for NOT_EXPECTED_READ_LENGTH. Now you have USED: 309050751.

yr542 commented 11 months ago

I am not sure why, but when I look at the genes that are significant (p-value less than 0.05 and FDR 0.05 or less), the significant genes seem to be different from those I found in DEXSeq, DESeq2, and Cuffdiff runs on the same data. Why is this the case?

EricKutschera commented 11 months ago

rMATS, DEXSeq, DESeq2, and Cuffdiff all have different algorithms. I'm only familiar with the details of rMATS. If you have a specific example of a gene that you think rMATS should or shouldn't find significant, then I could look into it if you can share the input files.

yr542 commented 11 months ago

As this is pre-publication data, I have to check with my supervisor/professor whether she allows it.