Xinglab / rmats-turbo

Other
228 stars 55 forks source link

No Splicing Events written in any output file,although in stdout I can see many that were counted. #89

Closed mariakondili closed 3 years ago

mariakondili commented 3 years ago

Hello, I find it really weird to obtain empty files in results folder (A3SS.MATS.JC/JCEC, etc.) while no Error is poping up, and I am seeing in my slurm.out the following :

gtf: 25.90836787223816
There are 57820 distinct gene ID in the gtf file
There are 196520 distinct transcript ID in the gtf file
There are 35613 one-transcript genes in the gtf file
There are 1196293 exons in the gtf file
There are 24864 one-exon transcripts in the gtf file
There are 21893 one-transcript genes with only one exon in the transcript
Average number of transcripts per gene is 3.398824
Average number of exons per transcript is 6.087386
Average number of exons per transcript excluding one-exon tx is 6.824282
Average number of gene per geneGroup is 7.482150
statistic: 0.03287625312805176
novel: 1314.3629331588745
The splicing graph and candidate read have been saved into /shared/projects/mbnl_dct/human_cellline/rMATS_SplicingCounts//Ctrl_vs_DM1_Splicing//tmp//2021-02-17-18:19:13_583532.rmats
save: 0.30976057052612305
loadsg: 0.05762648582458496

Done processing each gene from dictionary to compile AS events
Found 38132 exon skipping events
Found 2040 exon MX events
Found 12653 alt SS events
There are 7727 alt 3 SS events and 4926 alt 5 SS events.
Found 5558 RI events

ase: 2.775017023086548
count: 0.6100132465362549
Processing count files.
Done processing count files.

Only the " fromGTF." files per splicing type are not empty. The others have only headers without any gene record.

Input are BAM files from STAR alignment, gtf is given same as one for STAR, there are 3 replicates per condition, and I run in cluster with 16 cpus-per-task.

My command :

group1="control_bam.txt"
group2="treatment_bam.txt"
gtf="gencode.v19.annotation.gtf"
cores=16

python ${rmats_dir}/rmats.py  --b1 ${group1}  --b2  ${group2} \
--gtf ${gtf} --readLength 100 \
--od ${outDir} --tmp  ${outDir}/tmp  \
-t "paired"  --libType fr-firststrand   \
--nthread ${cores} --tstat ${cores}  \
--task both

Could sb give me any hints on what may be happening and how to find out ?

EricKutschera commented 3 years ago

If you are able to use rmats version 4.1.1 then the output will contain a section that starts with "read outcome totals across all BAMs". That section shows how many of the input reads were not able to be used along with the reason why. That output should help identify the reason for the empty output files

mariakondili commented 3 years ago

Thanks Eric ! I launched 4.1.1, and found the following information in stdout:

read outcome totals across all BAMs
USED: 355595054
NOT_PAIRED: 0

NOT_NH_1: 200445609
NOT_EXPECTED_CIGAR: 8808873
NOT_EXPECTED_READ_LENGTH: 0

NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 102232863
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 779927
CLIPPED: 160580484
total: 828442810

I have results of Splicing events now, because I corrected readLength =101 from =100(!!!), when I saw NOT_EXPECTED_READ_LENGTH: 458607844 !

Can you tell me though, what the other parameters mean, and if there is something else bad in my bams that I can take care of ? _NOT_NH_1 NOT_EXPECTED_CIGAR EXON_NOT_MATCHED_TO_ANNOTATION JUNCTION_NOT_MATCHED_TOANNOTATION

EricKutschera commented 3 years ago

NOT_NH_1 This means that the alignment did not have the tag NH=1, but rmats requires reads to be uniquely mapped. The NH tag is described in this document: https://samtools.github.io/hts-specs/SAMtags.pdf

NH i Number of reported alignments that contain the query in the current record

And from the STAR manual: https://raw.githubusercontent.com/alexdobin/STAR/master/doc/STARmanual.pdf

The number of loci Nmap a read maps to is given by NH:i:Nmap field

NOT_EXPECTED_CIGAR This means that the cigar in the alignment did not match one of the expected patterns. The expected cigars are either "M" for an exon read, or ("MNM", "MNMNM", ...) for junction reads. Cigar operations are described in: https://samtools.github.io/hts-specs/SAMv1.pdf. It could be that the alignments have insertion or deletion cigar operations (I, D)

EXON_NOT_MATCHED_TO_ANNOTATION This means that the cigar matched the expected exon read pattern "M", but the aligned region was not part of an annotated exon from the input gtf

JUNCTION_NOT_MATCHED_TO_ANNOTATION This means that the cigar matched the expected junction read pattern ("MNM", "MNMNM", ...), but the aligned junctions did not match up with annotated exons from the input gtf

mariakondili commented 3 years ago

Thanks so much for the explanations. I ll see about the non-annotated exons,if we are interested. Very nice improvement of the latest version of rMATS anyway ! Also, I d like to say that the developers of our cluster asked me if it's possible to pass also the turbo version by bioconda/conda env ,to allow an easier installation in the future.

EricKutschera commented 3 years ago

the developers of our cluster asked me if it's possible to pass also the turbo version by bioconda/conda env ,to allow an easier installation in the future.

rMATS turbo v4.1.1 is available on bioconda: https://anaconda.org/bioconda/rmats

barbarainb commented 1 year ago

Hi there I have a similar problem but i dont really know how to fix it.

I am using a non-model species (Gerbillus gerbillus) mapped to Meriones unguiculatus in HISAT2, the reads are paired-end and the length of them is 150 bp. I am using .sorted.bam files and the following command to run rmats:

$ rmats.py --b1 /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/b1_C_gg_hippocampus.txt --b2 /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/b2_H_gg_hippocampus.txt --gtf /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/m_unguiculatus.gtf -t paired --readLength 150 --variable-read-length --nthread 2 --od /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/rmats/hippocampus --tmp /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/rmats/hippocampus

the result files are all empty even though in the end says "done" and describes several splicing events: image

in my "read_outcomes_by_bam" file i have this:

/group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/DG03HpA.sorted.bam USED: 0 NOT_PAIRED: 6888988 NOT_NH_1: 3368169 NOT_EXPECTED_CIGAR: 1121087 NOT_EXPECTED_READ_LENGTH: 0 NOT_EXPECTED_STRAND: 0 EXON_NOT_MATCHED_TO_ANNOTATION: 8043308 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 12764646 CLIPPED: 9106837 TOTAL_FOR_BAM: 41293035 /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/DG04HpA.sorted.bam USED: 0 NOT_PAIRED: 5903915 NOT_NH_1: 3243639 NOT_EXPECTED_CIGAR: 1065826 NOT_EXPECTED_READ_LENGTH: 0 NOT_EXPECTED_STRAND: 0 EXON_NOT_MATCHED_TO_ANNOTATION: 7497960 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 11515982 CLIPPED: 8710206 TOTAL_FOR_BAM: 37937528 /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/DG05HpA.sorted.bam

(...)

So i dont really know whats the problem here, i already went to the FASTQC files even, to confirm the length and it is correct (...), could someone maybe help me here? I am not very experienced with this program I would really appreciate some help Thanks in advance

EricKutschera commented 1 year ago

From the read outcomes about 20% of the reads are CLIPPED. You can use those reads by passing --allow-clipping to rmats

About 15% of the reads are NOT_PAIRED which means the aligner did not mark them as properly paired

The main issue seems to be EXON_NOT_MATCHED_TO_ANNOTATION and JUNCTION_NOT_MATCHED_TO_ANNOTATION which together are about 50% of the reads. Those outcomes happen when the read passes the other filters but rmats is unable to match up the alignment to the annotated exons and splice sites. That could happen if the --gtf given to rmats doesn't agree with the reference files used to align the reads. Another potential cause of NOT_MATCHED_TO_ANNOTATION is that the reads just don't align well with the reference. Since you are using a non-model species maybe the reference files don't match the actual exons in your data

barbarainb commented 1 year ago

Hi EricKutschera, thanks for your reply, yesterday I was ging trough the documentation and I added the --allow-clipping to the code, nothing changed, it continues to give :

Done processing each gene from dictionary to compile AS events Found 7053 exon skipping events Found 443 exon MX events Found 830 alt SS events There are 477 alt 3 SS events and 353 alt 5 SS events. Found 1131 RI events

but when I check the result files, I only have headers again.

The thing is I mapped my paired-end G. gerbillus RNAseq samples to unguiculatus with HISAT2, and even though the alignment was like of 50% something almost all the proteins in the used gff at the time were mapped (23000 something).

the thing is, we actually know there is splicing events here, because I have in parallel a colaborator that used star and got a higher mapping and found splicing events in the exact same samples with the exact same species...

Thanks why I think its very strange because if the results are shown and say that there are splicing evens, shouldn´t they be saved in the output text files ?

Thanks again for your kind reply, help is much appreciated kind regards :)

EricKutschera commented 1 year ago

rmats can detect events using information from the GTF and from the bams. There are different files that report the events detected based on different information as described in the README: https://github.com/Xinglab/rmats-turbo/tree/v4.1.2#output

fromGTF.[AS_Event].txt: All identified alternative splicing (AS) events derived from GTF and RNA fromGTF.novelJunction.[AS_Event].txt: Alternative splicing (AS) events which were identified only after considering the RNA (as opposed to analyzing the GTF in isolation). This does not include events with an unannotated splice site fromGTF.novelSpliceSite.[AS_Event].txt: This file contains only those events which include an unannotated splice site. Only relevant if --novelSS is enabled

If the output has USED: 0 then the fromGTF novel files should be empty and all of the events were detected using the gtf. The final output files like [AS_Event].MATS.JC.txt are filtered unless rmats is run with --statoff. The filter removes events without supporting reads for each sample group and each isoform: https://github.com/Xinglab/rmats-turbo/blob/v4.1.2/rmats.py#L335 The MATS output files would be filtered to just the headers if there are no used reads

Portulaca666 commented 12 months ago

Hi EricKutschera, thanks for your reply, yesterday I was ging trough the documentation and I added the --allow-clipping to the code, nothing changed, it continues to give :

Done processing each gene from dictionary to compile AS events Found 7053 exon skipping events Found 443 exon MX events Found 830 alt SS events There are 477 alt 3 SS events and 353 alt 5 SS events. Found 1131 RI events

but when I check the result files, I only have headers again.

The thing is I mapped my paired-end G. gerbillus RNAseq samples to unguiculatus with HISAT2, and even though the alignment was like of 50% something almost all the proteins in the used gff at the time were mapped (23000 something).

the thing is, we actually know there is splicing events here, because I have in parallel a colaborator that used star and got a higher mapping and found splicing events in the exact same samples with the exact same species...

Thanks why I think its very strange because if the results are shown and say that there are splicing evens, shouldn´t they be saved in the output text files ?

Thanks again for your kind reply, help is much appreciated kind regards :)

Hi . I also used HISAT2 and the same with you !

Portulaca666 commented 12 months ago

Can you tell me what should I can do to deal with this matter ? image I only have the header in not FromGTF files .

Portulaca666 commented 12 months ago

gtf: 175.58477663993835 There are 59251 distinct gene ID in the gtf file There are 207289 distinct transcript ID in the gtf file There are 35878 one-transcript genes in the gtf file There are 2214932 exons in the gtf file There are 23283 one-exon transcripts in the gtf file There are 18592 one-transcript genes with only one exon in the transcript Average number of transcripts per gene is 3.498489 Average number of exons per transcript is 10.685237 Average number of exons per transcript excluding one-exon tx is 11.910747 Average number of gene per geneGroup is 1144.576891 statistic: 0.06993341445922852

read outcome totals across all BAMs USED: 2640635110 NOT_PAIRED: 243814325 NOT_NH_1: 642401120 NOT_EXPECTED_CIGAR: 28491631 NOT_EXPECTED_READ_LENGTH: 0 NOT_EXPECTED_STRAND: 0 EXON_NOT_MATCHED_TO_ANNOTATION: 172196967 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 31942217 CLIPPED: 235498435 total: 3994979805

EricKutschera commented 12 months ago

The output shows a lot of used reads USED: 2640635110. If you only have header lines in the output files like SE.MATS.JC.txt then it might be due to certain samples that don't have supporting read counts. You could try looking at the counts after re-running with --statoff. Here's a related post: https://groups.google.com/g/rmats-user-group/c/6X2kIyoQBZI/m/byMMhxLRAwAJ

tud03125 commented 2 weeks ago

If you are able to use rmats version 4.1.1 then the output will contain a section that starts with "read outcome totals across all BAMs". That section shows how many of the input reads were not able to be used along with the reason why. That output should help identify the reason for the empty output files

I have rMATS verison 4.3.0, and I am also getting similar issues, where rMATS seem to show results and not show any errors, yet the output files, esp. JC and JCEC ones, are empty. Does it have to be version 4.1.1 specifically?

EricKutschera commented 2 weeks ago

4.3.0 also shows the read outcome totals. They are printed to stdout and also written to the --tmp directory in files like *_read_outcomes_by_bam.txt

tud03125 commented 2 weeks ago

@EricKutschera They do produce those. But still, when I ran rMATS through conda run rmats.py, using my USCS-formatted BAM files, only the "fromGTF.[AS event].txt" file produced information, while the rest: "fromGTF.novelSpliceSite.[AS event].txt," "JC.raw.input.[AS event}.txt," "JCEC.raw.input.[AS event].txt," "[AS event].MATS.JC.txt" and "[AS event].MATS.JCEC.txt" files only produced the header while the rest is blank. I was asking if I should lower the rMATS version to 4.1.1, or is 4.3.0 shouldn't be making any of these issues?

I'll attach some files here to show you what I meant (all the rMATS files as .txt, and RI-focused results just to not overload them, and since that one's specific to our research): 2024-10-30-18_52_07_654130_read_outcomes_by_bam.txt 2024-10-30-18_52_07_654130_0.txt 2024-10-30-18_52_07_654130_1.txt 2024-10-30-18_52_07_654130_2.txt 2024-10-30-18_52_07_654130_3.txt 2024-10-30-18_52_07_654130_4.txt 2024-10-30-18_52_07_654130_5.txt 2024-10-30-18_52_07_654130_6.txt 2024-10-30-18_52_07_654130_7.txt 2024-10-30-18_52_07_654130_8.txt 2024-10-30-18_52_07_654130_9.txt summary.txt fromGTF.RI.txt fromGTF.novelSpliceSite.RI.txt fromGTF.novelJunction.RI.txt JC.raw.input.RI.txt JCEC.raw.input.RI.txt RI.MATS.JC.txt RI.MATS.JCEC.txt

EricKutschera commented 2 weeks ago

From the read outcomes file, only about 2,000 reads were USED out of about 800 million. About 65% were filtered for NOT_EXPECTED_READ_LENGTH and about 10% were filtered for CLIPPED

If your reads are a fixed length you can double check the value of --readLength. You could also use --variable-read-length and --allow-clipping