Closed mariakondili closed 3 years ago
If you are able to use rmats version 4.1.1 then the output will contain a section that starts with "read outcome totals across all BAMs". That section shows how many of the input reads were not able to be used along with the reason why. That output should help identify the reason for the empty output files
Thanks Eric ! I launched 4.1.1, and found the following information in stdout:
read outcome totals across all BAMs
USED: 355595054
NOT_PAIRED: 0
NOT_NH_1: 200445609
NOT_EXPECTED_CIGAR: 8808873
NOT_EXPECTED_READ_LENGTH: 0
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 102232863
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 779927
CLIPPED: 160580484
total: 828442810
I have results of Splicing events now, because I corrected readLength =101 from =100(!!!), when I saw NOT_EXPECTED_READ_LENGTH: 458607844 !
Can you tell me though, what the other parameters mean, and if there is something else bad in my bams that I can take care of ? _NOT_NH_1 NOT_EXPECTED_CIGAR EXON_NOT_MATCHED_TO_ANNOTATION JUNCTION_NOT_MATCHED_TOANNOTATION
NOT_NH_1 This means that the alignment did not have the tag NH=1, but rmats requires reads to be uniquely mapped. The NH tag is described in this document: https://samtools.github.io/hts-specs/SAMtags.pdf
NH i Number of reported alignments that contain the query in the current record
And from the STAR manual: https://raw.githubusercontent.com/alexdobin/STAR/master/doc/STARmanual.pdf
The number of loci Nmap a read maps to is given by NH:i:Nmap field
NOT_EXPECTED_CIGAR This means that the cigar in the alignment did not match one of the expected patterns. The expected cigars are either "M" for an exon read, or ("MNM", "MNMNM", ...) for junction reads. Cigar operations are described in: https://samtools.github.io/hts-specs/SAMv1.pdf. It could be that the alignments have insertion or deletion cigar operations (I, D)
EXON_NOT_MATCHED_TO_ANNOTATION This means that the cigar matched the expected exon read pattern "M", but the aligned region was not part of an annotated exon from the input gtf
JUNCTION_NOT_MATCHED_TO_ANNOTATION This means that the cigar matched the expected junction read pattern ("MNM", "MNMNM", ...), but the aligned junctions did not match up with annotated exons from the input gtf
Thanks so much for the explanations. I ll see about the non-annotated exons,if we are interested. Very nice improvement of the latest version of rMATS anyway ! Also, I d like to say that the developers of our cluster asked me if it's possible to pass also the turbo version by bioconda/conda env ,to allow an easier installation in the future.
the developers of our cluster asked me if it's possible to pass also the turbo version by bioconda/conda env ,to allow an easier installation in the future.
rMATS turbo v4.1.1 is available on bioconda: https://anaconda.org/bioconda/rmats
Hi there I have a similar problem but i dont really know how to fix it.
I am using a non-model species (Gerbillus gerbillus) mapped to Meriones unguiculatus in HISAT2, the reads are paired-end and the length of them is 150 bp. I am using .sorted.bam files and the following command to run rmats:
$ rmats.py --b1 /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/b1_C_gg_hippocampus.txt --b2 /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/b2_H_gg_hippocampus.txt --gtf /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/m_unguiculatus.gtf -t paired --readLength 150 --variable-read-length --nthread 2 --od /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/rmats/hippocampus --tmp /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/rmats/hippocampus
the result files are all empty even though in the end says "done" and describes several splicing events:
in my "read_outcomes_by_bam" file i have this:
/group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/DG03HpA.sorted.bam USED: 0 NOT_PAIRED: 6888988 NOT_NH_1: 3368169 NOT_EXPECTED_CIGAR: 1121087 NOT_EXPECTED_READ_LENGTH: 0 NOT_EXPECTED_STRAND: 0 EXON_NOT_MATCHED_TO_ANNOTATION: 8043308 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 12764646 CLIPPED: 9106837 TOTAL_FOR_BAM: 41293035 /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/DG04HpA.sorted.bam USED: 0 NOT_PAIRED: 5903915 NOT_NH_1: 3243639 NOT_EXPECTED_CIGAR: 1065826 NOT_EXPECTED_READ_LENGTH: 0 NOT_EXPECTED_STRAND: 0 EXON_NOT_MATCHED_TO_ANNOTATION: 7497960 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 11515982 CLIPPED: 8710206 TOTAL_FOR_BAM: 37937528 /group/ag_nowick/data/barbara_data/Gerbillus_RNAseq_data/gg_hippocampus/DG05HpA.sorted.bam
(...)
So i dont really know whats the problem here, i already went to the FASTQC files even, to confirm the length and it is correct (...), could someone maybe help me here? I am not very experienced with this program I would really appreciate some help Thanks in advance
From the read outcomes about 20% of the reads are CLIPPED. You can use those reads by passing --allow-clipping to rmats
About 15% of the reads are NOT_PAIRED which means the aligner did not mark them as properly paired
The main issue seems to be EXON_NOT_MATCHED_TO_ANNOTATION and JUNCTION_NOT_MATCHED_TO_ANNOTATION which together are about 50% of the reads. Those outcomes happen when the read passes the other filters but rmats is unable to match up the alignment to the annotated exons and splice sites. That could happen if the --gtf given to rmats doesn't agree with the reference files used to align the reads. Another potential cause of NOT_MATCHED_TO_ANNOTATION is that the reads just don't align well with the reference. Since you are using a non-model species maybe the reference files don't match the actual exons in your data
Hi EricKutschera, thanks for your reply, yesterday I was ging trough the documentation and I added the --allow-clipping to the code, nothing changed, it continues to give :
Done processing each gene from dictionary to compile AS events Found 7053 exon skipping events Found 443 exon MX events Found 830 alt SS events There are 477 alt 3 SS events and 353 alt 5 SS events. Found 1131 RI events
but when I check the result files, I only have headers again.
The thing is I mapped my paired-end G. gerbillus RNAseq samples to unguiculatus with HISAT2, and even though the alignment was like of 50% something almost all the proteins in the used gff at the time were mapped (23000 something).
the thing is, we actually know there is splicing events here, because I have in parallel a colaborator that used star and got a higher mapping and found splicing events in the exact same samples with the exact same species...
Thanks why I think its very strange because if the results are shown and say that there are splicing evens, shouldn´t they be saved in the output text files ?
Thanks again for your kind reply, help is much appreciated kind regards :)
rmats can detect events using information from the GTF and from the bams. There are different files that report the events detected based on different information as described in the README: https://github.com/Xinglab/rmats-turbo/tree/v4.1.2#output
fromGTF.[AS_Event].txt
: All identified alternative splicing (AS) events derived from GTF and RNA
fromGTF.novelJunction.[AS_Event].txt
: Alternative splicing (AS) events which were identified only after considering the RNA (as opposed to analyzing the GTF in isolation). This does not include events with an unannotated splice site
fromGTF.novelSpliceSite.[AS_Event].txt
: This file contains only those events which include an unannotated splice site. Only relevant if --novelSS
is enabled
If the output has USED: 0
then the fromGTF novel files should be empty and all of the events were detected using the gtf. The final output files like [AS_Event].MATS.JC.txt
are filtered unless rmats is run with --statoff
. The filter removes events without supporting reads for each sample group and each isoform: https://github.com/Xinglab/rmats-turbo/blob/v4.1.2/rmats.py#L335
The MATS output files would be filtered to just the headers if there are no used reads
Hi EricKutschera, thanks for your reply, yesterday I was ging trough the documentation and I added the --allow-clipping to the code, nothing changed, it continues to give :
Done processing each gene from dictionary to compile AS events Found 7053 exon skipping events Found 443 exon MX events Found 830 alt SS events There are 477 alt 3 SS events and 353 alt 5 SS events. Found 1131 RI events
but when I check the result files, I only have headers again.
The thing is I mapped my paired-end G. gerbillus RNAseq samples to unguiculatus with HISAT2, and even though the alignment was like of 50% something almost all the proteins in the used gff at the time were mapped (23000 something).
the thing is, we actually know there is splicing events here, because I have in parallel a colaborator that used star and got a higher mapping and found splicing events in the exact same samples with the exact same species...
Thanks why I think its very strange because if the results are shown and say that there are splicing evens, shouldn´t they be saved in the output text files ?
Thanks again for your kind reply, help is much appreciated kind regards :)
Hi . I also used HISAT2 and the same with you !
Can you tell me what should I can do to deal with this matter ? I only have the header in not FromGTF files .
gtf: 175.58477663993835 There are 59251 distinct gene ID in the gtf file There are 207289 distinct transcript ID in the gtf file There are 35878 one-transcript genes in the gtf file There are 2214932 exons in the gtf file There are 23283 one-exon transcripts in the gtf file There are 18592 one-transcript genes with only one exon in the transcript Average number of transcripts per gene is 3.498489 Average number of exons per transcript is 10.685237 Average number of exons per transcript excluding one-exon tx is 11.910747 Average number of gene per geneGroup is 1144.576891 statistic: 0.06993341445922852
read outcome totals across all BAMs USED: 2640635110 NOT_PAIRED: 243814325 NOT_NH_1: 642401120 NOT_EXPECTED_CIGAR: 28491631 NOT_EXPECTED_READ_LENGTH: 0 NOT_EXPECTED_STRAND: 0 EXON_NOT_MATCHED_TO_ANNOTATION: 172196967 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 31942217 CLIPPED: 235498435 total: 3994979805
The output shows a lot of used reads USED: 2640635110
. If you only have header lines in the output files like SE.MATS.JC.txt then it might be due to certain samples that don't have supporting read counts. You could try looking at the counts after re-running with --statoff
. Here's a related post: https://groups.google.com/g/rmats-user-group/c/6X2kIyoQBZI/m/byMMhxLRAwAJ
If you are able to use rmats version 4.1.1 then the output will contain a section that starts with "read outcome totals across all BAMs". That section shows how many of the input reads were not able to be used along with the reason why. That output should help identify the reason for the empty output files
I have rMATS verison 4.3.0, and I am also getting similar issues, where rMATS seem to show results and not show any errors, yet the output files, esp. JC and JCEC ones, are empty. Does it have to be version 4.1.1 specifically?
4.3.0 also shows the read outcome totals. They are printed to stdout and also written to the --tmp
directory in files like *_read_outcomes_by_bam.txt
@EricKutschera They do produce those. But still, when I ran rMATS through conda run rmats.py, using my USCS-formatted BAM files, only the "fromGTF.[AS event].txt" file produced information, while the rest: "fromGTF.novelSpliceSite.[AS event].txt," "JC.raw.input.[AS event}.txt," "JCEC.raw.input.[AS event].txt," "[AS event].MATS.JC.txt" and "[AS event].MATS.JCEC.txt" files only produced the header while the rest is blank. I was asking if I should lower the rMATS version to 4.1.1, or is 4.3.0 shouldn't be making any of these issues?
I'll attach some files here to show you what I meant (all the rMATS files as .txt, and RI-focused results just to not overload them, and since that one's specific to our research): 2024-10-30-18_52_07_654130_read_outcomes_by_bam.txt 2024-10-30-18_52_07_654130_0.txt 2024-10-30-18_52_07_654130_1.txt 2024-10-30-18_52_07_654130_2.txt 2024-10-30-18_52_07_654130_3.txt 2024-10-30-18_52_07_654130_4.txt 2024-10-30-18_52_07_654130_5.txt 2024-10-30-18_52_07_654130_6.txt 2024-10-30-18_52_07_654130_7.txt 2024-10-30-18_52_07_654130_8.txt 2024-10-30-18_52_07_654130_9.txt summary.txt fromGTF.RI.txt fromGTF.novelSpliceSite.RI.txt fromGTF.novelJunction.RI.txt JC.raw.input.RI.txt JCEC.raw.input.RI.txt RI.MATS.JC.txt RI.MATS.JCEC.txt
From the read outcomes file, only about 2,000 reads were USED out of about 800 million. About 65% were filtered for NOT_EXPECTED_READ_LENGTH
and about 10% were filtered for CLIPPED
If your reads are a fixed length you can double check the value of --readLength
. You could also use --variable-read-length
and --allow-clipping
Hello, I find it really weird to obtain empty files in results folder (A3SS.MATS.JC/JCEC, etc.) while no Error is poping up, and I am seeing in my slurm.out the following :
Only the " fromGTF." files per splicing type are not empty. The others have only headers without any gene record.
Input are BAM files from STAR alignment, gtf is given same as one for STAR, there are 3 replicates per condition, and I run in cluster with 16 cpus-per-task.
My command :
Could sb give me any hints on what may be happening and how to find out ?