Open xll-marker opened 2 years ago
Yes, you can put all your BAM file paths into b1.txt to get all the AS events
This post has some discussion of event filtering: https://github.com/Xinglab/rmats-turbo/issues/183 In your case there is no PValue or FDR, but you can still filter with the count columns and inclusion levels
Thank you very much for your answer, but I still have two questions.
2.When I conducted AS analysis, why did JC and JCEC results of various types of AS events differ very little from each other? EventType TotalEventsJC TotalEventsJCEC SignificantEventsJC SigEventsJCSample1HigherInclusion SigEventsJCSample2HigherInclusion SignificantEventsJCEC SigEventsJCECSample1HigherInclusion SigEventsJCECSample2HigherInclusion
EventType TotalEventsJC TotalEventsJCEC SignificantEventsJC SigEventsJCSample1HigherInclusion SE 13712 13758 489 244 245 499 248 251 A5SS 4378 4382 288 149 139 288 145 143 A3SS 7545 7555 414 199 215 426 203 223 MXE 846 847 22 9 13 13 5 8 RI 7922 7938 466 229 237 474 237 237
The --gtf argument is used by rmats when it is reading the alignments from the input bam files and also when defining the splicing events in the output. If the exon coordinates in the gtf file don't line up with the coordinates in the bam files then rmats will ignore those reads. The printed output includes a line with JUNCTION_NOT_MATCHED_TO_ANNOTATION
to show how many reads were ignored.
rmats uses both the transcripts in the gtf and the reads to define splicing events. If you give rmats a gtf with more transcripts then you may get more events in the output
You could use a reference gtf that is compatible with the reference genome used to align your reads. For example https://www.gencodegenes.org/human/ provides a .gtf and .fasta that use the same coordinates. You could also get a .gtf from your aligned reads using stringtie. In that case the gtf is based on the alignments and rmats should report fewer JUNCTION_NOT_MATCHED_TO_ANNOTATION. I think the stringtie gtf will include some "novel" transcripts that are specific to your reads and that may lead to some additional novel events in the output
For JC and JCEC results, this post has some details: https://github.com/Xinglab/rmats-turbo/issues/45 The JCEC file considers some exon reads that are not considered in the JC file. The additional counts can change the result
Thank you very much for solving my problem. However, when I got the results, I found that there were some problems in the results: Here is the code I executed: python rmats.py --b1 ./b1.txt --gtf merge.gtf -t paired --readLength 150 --nthread 4 --od ./out --tmp ./tmp --statoff --novelSS --allow-clipping --variable-read-length
2, when I use the above code it AS the event, the single sample analysis found that A3ss type AS the largest events, my sample is plants, not animals, according to previous studies generally, in most plants AS event types should be IR,I want to know why there are different results when I use RMATS to analyze plants. If I want to continue to use RMATS to analyze plant AS events, what parameters do I need to change, or do I need to use different screening criteria to filter different types of AS events? How do I use rmats to accurately analyze and screen AS events in plants.
I'm not sure why the count columns had an extra entry. Can you double check your b1.txt?
when I selected this value, the commas disappeared
^That sounds like an issue with the program you are using to view the results. If you re-run with new --od and --tmp directories and then view the files from the command line (with less
or cat
) do the files look correct?
See this post about RI events: https://github.com/Xinglab/rmats-turbo/issues/17 rmats requires most of the information for RI events to come from the gtf. It might be that your merge.gtf does not have the necessary information
Hello! I still have some questions to ask you
So I've got the variable splicing information of the gene, which contains the information about the exon that was spliced, so how do I use the exon information to determine which one is the original transcript, how do I get the ID of the gene that was and was not spliced
The rmats output includes a GeneID
column and also other columns with the coordinates for the exons. The columns are described in the README: https://github.com/Xinglab/rmats-turbo/tree/v4.1.2#output
I'm not sure exactly what you're looking for, but maybe searching those coordinates with the UCSC genome browser will show the genes/transcripts that you are interested in: https://genome.ucsc.edu/cgi-bin/hgGateway
Hi, I have some questions about the A5SS and A3SS results:
My A5SS output contains the following two columns:
GeneID longExonStart_0base longExonEnd shortES shortEE flankingES flankingEE MSTRG.104085 153810526 153810784 153810542 153810784 153608055 153608219
MSTRG.118818 73686 73926 73686 73872 75567 75692
As above, the position of longExonEnd and shortEE in the first column is the same, and the position of longExonStart_0base and shortES in the second column is the same. Why does A5SS contain these two cases, and the type result of A3SS is the same as A5SS, which also contains the above two kinds
The A5SS and A3SS events are shown in the diagram at: https://github.com/Xinglab/rmats-turbo/tree/v4.1.2#output
In the diagram for A5SS, the short exon is the black rectangle on the left and the long exon is the combination of both the black and white rectangles on the left. That diagram shows why longExonStart_0base and shortES would be the same. Similarly the A3SS diagram shows why longExonEnd and shortEE would be the same
If I understand your question correctly, you are asking why the A5SS file would have both of those cases since the diagram only shows one of the cases. The reason that both cases are in both the A5SS file and A3SS file is that the diagram is only showing how things look for the "+" strand. For "-" strand events, the diagram needs to be flipped and so the A5SS event ends up looking like the A3SS event. The output file combines both "+" and "-" strand events in the same file, but it always reports the numerically lower coordinate as the start coordinate even though the lower coordinate might not be conceptually first in terms of transcription. You can use the strand column in the output to determine how to interpret the coordinates
hello!
Can I summarize the A5SS file and the A3SS file, and then reclassify the A5SS and A3SS events by "+" and "-" chains, for example, define the "-" chain as an A3SS event, if not, how can I convert the "-" chain location information in the A5SS event result into a "+" chain.
When processing the output files you can group the + strand A5SS events with the - strand A3SS events if that makes things easier. But to be clear all the events in the A5SS file are actually 5' splice site events which are different than 3' splice site events in terms of how the RNA was transcribed and spliced. It's just that in terms of coordinates the A5SS + strand events look the same as the A3SS - strand events in the output file
Hello! I want to use the se_to_isoforms.py script to extract the subtype where AS occurs, but the script is not running, could you please show me how to use the full command of the script correctly
You can run python se_to_isoforms.py --help
to see what the arguments are and what column value to use for each argument
I ran based on an output SE.MATS.JC.txt
with this row:
ID GeneID geneSymbol chr strand exonStart_0base exonEnd upstreamES upstreamEE downstreamES downstreamEE
80 "ENSG00000214827.11" "MTCP1" chrX - 155065905 155066057 155065618 155065789 155070693 155071136
python se_to_isoforms.py --gtf ./gencode.v41.annotation.gtf --chr chrX --strand - --exon-start 155065905 --exon-end 155066057 --upstream-start 155065618 --upstream-end 155065789 --downstream-start 155070693 --downstream-end 155071136
Here's the output I got from that script:
transcript_id: "ENST00000362018.2", gene_id: "ENSG00000214827.11", gene_name: "MTCP1", uses_up_junction: True, uses_down_junction: False, uses_skip_junction: False, exons: [(155065619, 155065789), (155065906, 155066057)]
transcript_id: "ENST00000369476.8", gene_id: "ENSG00000214827.11", gene_name: "MTCP1", uses_up_junction: True, uses_down_junction: True, uses_skip_junction: False, exons: [(155065619, 155065789), (155065906, 155066057), (155070694, 155071136)]
transcript_id: "ENST00000482244.5", gene_id: "ENSG00000214827.11", gene_name: "MTCP1", uses_up_junction: False, uses_down_junction: False, uses_skip_junction: True, exons: [(155065619, 155065789), (155070694, 155071136)]
(The script is from this post https://github.com/Xinglab/rmats-turbo/issues/148)
Hello!
I want to count whether the position of all the AS occurs in the CDS or UTR region, and the splice point type of all the AS, to determine whether it is GT-AG or other types, how do I do it?
I don't know of any existing code to do that. I think you could write a script that uses the coordinates from the rmats output files and looks up the other info in reference files. The .gtf and .fasta files at https://www.gencodegenes.org/human/ have coordinates for CDS and UTR regions and also the sequence information
Hi, I have some questions about using RMATS
python rmats.py --b1 ./b1.txt --gtf merge. gtf -t paired --readLength 150 --nthread 4 --od ./out --tmp ./tmp --statoff
With the above code running, how can I further filter the results for reliable AS events