Xinglab / rmats-turbo

Other
219 stars 53 forks source link

Single sample (triplicate) analysis #222

Open xll-marker opened 2 years ago

xll-marker commented 2 years ago

Hi, I have some questions about using RMATS

  1. If I only want to test how much AS is contained in a sample (containing 3 biological replicates) instead of differential variable shear, the following code is feasible:

python rmats.py --b1 ./b1.txt --gtf merge. gtf -t paired --readLength 150 --nthread 4 --od ./out --tmp ./tmp --statoff

With the above code running, how can I further filter the results for reliable AS events

  1. I have 10 samples, each containing 3 biological replicates, can I put the BAM file path of all samples into b1.txt file, and then run the above code to obtain the total number of AS events in 10 samples? How can I further filter the obtained AS events once obtained
EricKutschera commented 2 years ago

Yes, you can put all your BAM file paths into b1.txt to get all the AS events

This post has some discussion of event filtering: https://github.com/Xinglab/rmats-turbo/issues/183 In your case there is no PValue or FDR, but you can still filter with the count columns and inclusion levels

xll-marker commented 2 years ago

Thank you very much for your answer, but I still have two questions.

  1. When conducting AS analysis, we need to use the annotation file of reference genome (GTF file), so can I use Hisat and stringtie to assemble the annotation file of all sample sequencing results, or what annotation file should I use AS a reference

2.When I conducted AS analysis, why did JC and JCEC results of various types of AS events differ very little from each other? EventType TotalEventsJC TotalEventsJCEC SignificantEventsJC SigEventsJCSample1HigherInclusion SigEventsJCSample2HigherInclusion SignificantEventsJCEC SigEventsJCECSample1HigherInclusion SigEventsJCECSample2HigherInclusion

EventType TotalEventsJC TotalEventsJCEC SignificantEventsJC SigEventsJCSample1HigherInclusion SE 13712 13758 489 244 245 499 248 251 A5SS 4378 4382 288 149 139 288 145 143 A3SS 7545 7555 414 199 215 426 203 223 MXE 846 847 22 9 13 13 5 8 RI 7922 7938 466 229 237 474 237 237

EricKutschera commented 2 years ago

The --gtf argument is used by rmats when it is reading the alignments from the input bam files and also when defining the splicing events in the output. If the exon coordinates in the gtf file don't line up with the coordinates in the bam files then rmats will ignore those reads. The printed output includes a line with JUNCTION_NOT_MATCHED_TO_ANNOTATION to show how many reads were ignored.

rmats uses both the transcripts in the gtf and the reads to define splicing events. If you give rmats a gtf with more transcripts then you may get more events in the output

You could use a reference gtf that is compatible with the reference genome used to align your reads. For example https://www.gencodegenes.org/human/ provides a .gtf and .fasta that use the same coordinates. You could also get a .gtf from your aligned reads using stringtie. In that case the gtf is based on the alignments and rmats should report fewer JUNCTION_NOT_MATCHED_TO_ANNOTATION. I think the stringtie gtf will include some "novel" transcripts that are specific to your reads and that may lead to some additional novel events in the output

For JC and JCEC results, this post has some details: https://github.com/Xinglab/rmats-turbo/issues/45 The JCEC file considers some exon reads that are not considered in the JC file. The additional counts can change the result

xll-marker commented 2 years ago

Thank you very much for solving my problem. However, when I got the results, I found that there were some problems in the results: Here is the code I executed: python rmats.py --b1 ./b1.txt --gtf merge.gtf -t paired --readLength 150 --nthread 4 --od ./out --tmp ./tmp --statoff --novelSS --allow-clipping --variable-read-length

  1. I found that the values of IJC_SAMPLE_1 and SJC_SAMPLE_1 for about 200 transcripts did not show three replicates (the input data had three biological replicates), but were separated by four commas (66,223,112,334), and when I selected this value, the commas disappeared, The values in the table are merged (66223112334). I would like to know why this problem occurs and whether these values can continue to be used?

2, when I use the above code it AS the event, the single sample analysis found that A3ss type AS the largest events, my sample is plants, not animals, according to previous studies generally, in most plants AS event types should be IR,I want to know why there are different results when I use RMATS to analyze plants. If I want to continue to use RMATS to analyze plant AS events, what parameters do I need to change, or do I need to use different screening criteria to filter different types of AS events? How do I use rmats to accurately analyze and screen AS events in plants.

EricKutschera commented 2 years ago

I'm not sure why the count columns had an extra entry. Can you double check your b1.txt?

when I selected this value, the commas disappeared

^That sounds like an issue with the program you are using to view the results. If you re-run with new --od and --tmp directories and then view the files from the command line (with less or cat) do the files look correct?

See this post about RI events: https://github.com/Xinglab/rmats-turbo/issues/17 rmats requires most of the information for RI events to come from the gtf. It might be that your merge.gtf does not have the necessary information

xll-marker commented 2 years ago

Hello! I still have some questions to ask you

So I've got the variable splicing information of the gene, which contains the information about the exon that was spliced, so how do I use the exon information to determine which one is the original transcript, how do I get the ID of the gene that was and was not spliced

EricKutschera commented 2 years ago

The rmats output includes a GeneID column and also other columns with the coordinates for the exons. The columns are described in the README: https://github.com/Xinglab/rmats-turbo/tree/v4.1.2#output

I'm not sure exactly what you're looking for, but maybe searching those coordinates with the UCSC genome browser will show the genes/transcripts that you are interested in: https://genome.ucsc.edu/cgi-bin/hgGateway

xll-marker commented 1 year ago

Hi, I have some questions about the A5SS and A3SS results:

My A5SS output contains the following two columns:

GeneID longExonStart_0base longExonEnd shortES shortEE flankingES flankingEE MSTRG.104085 153810526 153810784 153810542 153810784 153608055 153608219

MSTRG.118818 73686 73926 73686 73872 75567 75692

As above, the position of longExonEnd and shortEE in the first column is the same, and the position of longExonStart_0base and shortES in the second column is the same. Why does A5SS contain these two cases, and the type result of A3SS is the same as A5SS, which also contains the above two kinds

EricKutschera commented 1 year ago

The A5SS and A3SS events are shown in the diagram at: https://github.com/Xinglab/rmats-turbo/tree/v4.1.2#output

In the diagram for A5SS, the short exon is the black rectangle on the left and the long exon is the combination of both the black and white rectangles on the left. That diagram shows why longExonStart_0base and shortES would be the same. Similarly the A3SS diagram shows why longExonEnd and shortEE would be the same

If I understand your question correctly, you are asking why the A5SS file would have both of those cases since the diagram only shows one of the cases. The reason that both cases are in both the A5SS file and A3SS file is that the diagram is only showing how things look for the "+" strand. For "-" strand events, the diagram needs to be flipped and so the A5SS event ends up looking like the A3SS event. The output file combines both "+" and "-" strand events in the same file, but it always reports the numerically lower coordinate as the start coordinate even though the lower coordinate might not be conceptually first in terms of transcription. You can use the strand column in the output to determine how to interpret the coordinates

xll-marker commented 1 year ago

hello!

Can I summarize the A5SS file and the A3SS file, and then reclassify the A5SS and A3SS events by "+" and "-" chains, for example, define the "-" chain as an A3SS event, if not, how can I convert the "-" chain location information in the A5SS event result into a "+" chain.

EricKutschera commented 1 year ago

When processing the output files you can group the + strand A5SS events with the - strand A3SS events if that makes things easier. But to be clear all the events in the A5SS file are actually 5' splice site events which are different than 3' splice site events in terms of how the RNA was transcribed and spliced. It's just that in terms of coordinates the A5SS + strand events look the same as the A3SS - strand events in the output file

xll-marker commented 1 year ago

Hello! I want to use the se_to_isoforms.py script to extract the subtype where AS occurs, but the script is not running, could you please show me how to use the full command of the script correctly

EricKutschera commented 1 year ago

You can run python se_to_isoforms.py --help to see what the arguments are and what column value to use for each argument

I ran based on an output SE.MATS.JC.txt with this row:

ID  GeneID  geneSymbol  chr strand  exonStart_0base exonEnd upstreamES  upstreamEE  downstreamES    downstreamEE
80  "ENSG00000214827.11"    "MTCP1" chrX    -   155065905   155066057   155065618   155065789   155070693   155071136

python se_to_isoforms.py --gtf ./gencode.v41.annotation.gtf --chr chrX --strand - --exon-start 155065905 --exon-end 155066057 --upstream-start 155065618 --upstream-end 155065789 --downstream-start 155070693 --downstream-end 155071136

Here's the output I got from that script:

transcript_id: "ENST00000362018.2", gene_id: "ENSG00000214827.11", gene_name: "MTCP1", uses_up_junction: True, uses_down_junction: False, uses_skip_junction: False, exons: [(155065619, 155065789), (155065906, 155066057)]
transcript_id: "ENST00000369476.8", gene_id: "ENSG00000214827.11", gene_name: "MTCP1", uses_up_junction: True, uses_down_junction: True, uses_skip_junction: False, exons: [(155065619, 155065789), (155065906, 155066057), (155070694, 155071136)]
transcript_id: "ENST00000482244.5", gene_id: "ENSG00000214827.11", gene_name: "MTCP1", uses_up_junction: False, uses_down_junction: False, uses_skip_junction: True, exons: [(155065619, 155065789), (155070694, 155071136)]

(The script is from this post https://github.com/Xinglab/rmats-turbo/issues/148)

xll-marker commented 1 year ago

Hello!

I want to count whether the position of all the AS occurs in the CDS or UTR region, and the splice point type of all the AS, to determine whether it is GT-AG or other types, how do I do it?

EricKutschera commented 1 year ago

I don't know of any existing code to do that. I think you could write a script that uses the coordinates from the rmats output files and looks up the other info in reference files. The .gtf and .fasta files at https://www.gencodegenes.org/human/ have coordinates for CDS and UTR regions and also the sequence information