Open Oliverfeudj opened 5 months ago
By default rmats_long.py
outputs a separate .tsv for each of the most significant isoforms showing the differences of that isoform to the most significant isoform in that gene with a delta proportion in the opposite direction
If you run rmats_long.py
with --compare-all-within-gene
then when it calls classify_isoform_differences.py
it won't use --second-transcript-id
and each .tsv will have the differences of the most significant isoform to all other isoforms in the gene
I added a branch which changes classify_isoform_differences.py
to do a pairwise comparison of isoforms in a gene when run without --second-transcript-id
: https://github.com/Xinglab/rMATS-long/commit/bc2bdfb59972ad7090047024b006d9c7128b7b44
Doing a pairwise comparison can take a long time for genes with many isoforms
Hi @EricKutschera
I have used --compare-all-within-gene when running rmats_long.py and I have a question regarding the output. Here's the command line I used:
python ./rMATS-long/scripts/rmats_long.py --abundance espresso_data/HEK293T-all/HEK293T-all_N2_R0_abundance.esp --updated-gtf espresso_data/HEK293T-all/HEK293T-all_N2_R0_updated.gtf --gencode-gtf ${gtf} --group-1 rMATS-long_data/input/HEK293T-all_group1.txt --group-2 rMATS-long_data/input/HEK293T-all_group2.txt --group-1-name WT --group-2-name KO --out-dir ${output_path} --plot-file-type .png --compare-all-within-gene
In the differential_transcript.tsv, the lines containing ENSG00000004455.17 (just randomly picked) were 10 lines as follows:
grep ENSG00000004455.17 rMATS-long_data/differential_transcripts.tsv
ENSG00000004455.17 ENST00000354858.11 0.281545888814435 1 0.595689772079638 1 0.0064 0.0129 0.0081 0.0066 0.0073 0.0047 0.0091 0.0062 0.0029
ENSG00000004455.17 ENST00000373449.7 56.7319289594579 1 4.99459681293473e-14 2.34608698795577e-10 0.3385 0.354 0.4015 0.529 0.6199 0.6675 0.3647 0.6055 -0.2408
ENSG00000004455.17 ENST00000467905.5 1.41173459743368 1 0.234768653534498 1 0.0128 0.0102 0.0079 0.0 0.0036 0.0093 0.0103 0.0043 0.006
ENSG00000004455.17 ENST00000480134.5 0.462403563942644 1 0.496502801619222 1 0.1079 0.0688 0.0312 0.0548 0.0578 0.0794 0.0693 0.064 0.0053
ENSG00000004455.17 ENST00000487289.1 12.9953227365513 1 0.000312270022764174 0.0371344396057979 0.0222 0.0203 0.0391 0.0032 0.0 0.0 0.0272 0.0011 0.0261
ENSG00000004455.17 ENST00000548033.5 4.71189004554435 1 0.0299547012700919 0.555825341913299 0.0 0.0025 0.0 0.0064 0.0108 0.0187 0.0008 0.01-0.0111
ENSG00000004455.17 ENST00000550338.5 1.87958598265686 1 0.170381210967297 1 0.019 0.0228 0.0469 0.0128 0.0144 0.014 0.0296 0.0137 0.0159
ENSG00000004455.17 ENST00000629371.2 3.9830445464795 1 0.0459604191091984 0.679867100930521 0.0 0.0 0.0 0.0032 0.0109 0.0094 0.0 0.0079 -0.0079
ENSG00000004455.17 ENST00000672715.1 49.5163324181758 1 1.96727117889871e-12 7.39261163606559e-09 0.4766 0.4792 0.4491 0.3296 0.2418 0.1822 0.4683 0.2512 0.2171
ENSG00000004455.17 ESPRESSO:chr1:376:3 1.94993508482912 1 0.162593846125627 1 0.0165 0.0241 0.0163 0.0544 0.0335 0.0147 0.019 0.0342 -0.0152
I found the structure figure ENSG00000004455.17_structure.png only includes 5 transcripts. I wonder why it only has 5 transcripts, not all transcripts shown in differential_transcript.tsv?
In addition, I have another question. Could you please explain How the number next to the event type (for example, exon skipping 75) are calculated?
Thank you so much for your help.
visualize_isoforms.py has a parameter --max-transcripts which defaults to 5: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/visualize_isoforms.py#L82
If you want to plot more than 5 then you'll need to add more colors here: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/visualize_isoforms.py#L17
Here's the code to get the number for each event type: https://github.com/Xinglab/rMATS-long/blob/592cb3268d16aa6bea3a6b79aedceac128563e3b/scripts/rmats_long.py#L517
It looks at the isoform_differences files and it checks each pair of transcripts to see what splicing events were found between those two transcripts. If there is only 1 event for a pair of transcripts then the count for that event type will increase. If there are multiple events for a pair then it will count as combinatorial
Thank you @EricKutschera!
Also, I have a question. Based on the summary figure, I have 20 differential isoform pairs that has SE event. However, the number of 'isoform_differences' files are greater than 20. There are 29 files. I wonder why the numbers are different, in other word, what is the logic to keep those 20 SE event cases? Also, is there a way to know what those 20 cases are?
Looking forward to hearing from you. Thank you!
If the plot shows 20 SE events then after looking at each of the isoform differences files and adding up all the pairs of transcripts that only had a single SE difference the count should be 20. If you ran with --compare-all-within-gene
then each file can have multiple pairs of transcripts. The number of isoform differences files isn't expected to match the number of events
Hi @EricKutschera
I have a question regarding the number of total genes with significant isoform. Based on my summary.txt file, it appears the total genes with significant isoforms are 82, which is much lower than I expected. Considering there are more than 20,000 genes are expressed in the cell and 95% of them undergoes alternative splicing, I thought the number should be higher than 82. I wonder this number is reasonable or do you think I should change the cutoff (p-value or delta threshold)?
## significant differential transcript usage
total significant isoforms: 117
total genes with significant isoforms: 82
adjusted pvalue threshold: 0.05
delta isoform proportion threshold: 0.1
## alternative splicing classifications between isoform pairs
total classified isoform pairs: 1062
exon skipping: 78
alternative 5'-splice site: 35
alternative 3'-splice site: 29
mutually exclusive exons: 4
intron retention: 49
alternative first exon: 30
alternative last exon: 12
complex: 346
combinatorial: 479
Looking forward to hearing from you. Thank you!
The number of significant genes depends on the data and 82 could be reasonable for your data. rMATS-long is looking for significant splicing differences between the two groups. There might be 20k genes with alternative splicing in your data, but that splicing might not be significantly different between your two groups. It could be that the isoform proportions in the two groups are similar, or it could be that there aren't enough reads for the difference to be reported as significant
Hello @EricKutschera,
is there a way to do this
classify_isoform_differences.py
for all isoforms? not doing individually for each isoform so that I have at the end all isoform in a single tsv file, not individual tsv for each isoformThank you in advance for your reply