Xinglab / rMATS-long

Other
23 stars 2 forks source link

compare the structures of isoforms within a gene #20

Open Oliverfeudj opened 5 months ago

Oliverfeudj commented 5 months ago

Hello @EricKutschera,

is there a way to do this classify_isoform_differences.py for all isoforms? not doing individually for each isoform so that I have at the end all isoform in a single tsv file, not individual tsv for each isoform

Thank you in advance for your reply

EricKutschera commented 5 months ago

By default rmats_long.py outputs a separate .tsv for each of the most significant isoforms showing the differences of that isoform to the most significant isoform in that gene with a delta proportion in the opposite direction

If you run rmats_long.py with --compare-all-within-gene then when it calls classify_isoform_differences.py it won't use --second-transcript-id and each .tsv will have the differences of the most significant isoform to all other isoforms in the gene

I added a branch which changes classify_isoform_differences.py to do a pairwise comparison of isoforms in a gene when run without --second-transcript-id: https://github.com/Xinglab/rMATS-long/commit/bc2bdfb59972ad7090047024b006d9c7128b7b44

Doing a pairwise comparison can take a long time for genes with many isoforms

kwonej0617 commented 4 months ago

Hi @EricKutschera

I have used --compare-all-within-gene when running rmats_long.py and I have a question regarding the output. Here's the command line I used:

python ./rMATS-long/scripts/rmats_long.py --abundance espresso_data/HEK293T-all/HEK293T-all_N2_R0_abundance.esp --updated-gtf espresso_data/HEK293T-all/HEK293T-all_N2_R0_updated.gtf --gencode-gtf ${gtf} --group-1 rMATS-long_data/input/HEK293T-all_group1.txt --group-2 rMATS-long_data/input/HEK293T-all_group2.txt --group-1-name WT --group-2-name KO --out-dir ${output_path} --plot-file-type .png --compare-all-within-gene

In the differential_transcript.tsv, the lines containing ENSG00000004455.17 (just randomly picked) were 10 lines as follows:

grep ENSG00000004455.17 rMATS-long_data/differential_transcripts.tsv 
ENSG00000004455.17  ENST00000354858.11  0.281545888814435   1   0.595689772079638   1   0.0064  0.0129  0.0081  0.0066  0.0073  0.0047  0.0091  0.0062  0.0029
ENSG00000004455.17  ENST00000373449.7   56.7319289594579    1   4.99459681293473e-14    2.34608698795577e-10    0.3385  0.354   0.4015  0.529   0.6199  0.6675  0.3647  0.6055  -0.2408
ENSG00000004455.17  ENST00000467905.5   1.41173459743368    1   0.234768653534498   1   0.0128  0.0102  0.0079  0.0 0.0036  0.0093  0.0103  0.0043  0.006
ENSG00000004455.17  ENST00000480134.5   0.462403563942644   1   0.496502801619222   1   0.1079  0.0688  0.0312  0.0548  0.0578  0.0794  0.0693  0.064   0.0053
ENSG00000004455.17  ENST00000487289.1   12.9953227365513    1   0.000312270022764174    0.0371344396057979  0.0222  0.0203  0.0391  0.0032  0.0 0.0 0.0272  0.0011  0.0261
ENSG00000004455.17  ENST00000548033.5   4.71189004554435    1   0.0299547012700919  0.555825341913299   0.0 0.0025  0.0 0.0064  0.0108  0.0187  0.0008  0.01-0.0111
ENSG00000004455.17  ENST00000550338.5   1.87958598265686    1   0.170381210967297   1   0.019   0.0228  0.0469  0.0128  0.0144  0.014   0.0296  0.0137  0.0159
ENSG00000004455.17  ENST00000629371.2   3.9830445464795 1   0.0459604191091984  0.679867100930521   0.0 0.0 0.0 0.0032  0.0109  0.0094  0.0 0.0079  -0.0079
ENSG00000004455.17  ENST00000672715.1   49.5163324181758    1   1.96727117889871e-12    7.39261163606559e-09    0.4766  0.4792  0.4491  0.3296  0.2418  0.1822  0.4683  0.2512  0.2171
ENSG00000004455.17  ESPRESSO:chr1:376:3 1.94993508482912    1   0.162593846125627   1   0.0165  0.0241  0.0163  0.0544  0.0335  0.0147  0.019   0.0342  -0.0152

I found the structure figure ENSG00000004455.17_structure.png only includes 5 transcripts. I wonder why it only has 5 transcripts, not all transcripts shown in differential_transcript.tsv? image

In addition, I have another question. Could you please explain How the number next to the event type (for example, exon skipping 75) are calculated? image

Thank you so much for your help.

EricKutschera commented 4 months ago

visualize_isoforms.py has a parameter --max-transcripts which defaults to 5: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/visualize_isoforms.py#L82

If you want to plot more than 5 then you'll need to add more colors here: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/visualize_isoforms.py#L17

Here's the code to get the number for each event type: https://github.com/Xinglab/rMATS-long/blob/592cb3268d16aa6bea3a6b79aedceac128563e3b/scripts/rmats_long.py#L517

It looks at the isoform_differences files and it checks each pair of transcripts to see what splicing events were found between those two transcripts. If there is only 1 event for a pair of transcripts then the count for that event type will increase. If there are multiple events for a pair then it will count as combinatorial

kwonej0617 commented 4 months ago

Thank you @EricKutschera!

Also, I have a question. Based on the summary figure, I have 20 differential isoform pairs that has SE event. image However, the number of 'isoform_differences' files are greater than 20. There are 29 files. I wonder why the numbers are different, in other word, what is the logic to keep those 20 SE event cases? Also, is there a way to know what those 20 cases are?

Looking forward to hearing from you. Thank you!

EricKutschera commented 4 months ago

If the plot shows 20 SE events then after looking at each of the isoform differences files and adding up all the pairs of transcripts that only had a single SE difference the count should be 20. If you ran with --compare-all-within-gene then each file can have multiple pairs of transcripts. The number of isoform differences files isn't expected to match the number of events

kwonej0617 commented 3 months ago

Hi @EricKutschera

I have a question regarding the number of total genes with significant isoform. Based on my summary.txt file, it appears the total genes with significant isoforms are 82, which is much lower than I expected. Considering there are more than 20,000 genes are expressed in the cell and 95% of them undergoes alternative splicing, I thought the number should be higher than 82. I wonder this number is reasonable or do you think I should change the cutoff (p-value or delta threshold)?

## significant differential transcript usage
total significant isoforms: 117
total genes with significant isoforms: 82
adjusted pvalue threshold: 0.05
delta isoform proportion threshold: 0.1
## alternative splicing classifications between isoform pairs
total classified isoform pairs: 1062
exon skipping: 78
alternative 5'-splice site: 35
alternative 3'-splice site: 29
mutually exclusive exons: 4
intron retention: 49
alternative first exon: 30
alternative last exon: 12
complex: 346
combinatorial: 479

Looking forward to hearing from you. Thank you!

EricKutschera commented 3 months ago

The number of significant genes depends on the data and 82 could be reasonable for your data. rMATS-long is looking for significant splicing differences between the two groups. There might be 20k genes with alternative splicing in your data, but that splicing might not be significantly different between your two groups. It could be that the isoform proportions in the two groups are similar, or it could be that there aren't enough reads for the difference to be reported as significant