Xinglab / rMATS-long

Other
19 stars 2 forks source link

compare the structures of isoforms within a gene #20

Open Oliverfeudj opened 1 month ago

Oliverfeudj commented 1 month ago

Hello @EricKutschera,

is there a way to do this classify_isoform_differences.py for all isoforms? not doing individually for each isoform so that I have at the end all isoform in a single tsv file, not individual tsv for each isoform

Thank you in advance for your reply

EricKutschera commented 1 month ago

By default rmats_long.py outputs a separate .tsv for each of the most significant isoforms showing the differences of that isoform to the most significant isoform in that gene with a delta proportion in the opposite direction

If you run rmats_long.py with --compare-all-within-gene then when it calls classify_isoform_differences.py it won't use --second-transcript-id and each .tsv will have the differences of the most significant isoform to all other isoforms in the gene

I added a branch which changes classify_isoform_differences.py to do a pairwise comparison of isoforms in a gene when run without --second-transcript-id: https://github.com/Xinglab/rMATS-long/commit/bc2bdfb59972ad7090047024b006d9c7128b7b44

Doing a pairwise comparison can take a long time for genes with many isoforms

kwonej0617 commented 1 week ago

Hi @EricKutschera

I have used --compare-all-within-gene when running rmats_long.py and I have a question regarding the output. Here's the command line I used:

python ./rMATS-long/scripts/rmats_long.py --abundance espresso_data/HEK293T-all/HEK293T-all_N2_R0_abundance.esp --updated-gtf espresso_data/HEK293T-all/HEK293T-all_N2_R0_updated.gtf --gencode-gtf ${gtf} --group-1 rMATS-long_data/input/HEK293T-all_group1.txt --group-2 rMATS-long_data/input/HEK293T-all_group2.txt --group-1-name WT --group-2-name KO --out-dir ${output_path} --plot-file-type .png --compare-all-within-gene

In the differential_transcript.tsv, the lines containing ENSG00000004455.17 (just randomly picked) were 10 lines as follows:

grep ENSG00000004455.17 rMATS-long_data/differential_transcripts.tsv 
ENSG00000004455.17  ENST00000354858.11  0.281545888814435   1   0.595689772079638   1   0.0064  0.0129  0.0081  0.0066  0.0073  0.0047  0.0091  0.0062  0.0029
ENSG00000004455.17  ENST00000373449.7   56.7319289594579    1   4.99459681293473e-14    2.34608698795577e-10    0.3385  0.354   0.4015  0.529   0.6199  0.6675  0.3647  0.6055  -0.2408
ENSG00000004455.17  ENST00000467905.5   1.41173459743368    1   0.234768653534498   1   0.0128  0.0102  0.0079  0.0 0.0036  0.0093  0.0103  0.0043  0.006
ENSG00000004455.17  ENST00000480134.5   0.462403563942644   1   0.496502801619222   1   0.1079  0.0688  0.0312  0.0548  0.0578  0.0794  0.0693  0.064   0.0053
ENSG00000004455.17  ENST00000487289.1   12.9953227365513    1   0.000312270022764174    0.0371344396057979  0.0222  0.0203  0.0391  0.0032  0.0 0.0 0.0272  0.0011  0.0261
ENSG00000004455.17  ENST00000548033.5   4.71189004554435    1   0.0299547012700919  0.555825341913299   0.0 0.0025  0.0 0.0064  0.0108  0.0187  0.0008  0.01-0.0111
ENSG00000004455.17  ENST00000550338.5   1.87958598265686    1   0.170381210967297   1   0.019   0.0228  0.0469  0.0128  0.0144  0.014   0.0296  0.0137  0.0159
ENSG00000004455.17  ENST00000629371.2   3.9830445464795 1   0.0459604191091984  0.679867100930521   0.0 0.0 0.0 0.0032  0.0109  0.0094  0.0 0.0079  -0.0079
ENSG00000004455.17  ENST00000672715.1   49.5163324181758    1   1.96727117889871e-12    7.39261163606559e-09    0.4766  0.4792  0.4491  0.3296  0.2418  0.1822  0.4683  0.2512  0.2171
ENSG00000004455.17  ESPRESSO:chr1:376:3 1.94993508482912    1   0.162593846125627   1   0.0165  0.0241  0.0163  0.0544  0.0335  0.0147  0.019   0.0342  -0.0152

I found the structure figure ENSG00000004455.17_structure.png only includes 5 transcripts. I wonder why it only has 5 transcripts, not all transcripts shown in differential_transcript.tsv? image

In addition, I have another question. Could you please explain How the number next to the event type (for example, exon skipping 75) are calculated? image

Thank you so much for your help.

EricKutschera commented 1 week ago

visualize_isoforms.py has a parameter --max-transcripts which defaults to 5: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/visualize_isoforms.py#L82

If you want to plot more than 5 then you'll need to add more colors here: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/visualize_isoforms.py#L17

Here's the code to get the number for each event type: https://github.com/Xinglab/rMATS-long/blob/592cb3268d16aa6bea3a6b79aedceac128563e3b/scripts/rmats_long.py#L517

It looks at the isoform_differences files and it checks each pair of transcripts to see what splicing events were found between those two transcripts. If there is only 1 event for a pair of transcripts then the count for that event type will increase. If there are multiple events for a pair then it will count as combinatorial