Xinglab / rMATS-long

Other
23 stars 2 forks source link

Some doesn't have output in *isoform_differences* file #22

Closed kwonej0617 closed 4 months ago

kwonej0617 commented 4 months ago

Hi, @EricKutschera

I have noticed that some files which has the following format of file name doesn't have any contents. I wonder which case leads to this type of outcome. HSALNG0111608_isoform_differences_HSALNT0386195_to_HSALNT0231493.tsv

transcript1 transcript2 event   coordinates

Here is the detailed information for the gene from 'differential_transcrripts.tsv'.

grep HSALNG0111608 differential_transcripts.tsv | sort -k15,15n
HSALNG0111608   HSALNT0386195   14.4604034812426    1   0.00014313704477147 0.0194884198131242  0.0 0.0 0.0 0.1009  0.1344  0.0797  0.0 0.105   -0.105
HSALNG0111608   ESPRESSO:chr16:6067:0   2.40012700379248    1   0.12132540008651    0.962851812741984   0.0 0.0161  0.0 0.0194  0.0674  0.0595  0.0054  0.0488  -0.0434
HSALNG0111608   HSALNT0386207   -0.208993510489108  1   1   1   0.1373  0.1774  0.1053  0.0486  0.258   0.2261  0.14    0.1776  -0.0375
HSALNG0111608   HSALNT0386228   3.29208470980211    1   0.0696145823224435  0.797847286884918   0.0196  0.0161  0.158   0.0874  0.1236  0.0952  0.0646  0.1021  -0.0375
HSALNG0111608   HSALNT0386211   -1.01629387856269   1   1   1   0.0098  0.0161  0.0 0.0 0.0337  0.0238  0.0086  0.0192  -0.0105
HSALNG0111608   HSALNT0231476   -0.0272103461006736 1   1   1   0.0041  0.0161  0.0 0.0141  0.0115  0.0107  0.0068  0.0121  -0.0053
HSALNG0111608   HSALNT0231489   -0.541935694916447  1   1   1   0.0041  0.0161  0.0 0.0141  0.0115  0.0107  0.0068  0.0121  -0.0053
HSALNG0111608   HSALNT0386199   -0.679174928450266  1   1   1   0.0041  0.0161  0.0 0.0141  0.0115  0.0107  0.0068  0.0121  -0.0053
HSALNG0111608   HSALNT0231478   -0.0298026612058493 1   1   1   0.0041  0.0161  0.0263  0.0141  0.0115  0.0107  0.0155  0.0121  0.0034
HSALNG0111608   HSALNT0231480   -0.0298026613525053 1   1   1   0.0041  0.0161  0.0263  0.0141  0.0115  0.0107  0.0155  0.0121  0.0034
HSALNG0111608   HSALNT0386197   -1.35397186576574   1   1   1   0.0392  0.0 0.0 0.0097  0.0 0.0119  0.0131  0.0072  0.0059
HSALNG0111608   HSALNT0231482   0.243853913495741   1   0.621436367179971   1   0.0785  0.0161  0.0 0.0194  0.0 0.0357  0.0315  0.0184  0.0132
HSALNG0111608   HSALNT0231484   -0.206384296031956  1   1   1   0.1614  0.2646  0.316   0.3027  0.1344  0.1596  0.2473  0.1989  0.0484
HSALNG0111608   HSALNT0231481   0.769342646357472   1   0.380420486774866   1   0.1614  0.2646  0.316   0.2018  0.1344  0.1596  0.2473  0.1653  0.0821
HSALNG0111608   HSALNT0231493   13.0398096447764    1   0.000304939088968898    0.0364936340295326  0.3231  0.1323  0.0 0.1009  0.0 0.0 0.1518  0.0336  0.1182

Thank you. Looking forward to hearing from you.

EricKutschera commented 4 months ago

If the isoform differences file is empty (just the header) then the two transcripts differ only by the endpoints

HSALNT0231493 and HSALNT0386195 have the same sequence of splice junctions. They only differ by the start and end coordinates. From https://ngdc.cncb.ac.cn/lncbook/files/lncRNA_LncBookv2.0_GRCh38.gtf.gz the exon coordinates are

HSALNT0231493: 54919057-54919209, 54920298-54920410, 54923585-54923650, 54925101-54925213, 54928770-54929153

HSALNT0386195: 54917968-54919209, 54920298-54920410, 54923585-54923650, 54925101-54925213, 54928770-54928838

The script that outputs the isoform differences doesn't consider the transcript start or end coordinates: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/FindAltTSEvents.py#L173

ESPRESSO won't detect a novel transcript that differs from an annotated transcript by only the start or end coordinate. However, if the annotation includes transcripts that differ by only the start and end then ESPRESSO will distinguish those transcripts when assigning reads. Since those two transcripts were annotated it was possible for them to end up in the rMATS-long output

kwonej0617 commented 4 months ago

Thank you! Also, I wonder if reads determined as NCD (not completely determined) are used in calculating abundance and further in the analysis of rMATS-long? If so, is there a way to exclude those reads with NCD in the downstream analysis?

86d349ba-51ee-4227-a08c-cf1f125e0983    HEK293T_rep2 NCD     ENST00000361007.7,
d7e05caa-fbe3-43c5-abfe-648773ec0572    HEK293T_rep2 NCD     ENST00000361007.7,
cea6cd62-58e9-49d7-8784-7c296fc4349e    HEK293T_rep1 NCD     NA
6765fa1d-f252-463d-9daf-d99f588e18b9    HEK293T_rep3 NCD     NA
7e2b43de-be0e-4238-be12-8010dcb1b7b2    HEK293T-rep2 NCD     NA
83bf89c1-24c4-477d-9d19-1b540b521176    HEK293T_rep2 NCD     ENST00000648446.1,ENST00000361007.7,

Thank you!

EricKutschera commented 4 months ago

The last column of the -V, --tsv_compt file shows the compatible isoforms for that read. Unless that column is NA, the read is included in the final abundance. Then rMATS-long uses the ESPRESSO abundance file so it will include those NCD reads that have compatible isoforms

ESPRESSO doesn't have an option to exclude NCD reads from the abundance calculation. You could get all the NCD read IDs from the compatible isoform output file and then remove those reads from your input files and re-run ESPRESSO. Or you could remove those read IDs from the *_read_final.txt files and re-run just the Q step

kwonej0617 commented 4 months ago

Thank you!