kwonej0617 closed this issue 4 months ago
If the isoform differences file is empty (just the header), then the two transcripts differ only by their endpoints.
HSALNT0231493 and HSALNT0386195 have the same sequence of splice junctions. They differ only in their start and end coordinates. From https://ngdc.cncb.ac.cn/lncbook/files/lncRNA_LncBookv2.0_GRCh38.gtf.gz the exon coordinates are:
HSALNT0231493: 54919057-54919209, 54920298-54920410, 54923585-54923650, 54925101-54925213, 54928770-54929153
HSALNT0386195: 54917968-54919209, 54920298-54920410, 54923585-54923650, 54925101-54925213, 54928770-54928838
The script that outputs the isoform differences doesn't consider the transcript start or end coordinates: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/FindAltTSEvents.py#L173
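As an illustration of why the two transcripts look identical once the endpoints are ignored, here is a minimal Python sketch that derives the splice junction chain from the exon coordinates listed above. The helper name `splice_junctions` is hypothetical and this is not the code in FindAltTSEvents.py; it only shows the comparison idea.

```python
def splice_junctions(exons):
    """Return the intron chain as (exon_end, next_exon_start) pairs.

    Hypothetical helper for illustration; not taken from rMATS-long.
    """
    exons = sorted(exons)
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


# Exon coordinates copied from the LncBook GTF entries quoted above.
hsalnt0231493 = [
    (54919057, 54919209),
    (54920298, 54920410),
    (54923585, 54923650),
    (54925101, 54925213),
    (54928770, 54929153),
]
hsalnt0386195 = [
    (54917968, 54919209),
    (54920298, 54920410),
    (54923585, 54923650),
    (54925101, 54925213),
    (54928770, 54928838),
]

# Same junction chain means no internal splicing difference to report.
same_chain = splice_junctions(hsalnt0231493) == splice_junctions(hsalnt0386195)

# The transcript start and end are the only coordinates that differ.
endpoints_a = (hsalnt0231493[0][0], hsalnt0231493[-1][1])
endpoints_b = (hsalnt0386195[0][0], hsalnt0386195[-1][1])

print("same splice junctions:", same_chain)
print("endpoints:", endpoints_a, "vs", endpoints_b)
```

Both transcripts share the four junctions between the five exons, so a comparison that ignores the transcript start and end has nothing to write below the header.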
ESPRESSO won't detect a novel transcript that differs from an annotated transcript only by its start or end coordinate. However, if the annotation includes transcripts that differ only by their start and end coordinates, ESPRESSO will distinguish those transcripts when assigning reads. Because these two transcripts were both annotated, it was possible for them to end up in the rMATS-long output.
Thank you! Also, are reads classified as NCD (not completely determined) used when calculating abundance, and later in the rMATS-long analysis? If so, is there a way to exclude NCD reads from the downstream analysis?
86d349ba-51ee-4227-a08c-cf1f125e0983 HEK293T_rep2 NCD ENST00000361007.7,
d7e05caa-fbe3-43c5-abfe-648773ec0572 HEK293T_rep2 NCD ENST00000361007.7,
cea6cd62-58e9-49d7-8784-7c296fc4349e HEK293T_rep1 NCD NA
6765fa1d-f252-463d-9daf-d99f588e18b9 HEK293T_rep3 NCD NA
7e2b43de-be0e-4238-be12-8010dcb1b7b2 HEK293T-rep2 NCD NA
83bf89c1-24c4-477d-9d19-1b540b521176 HEK293T_rep2 NCD ENST00000648446.1,ENST00000361007.7,
Thank you!
The last column of the -V, --tsv_compt file shows the compatible isoforms for that read. Unless that column is NA, the read is included in the final abundance. rMATS-long then uses the ESPRESSO abundance file, so it will include those NCD reads that have compatible isoforms.
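To see how many NCD reads would contribute to abundance under that rule, a rough Python sketch like the following could tally them. It assumes the compatible isoform file has four whitespace-separated columns (read ID, sample, classification, compatible isoforms) as in the lines pasted above; the function name `summarize_ncd` is hypothetical, and the real file should be checked before relying on this layout.

```python
from collections import Counter


def summarize_ncd(lines):
    """Count NCD reads with and without compatible isoforms.

    Assumed columns: read ID, sample, classification, compatible isoforms.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[2] != "NCD":
            continue
        # The last column is a comma-separated list, or NA when empty.
        isoforms = [name for name in fields[3].split(",") if name and name != "NA"]
        counts["counted" if isoforms else "not_counted"] += 1
    return counts


# The example rows quoted earlier in this thread.
example = [
    "86d349ba-51ee-4227-a08c-cf1f125e0983 HEK293T_rep2 NCD ENST00000361007.7,",
    "d7e05caa-fbe3-43c5-abfe-648773ec0572 HEK293T_rep2 NCD ENST00000361007.7,",
    "cea6cd62-58e9-49d7-8784-7c296fc4349e HEK293T_rep1 NCD NA",
    "6765fa1d-f252-463d-9daf-d99f588e18b9 HEK293T_rep3 NCD NA",
    "7e2b43de-be0e-4238-be12-8010dcb1b7b2 HEK293T-rep2 NCD NA",
    "83bf89c1-24c4-477d-9d19-1b540b521176 HEK293T_rep2 NCD ENST00000648446.1,ENST00000361007.7,",
]

ncd_counts = summarize_ncd(example)
print(dict(ncd_counts))
```

To run it on a real file, pass the open file handle instead of the `example` list.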
ESPRESSO doesn't have an option to exclude NCD reads from the abundance calculation. You could get all the NCD read IDs from the compatible isoform output file, remove those reads from your input files, and re-run ESPRESSO. Alternatively, you could remove those read IDs from the *_read_final.txt files and re-run just the Q step.
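A possible way to script the second approach is sketched below in Python. Two things here are assumptions, not confirmed ESPRESSO behavior: that the compatible isoform file is whitespace-separated with the read ID first and the classification third, and that each line of a *_read_final.txt file is tab-separated with the read ID in the column given by `id_column`. The helper names `ncd_read_ids` and `drop_reads` are hypothetical; inspect a real file and adjust `id_column` before using anything like this.

```python
def ncd_read_ids(compat_lines):
    """Collect IDs of reads classified as NCD (assumed column layout)."""
    ids = set()
    for line in compat_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "NCD":
            ids.add(fields[0])
    return ids


def drop_reads(lines, read_ids, id_column=0):
    """Return the lines whose read ID column is not in read_ids.

    id_column is an assumption about the *_read_final.txt layout.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > id_column and fields[id_column] in read_ids:
            continue
        kept.append(line)
    return kept


compat = [
    "86d349ba-51ee-4227-a08c-cf1f125e0983 HEK293T_rep2 NCD ENST00000361007.7,",
    "cea6cd62-58e9-49d7-8784-7c296fc4349e HEK293T_rep1 NCD NA",
]
# Made-up placeholder lines standing in for a *_read_final.txt file.
read_final = [
    "86d349ba-51ee-4227-a08c-cf1f125e0983\tplaceholder\n",
    "hypothetical-read-id\tplaceholder\n",
]

to_remove = ncd_read_ids(compat)
filtered = drop_reads(read_final, to_remove)
print(len(read_final) - len(filtered), "line(s) removed")
```

Writing `filtered` to a copy of each file, rather than overwriting the original, keeps the unfiltered ESPRESSO output available for comparison after re-running the Q step.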
Thank you!
Hi, @EricKutschera
I have noticed that some files with the following file name format don't have any contents. I wonder which cases lead to this outcome: HSALNG0111608_isoform_differences_HSALNT0386195_to_HSALNT0231493.tsv
Here is the detailed information for the gene from 'differential_transcripts.tsv'.
Thank you. Looking forward to hearing from you.