ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
https://ablab.github.io/IsoQuant/

In reference-based mode, TSS and TTS are not accurately annotated. #203

Open zpliu1126 opened 5 months ago

zpliu1126 commented 5 months ago

Hi Andrey, when we updated the isoform annotation with IsoQuant using an existing reference annotation, we found that it does not update the TSS and TTS of already annotated isoforms. For example, AD1_HC04_D02_425340.2 exists in the reference annotation and is also reported in IsoQuant's updated annotation. Judging from the RNA read alignments, however, the TSS and TTS of AD1_HC04_D02_425340.2 were not updated to match the reads. With FPKM values calculated by StringTie, AD1_HC04_D02_425340.2 has an FPKM of only 5.6, while AD1_HC04_D02_425340.1 has an FPKM of 20. This is inconsistent with the read alignments: at the very least, few reads align at the RI site. After I manually corrected the TSS and TTS of AD1_HC04_D02_425340.2, its FPKM became 30 times higher than that of AD1_HC04_D02_425340.1, which seems more reasonable. Similar cases are quite common throughout the genome.
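For reference, the manual correction described above could be done along these lines. This is a minimal sketch, not part of IsoQuant or StringTie: the function name `set_transcript_ends` is hypothetical, and it assumes a standard 9-column GTF with `transcript` and `exon` records. It moves only the outer boundaries, so the intron chain is unchanged, and it does not touch the parent gene record.

```python
def set_transcript_ends(gtf_lines, transcript_id, new_start, new_end):
    """Return GTF lines with one transcript's outer boundaries moved.

    Only the transcript record and its leftmost/rightmost exon are changed.
    Coordinates are 1-based and inclusive, as in GTF.
    """
    tag = f'transcript_id "{transcript_id}"'
    rows = [line.rstrip("\n").split("\t") for line in gtf_lines]

    # Exon records of the transcript of interest (comment lines have < 9 columns).
    exon_idx = [i for i, r in enumerate(rows)
                if len(r) == 9 and r[2] == "exon" and tag in r[8]]
    if not exon_idx:
        raise ValueError(f"no exons found for {transcript_id}")

    first = min(exon_idx, key=lambda i: int(rows[i][3]))  # leftmost exon
    last = max(exon_idx, key=lambda i: int(rows[i][4]))   # rightmost exon

    # Refuse boundaries that would swallow a terminal exon entirely.
    if new_start > int(rows[first][4]) or new_end < int(rows[last][3]):
        raise ValueError("new boundaries would remove a terminal exon")

    rows[first][3] = str(new_start)
    rows[last][4] = str(new_end)
    for r in rows:
        if len(r) == 9 and r[2] == "transcript" and tag in r[8]:
            r[3], r[4] = str(new_start), str(new_end)
    return ["\t".join(r) for r in rows]
```

The result can be written back to a GTF file and passed to the quantification tool in place of the original annotation.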

image

Same example

image

Best, zpliu

andrewprzh commented 5 months ago

Dear @zpliu1126

Yes, this is a known problem (previously reported here https://github.com/ablab/IsoQuant/issues/92). If a novel isoform has a known intron chain, then TSS/TES positions are taken from the annotation, which is quite inaccurate.

It turned out to be a bit more complex than I anticipated: while reporting polyA sites is rather straightforward, detecting TSS can be non-trivial. Although we are rather short on manpower (I'm basically maintaining the project alone), I'll try to get to it in the near future, as this issue bugs me a lot.

Best, Andrey

zpliu1126 commented 5 months ago

Dear @andrewprzh

> If a novel isoform has a known intron chain, then TSS/TES positions are taken from the annotation, which is quite inaccurate.

Perhaps one could choose the TSS that yields the longer transcript, and use polyadenylation (polyA) sites near the transcription termination site (TTS) to inform the 3' end. As my example shows, inaccurate TSS and TTS positions distort comparisons of transcript expression.
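The suggestion above can be sketched roughly as follows. This is only an illustration of the idea, not IsoQuant's algorithm: the `Read` type, the `refine_ends` function, and the `min_support` and `window` parameters are all hypothetical. It assumes the reads have already been assigned to one isoform and that polyA tails have been detected per read. The TSS is taken as the most upstream read 5' end that has some support from nearby reads, and the TTS as the most frequent 3' end among polyA-containing reads; the annotated positions are kept when the reads give no evidence.

```python
from collections import Counter
from typing import NamedTuple

class Read(NamedTuple):
    start: int       # leftmost aligned position
    end: int         # rightmost aligned position
    has_polya: bool  # True if a polyA tail was detected on the read

def refine_ends(reads, strand, annotated_tss, annotated_tts,
                min_support=2, window=10):
    """Return (tss, tts) for one isoform from its assigned reads."""
    # Read 5' and 3' ends in transcript orientation.
    if strand == "+":
        five = [r.start for r in reads]
        three = [r.end for r in reads if r.has_polya]
    else:
        five = [r.end for r in reads]
        three = [r.start for r in reads if r.has_polya]

    # TSS: walk from the most upstream 5' end inward and take the first
    # position with at least `min_support` read ends within `window` bp.
    tss = annotated_tss
    for pos in sorted(five, reverse=(strand == "-")):
        support = sum(1 for p in five if abs(p - pos) <= window)
        if support >= min_support:
            tss = pos
            break

    # TTS: most frequent 3' end among reads carrying a polyA tail.
    tts = annotated_tts
    if three:
        tts = Counter(three).most_common(1)[0][0]
    return tss, tts
```

Requiring a minimum number of supporting reads keeps a single spurious alignment from extending the transcript, which is one way "longer" could be made robust; the thresholds here are arbitrary and would need validation on real data.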

Best, zpliu

andrewprzh commented 5 months ago

This kind of strategy probably makes sense, yes. In any case, it takes quite some time to implement and validate, and to make sure the algorithmic changes do not make things worse.

I'll keep you updated on this matter!

Best, Andrey