GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0

TAMA Collapse assigns truncated reads to wrong transcript #104

Closed cathycoutu closed 10 months ago

cathycoutu commented 1 year ago

Thanks for providing TAMA, I'm enjoying the flexibility, filtering options, and ability to track what's happening. This is my first time working with IsoSeq data. My goal is to improve the accuracy of our transcriptome by merging the IsoSeq-derived transcriptome with our existing short-read derived transcriptome.

I'm starting with FLNC reads from IsoSeq3.

When I collapse reads into transcripts with the no_cap option and some of the reads are partially degraded at the 5' end, the shorter, partially degraded reads are assigned to transcript models that are supported by only a single full-length read, whenever the variation that distinguishes the models lies 5' of the start of the degraded read.

python ~/bin/tama/tama_collapse.py -s tama_split_20_1.sam -f genome.fa -p tama_split_20_1 -i 99 -x no_cap -a 100 -z 100 -sj sj_priority -lde 1 -sjt 20

Here's an alignment to show you the problem. All the reads prefixed with "2" (at the top of the alignment) were assigned to model 2. All the reads prefixed with "3" were assigned to model 3. Model 2 is represented by the majority of the reads. In model 3, the first intron has not been spliced out. The shorter reads could have been assigned to either model with equal confidence.

[Image: problem alignment]

Model 3 is actually supported by only a single read (m64128_230204_024757/124062102/ccs). Assigning the partially degraded reads to model 3 makes it very difficult to remove the model automatically. It is not removed by remove_single_read_models.py, because it appears to be supported by 17 reads. This problem occurs for many high-read-depth genes in my dataset (note that model 2 was supported by over 1000 reads).

Can you recommend a way to remove partially-degraded reads prior to collapsing, or a setting in TAMA_collapse which would assign ambiguous reads (which could map to more than one model) to the model with the most reads?
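As a rough illustration of the check being asked for, the sketch below counts how many reads assigned to each model actually reach the model's 5' end, so that a model such as model 3 shows up as having a single full-length read even though 17 reads are assigned to it. This is not part of TAMA. The function name, the input layout (read ID, assigned model, 5' coordinate) and the toy coordinates are all hypothetical, and the real inputs would have to be built from the collapse output and the alignments. The 100 bp tolerance simply mirrors the `-a 100` value in the command above.

```python
from collections import Counter

def full_length_support(assignments, model_5p, tolerance=100):
    """Count, per model, the reads whose 5' end is within `tolerance` bp
    of the model's 5' end. Reads that start further downstream
    (likely 5'-degraded) are not counted."""
    support = Counter({model: 0 for model in model_5p})
    for read_id, model, read_5p in assignments:
        if abs(read_5p - model_5p[model]) <= tolerance:
            support[model] += 1
    return dict(support)

# Hypothetical toy data: (read ID, assigned model, 5' coordinate of the read).
assignments = [
    ("read_a", "model_2", 1000),
    ("read_b", "model_2", 1012),
    ("read_c", "model_3", 1004),  # the only full-length read for model_3
    ("read_d", "model_3", 1650),  # 5'-degraded
    ("read_e", "model_3", 1720),  # 5'-degraded
]
# Hypothetical 5' end coordinate of each transcript model.
model_5p = {"model_2": 1000, "model_3": 1000}

support = full_length_support(assignments, model_5p)
# Models with at most one full-length read are candidates for manual review.
weak_models = sorted(m for m, n in support.items() if n <= 1)
print(support)
print(weak_models)
```

The sketch assumes plus-strand coordinates. For minus-strand genes the 5' end is the larger coordinate, so the comparison would need the transcript end rather than the start.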

I've attached fasta files containing the reads in the alignment, in case they would be useful.

Thanks again, Cathy

Attachments: G4714.2 reads subset.txt, G4714.3 reads.txt

GenomeRIK commented 10 months ago

Hi Cathy,

Using the capped mode for collapsing would solve this issue.
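For reference, a minimal sketch of what that change could look like, starting from the command posted in the issue. It assumes that `capped` is the value of `-x` that selects capped mode, which should be checked against the TAMA Collapse documentation. The helper `set_option` is hypothetical and only rewrites the command string; it does not run anything.

```python
# Command exactly as posted in the issue (no_cap mode).
original = (
    "python ~/bin/tama/tama_collapse.py -s tama_split_20_1.sam -f genome.fa "
    "-p tama_split_20_1 -i 99 -x no_cap -a 100 -z 100 -sj sj_priority "
    "-lde 1 -sjt 20"
)

def set_option(command, flag, value):
    """Return `command` with the argument that follows `flag` replaced by `value`.
    Assumes the command contains no quoted arguments with spaces."""
    args = command.split()
    idx = args.index(flag)  # raises ValueError if the flag is absent
    args[idx + 1] = value
    return " ".join(args)

# Assumption: "capped" is the -x value for capped mode.
capped_command = set_option(original, "-x", "capped")
print(capped_command)
```

All other options are left unchanged here. Whether thresholds such as `-a` should also be adjusted for capped mode is not addressed in this thread.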

Sorry for the late reply!

Thank you, Richard

cathycoutu commented 10 months ago

No worries. I asked in two different places and got the same answer months ago.

In fact, I just presented on using TAMA to all the Agriculture and Agri-Food Canada bioinformaticians today!

Great software, thank you!

Cathy

GenomeRIK commented 10 months ago

Hi Cathy,

That's great that you were able to get some answers! And thank you so much for using TAMA! I'll try to be quicker with my responses next time but feel free to email me if it is urgent.

Thank you, Richard