GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0

Understanding the behaviour of duplicate transcript groups #139

Open omarelgarwany opened 1 month ago

omarelgarwany commented 1 month ago

Hello

Thanks very much for developing TAMA. I'm trying to understand how the merging of duplicate transcript groups behaves. I ran tama_merge.py on a set of 15 transcriptomes from 15 samples, using the following command: python tama_merge.py -f file_list.txt -p /path/to/output/project -m 5 -a 50 -z 100

I'm intentionally running it without the option -d merge_dup because I wanted to see what the behaviour around duplicate groups looks like. I'm using permissive 5' and 3' thresholds (-a and -z) and allowing splice junction mismatches (-m) because I wanted to replicate the default behaviour of isoseq collapse, which I use to collapse transcripts within a single sample. As expected, TAMA stopped at this stage:

190218 190523 ['SAMPLENAME14548286_PB.21.188', 'SAMPLENAME14548285_PB.26.225', 'SAMPLENAME14548288_PB.34.181', 'SAMPLENAME14548275_PB.13.196', 'SAMPLENAME14548275_PB.13.197', 'SAMPLENAME14548287_PB.34.170', 'SAMPLENAME14548275_PB.13.195', 'SAMPLENAME14548286_PB.21.187', 'SAMPLENAME14548277_PB.29.173', 'SAMPLENAME14726268_PB.9.248', 'SAMPLENAME14726268_PB.9.249']
chr1 190218 190523 G2;G2.tmp.909 40 - 190218 190523 200,0,255 1 305 0
chr1 190218 190523 G2;G2.tmp.496 40 - 190218 190523 200,0,255 1 305 0
a###########################################
chr1 190013 190347 SAMPLENAME14548278_PB.12.119;SAMPLENAME14548278_PB.12.119 40 - 190013 190347 255,0,0 1 334 0
chr1 190129 190939 SAMPLENAME14726268_PB.9.247;SAMPLENAME14726268_PB.9.247 40 - 190129 190939 255,0,0 1 810 0
chr1 190218 190523 SAMPLENAME14548286_PB.21.187;SAMPLENAME14548286_PB.21.187 40 - 190218 190523 255,0,0 1 305 0
chr1 190019 190919 SAMPLENAME14548280_PB.7.150;SAMPLENAME14548280_PB.7.150 40 - 190019 190919 255,0,0 1 900 0
chr1 190218 190701 SAMPLENAME14548277_PB.29.173;SAMPLENAME14548277_PB.29.173 40 - 190218 190701 255,0,0 1 483 0
chr1 190218 190523 SAMPLENAME14548275_PB.13.195;SAMPLENAME14548275_PB.13.195 40 - 190218 190523 255,0,0 1 305 0
chr1 190095 190356 SAMPLENAME14548284_PB.11.197;SAMPLENAME14548284_PB.11.197 40 - 190095 190356 255,0,0 1 261 0
b###########################################
chr1 190351 190616 SAMPLENAME14548286_PB.21.188;SAMPLENAME14548286_PB.21.188 40 - 190351 190616 255,0,0 1 265 0
chr1 190327 190630 SAMPLENAME14548285_PB.26.225;SAMPLENAME14548285_PB.26.225 40 - 190327 190630 255,0,0 1 303 0
chr1 190349 190989 SAMPLENAME14548288_PB.34.181;SAMPLENAME14548288_PB.34.181 40 - 190349 190989 255,0,0 1 640 0
chr1 190327 191173 SAMPLENAME14548275_PB.13.196;SAMPLENAME14548275_PB.13.196 40 - 190327 191173 255,0,0 1 846 0
chr1 190439 190689 SAMPLENAME14548275_PB.13.197;SAMPLENAME14548275_PB.13.197 40 - 190439 190689 255,0,0 1 250 0
chr1 190287 191003 SAMPLENAME14548287_PB.34.170;SAMPLENAME14548287_PB.34.170 40 - 190287 191003 255,0,0 1 716 0
chr1 190218 190523 SAMPLENAME14548275_PB.13.195;SAMPLENAME14548275_PB.13.195 40 - 190218 190523 255,0,0 1 305 0
chr1 190218 190523 SAMPLENAME14548286_PB.21.187;SAMPLENAME14548286_PB.21.187 40 - 190218 190523 255,0,0 1 305 0
chr1 190218 190701 SAMPLENAME14548277_PB.29.173;SAMPLENAME14548277_PB.29.173 40 - 190218 190701 255,0,0 1 483 0
chr1 190351 190906 SAMPLENAME14726268_PB.9.248;SAMPLENAME14726268_PB.9.248 40 - 190351 190906 255,0,0 1 555 0
chr1 190456 190745 SAMPLENAME14726268_PB.9.249;SAMPLENAME14726268_PB.9.249 40 - 190456 190745 255,0,0 1 289 0
By default TAMA merge does not allow merging of duplicate transcript groups. Duplicate transcript groups occur when different groupings of transcripts results in the same collapsed model. If you would like to merge duplicate transcript groups please add -d merge_dup to the arguments.

My genes are named with the Isoseq convention of PB.# and transcripts as PB.#.#. Given my parameters, I can see why each of these transcripts belongs to its respective group. If we think of transcripts as nodes in a graph, then any two nodes in a transcript group are connected, either directly or indirectly (I'm not sure whether this understanding is correct).
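To make the graph picture concrete, here is a minimal Python sketch of grouping by connected components. It only illustrates the mental model described above and is not a claim about how tama_merge.py forms groups internally. The function name `connected_components` and the edge list are hypothetical: the edges are invented for illustration, not derived from the log.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes that are linked directly or indirectly (union-find)."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical "these two transcripts match" links, for illustration only.
edges = [
    ("PB.12.119", "PB.13.195"),
    ("PB.13.195", "PB.21.187"),
    ("PB.21.187", "PB.21.188"),
    ("PB.9.248", "PB.9.249"),
]
components = connected_components(edges)
print(components)
```

Under this model, PB.12.119 and PB.21.188 end up in the same component even though they are never linked directly, which is the "connected indirectly" case.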

Here is where I'm puzzled. Both transcript groups share the transcripts PB.13.195 and PB.21.187. This suggests that the two transcript groups are really one and the same, since both graphs (that is, both transcript groups) contain the nodes PB.13.195 and PB.21.187.
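The overlap between the two groups can be checked directly with a set intersection. This sketch copies the transcript IDs from the log output, assuming the records between the a#### and b#### markers form one group and the records after b#### form the other; the variable names are my own.

```python
# IDs copied from the log; group membership is assumed from the a###/b### markers.
group_a = {
    "SAMPLENAME14548278_PB.12.119",
    "SAMPLENAME14726268_PB.9.247",
    "SAMPLENAME14548286_PB.21.187",
    "SAMPLENAME14548280_PB.7.150",
    "SAMPLENAME14548277_PB.29.173",
    "SAMPLENAME14548275_PB.13.195",
    "SAMPLENAME14548284_PB.11.197",
}
group_b = {
    "SAMPLENAME14548286_PB.21.188",
    "SAMPLENAME14548285_PB.26.225",
    "SAMPLENAME14548288_PB.34.181",
    "SAMPLENAME14548275_PB.13.196",
    "SAMPLENAME14548275_PB.13.197",
    "SAMPLENAME14548287_PB.34.170",
    "SAMPLENAME14548275_PB.13.195",
    "SAMPLENAME14548286_PB.21.187",
    "SAMPLENAME14548277_PB.29.173",
    "SAMPLENAME14726268_PB.9.248",
    "SAMPLENAME14726268_PB.9.249",
}

# Transcripts that appear in both groups.
shared = sorted(group_a & group_b)
for transcript_id in shared:
    print(transcript_id)
```

On the IDs as copied from the log, the intersection also includes SAMPLENAME14548277_PB.29.173, so the two groups appear to share three transcripts rather than two.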

So my question is why are these two transcript groups not considered one group by default? What prevents TAMA from merging all these from the outset? And how were these two transcript groups formed separately in the first place?

Thank you very much