Transcripts assigned separate gene IDs after combining GTFs

GwynHN commented 1 year ago

Hi TSEBRA developers,

Thanks for the great tools, I am getting some really nice results.

After combining the RNA and protein evidence, I noticed that some transcripts that shared a gene ID in the original annotation (labeled as different transcripts of the same gene) are assigned different gene IDs in the combined output. For example, anno1.g23605.t1 and anno1.g23605.t2 become g_22925 and g_22926 in the combined GTF. There are several examples of this, but it doesn't happen every time and I haven't been able to see a pattern.

With the repo I cloned back in November 2022 (v1.0.3), I ran the following:

bin/tsebra.py -g RNA/augustus.hints.gtf,Protein/augustus.hints.gtf -c config/default.cfg -e RNA/hintsfile.gff,Protein/hintsfile.gff -o rna_prot_combined.gtf

Best, Gwyneth

LarsGab commented 1 year ago

Hi,

TSEBRA groups into the same gene all transcripts that have overlapping coding regions in the same open reading frame. Without more information, I would assume that in your case these transcripts aren't in the same gene (at least by this definition). If you want to ignore the frame, you can use the new --ignore_tx_phase option.

Best, Lars

GwynHN commented 1 year ago

Hi Lars,

Ok, I see! In the one example I gave, the two original transcripts had the same start and stop positions for the gene and transcript features, but different start codons annotated. This is similar to issue #26.

Thanks! Gwyneth

gbdias commented 10 months ago

Hi @LarsGab

Is there a way we can have the --ignore_tx_phase option in the long_reads branch as well?

Gaius-Augustus / TSEBRA

Transcripts assigned separate gene IDs after combining GTFs #32