Selecting protein sequences after merging annotation sets; what do the anno1 and anno2 prefixes mean in gtf file?

Hi, thanks for this great tool. I ran it on a case similar to the example on this GitHub page and got the gtf file. For each transcript in the gtf file, I then wanted to pull out the corresponding amino acid sequence from the augustus.hints.aa files in the braker1_out and braker2_out working folders. I noticed that the transcript IDs in the gtf file are prefixed with anno1 or anno2.

For example, I have two transcripts with the following names: anno2.g23134.t1 and anno1.g27019.t1

If I go to the braker1_out folder and search augustus.hints.aa for an amino acid sequence called g23134.t1, it's there. But it's also there if I search the augustus.hints.aa file in the braker2_out folder.

I was thinking that anno1 means the corresponding sequence is found in the braker1 folder and anno2 means it is found in the braker2 folder, since I passed these files to TSEBRA in that order, but I'm not sure at the moment.

Can anyone clear this up? Thanks. I just need to know which protein sequence to pull out for each transcript in the merged gtf file. It doesn't seem to be something that TSEBRA will do for you.

Thanks - Robert

Hi Robert,

You are correct that the last part of a transcript ID (e.g. g23134.t1) matches the ID from its source (i.e. BRAKER1 or BRAKER2). The first part ("anno1" or "anno2") specifies the input gene set in which the transcript can be found: "anno1" refers to the first gene set listed after the "--gtf"/"-g" option, and "anno2" to the second one. For example, if you called TSEBRA like

    ./bin/tsebra.py -g braker1_out/augustus.hints.gtf,braker2_out/augustus.hints.gtf -c default.cfg \
        -e braker1_out/hintsfile.gff,braker2_out/hintsfile.gff \
        -o tsebra.gtf

then you will find transcript "anno1.g27019.t1" in the output of BRAKER1.
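The prefix rule above can be turned into a small lookup. The following is only a minimal Python sketch, not part of TSEBRA: the function names are hypothetical, and it assumes that the TSEBRA gtf carries the prefixed ID in a `transcript_id "..."` attribute and that each augustus.hints.aa header starts with the unprefixed ID (e.g. `>g23134.t1`). Since the same unprefixed ID can occur in both files, as noted in the question, the prefix decides which file is consulted.

```python
import re


def split_tsebra_id(tid):
    """Split e.g. 'anno2.g23134.t1' into (2, 'g23134.t1')."""
    m = re.fullmatch(r"anno(\d+)\.(.+)", tid)
    if m is None:
        raise ValueError(f"not an annoN-prefixed transcript ID: {tid!r}")
    return int(m.group(1)), m.group(2)


def read_fasta(lines):
    """Parse FASTA lines into {id: sequence}; id is the first word of the header."""
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {key: "".join(parts) for key, parts in seqs.items()}


def transcript_ids_from_gtf(lines):
    """Collect unique transcript IDs, in order of first appearance.

    Assumption: IDs are stored as transcript_id "..." in the attribute column.
    """
    pattern = re.compile(r'transcript_id "([^"]+)"')
    seen = {}
    for line in lines:
        if line.startswith("#"):
            continue
        m = pattern.search(line)
        if m:
            seen.setdefault(m.group(1), None)
    return list(seen)


def select_proteins(tsebra_ids, fasta_by_set):
    """Look up each transcript in the gene set named by its prefix.

    fasta_by_set must be in the same order as the files given to --gtf/-g,
    so that index 0 corresponds to 'anno1', index 1 to 'anno2', and so on.
    """
    selected = {}
    for tid in tsebra_ids:
        set_number, original_id = split_tsebra_id(tid)
        selected[tid] = fasta_by_set[set_number - 1][original_id]
    return selected
```

To use it with the files from this thread, one would pass open file handles, for example `read_fasta(open("braker1_out/augustus.hints.aa"))` and `read_fasta(open("braker2_out/augustus.hints.aa"))` as the two entries of `fasta_by_set`, and `transcript_ids_from_gtf(open("tsebra.gtf"))` as `tsebra_ids`.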

If you want to get the amino acid sequences of the transcripts in the TSEBRA output, you can also compute them directly with the script 'gtf2aa' from the AUGUSTUS repository, e.g.:

    gtf2aa genome.fa tsebra.gtf aa_seqs.fa

Best, Lars

Gaius-Augustus / TSEBRA
