Re-annotate RefSeq genome annotation using Braker3

Hello, I am using a reference genome of Folsomia candida, available via NCBI Datasets, for RNA-Seq expression studies. Part of the gene/transcript annotation contains multiple mRNA/CDS annotations at the same locus that evidently do not represent splice variants, C- or N-terminal extensions, etc. I found examples where the peptides predicted from the same locus are not related to each other at all (whereas I would expect related sequences from splice variants, etc.). I assume that many of these multiple CDS at the same locus are predictions without any biological meaning, and for over 6,000 loci with multiple CDS annotations it is hard to decide which CDS is the correct one.

After watching the great YouTube lecture by Katharina Hoff, I learned that there are some differences between BRAKER1, 2, and 3. I am wondering whether anybody could recommend a specific BRAKER release for re-analyzing an annotated RefSeq genome in order to get rid of spurious CDS annotations. Since all of the annotated proteins of this genome assembly are part of RefSeq, it is probably better not to use these proteins as external hints in a new analysis, to avoid detecting the same spurious proteins again. On the other hand, the gene loci themselves seem to be annotated with higher confidence. Is it possible to provide only the 'gene locus' annotations from the published reference genome to BRAKER in a new run?

Unfortunately, I could not test it directly at usegalaxy.eu since there seems to be an issue with the job (a support request has already been posted to the Galaxy team).

Best regards,

BRAKER was not designed for re-annotation. It is possible to feed in an existing annotation after reformatting it into AUGUSTUS hints format, but I would probably not even do that myself.
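For anyone who wants to try the hints route anyway, here is a minimal sketch of the reformatting step for 'gene locus' rows only. It assumes the RefSeq annotation is in GFF3 with feature type `gene`, and it writes AUGUSTUS-style hint lines of type `genicpart` with `src=M` (manual). The function name, the `refseq` source label and the priority value are made up for illustration, and whether the extrinsic configuration of a given run actually scores `genicpart` hints from source `M` is something to check rather than assume.

```python
def gene_rows_to_hints(gff3_lines, src="M", priority=4):
    """Turn GFF3 'gene' rows into AUGUSTUS-style 'genicpart' hint lines.

    Hypothetical helper: source label, src and priority are illustrative.
    """
    hints = []
    for line in gff3_lines:
        # Skip blank lines and GFF3 comments/directives.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Keep only well-formed rows describing a gene locus.
        if len(cols) != 9 or cols[2] != "gene":
            continue
        seqid, start, end, strand = cols[0], cols[3], cols[4], cols[6]
        attributes = f"src={src};pri={priority}"
        hints.append("\t".join(
            [seqid, "refseq", "genicpart", start, end, ".", strand, ".", attributes]
        ))
    return hints


example = [
    "##gff-version 3",
    "NW_1\tRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene-LOC1",
    "NW_1\tRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna-XM_1;Parent=gene-LOC1",
]
hint_lines = gene_rows_to_hints(example)
```

Only the locus coordinates and strand are carried over, so none of the questionable CDS models leak into the hints.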

What you are really looking for is a post-hoc transcript filter. You have already identified the gene loci in question. I would probably run a functional annotation pipeline, such as InterProScan, and remove the odd transcripts that have no functional assignment. If they all have a functional assignment, that may explain why they are all in RefSeq... maybe they are not that wrong.
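A rough sketch of such a filter, under some assumptions: InterProScan was run with TSV output (protein accession in the first column, and only proteins with at least one match appear in the file), and you already have a mapping from each locus to its protein IDs. The function names and the locus-to-protein mapping are hypothetical. Keeping every protein at a locus where none has a match is a design choice made here so that no locus is dropped entirely; it is not part of the suggestion above.

```python
def annotated_proteins(interpro_tsv_lines):
    """Collect protein IDs that have at least one InterProScan match."""
    hits = set()
    for line in interpro_tsv_lines:
        cols = line.rstrip("\n").split("\t")
        # Column 1 is the protein accession, column 4 the analysis name.
        if len(cols) >= 4 and cols[0]:
            hits.add(cols[0])
    return hits


def filter_loci(locus_to_proteins, hits):
    """Per locus, keep only proteins with a functional assignment.

    If no protein at a locus has a match, keep them all (assumption:
    we would rather keep a doubtful model than lose the locus).
    """
    kept = {}
    for locus, proteins in locus_to_proteins.items():
        with_function = [p for p in proteins if p in hits]
        kept[locus] = with_function if with_function else list(proteins)
    return kept


tsv = ["XP_1\tmd5sum\t300\tPfam\tPF00001\t7tm_1\t10\t200\t1e-20\tT\t01-01-2024"]
loci = {"LOC1": ["XP_1", "XP_2"], "LOC2": ["XP_3"]}
result = filter_loci(loci, annotated_proteins(tsv))
```

Loci where every protein survives the filter are the ones where the RefSeq models may not be that wrong after all.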

If you generate the BRAKER evidence (basically an AUGUSTUS hints file) and reformat the reference annotation into AUGUSTUS output format, you could also feed that into BRAKER to filter. But that seems to be a bit more scripting than doing a simple filtering based on functional annotation. (However, runtime will be much shorter.)

BRAKER is not good at re-annotating specific loci. I think you can do it for the AUGUSTUS part of the BRAKER run, since AUGUSTUS allows you to select certain loci for prediction, but I don't think it makes much sense to do this repeatedly.
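To illustrate what selecting loci in AUGUSTUS could look like, here is a small sketch that only builds the command lines, using the `--predictionStart` and `--predictionEnd` options. It assumes one FASTA file per scaffold and an already trained species parameter set; the species name, file names, flank size and helper name are placeholders, not anything taken from a BRAKER run.

```python
def augustus_region_commands(loci, species, fasta_for_scaffold, flank=1000):
    """Build one AUGUSTUS command (as an argv list) per locus.

    loci: iterable of (scaffold, start, end) tuples.
    fasta_for_scaffold: hypothetical mapping scaffold -> single-sequence FASTA.
    """
    commands = []
    for scaffold, start, end in loci:
        commands.append([
            "augustus",
            f"--species={species}",
            # Pad the locus so flanking signals are visible to the predictor.
            f"--predictionStart={max(1, start - flank)}",
            f"--predictionEnd={end + flank}",
            fasta_for_scaffold[scaffold],
        ])
    return commands


cmds = augustus_region_commands(
    loci=[("NW_1", 5000, 9000), ("NW_2", 300, 2500)],
    species="my_species",  # placeholder for a trained parameter set
    fasta_for_scaffold={"NW_1": "NW_1.fa", "NW_2": "NW_2.fa"},
)
```

Each list can be passed to `subprocess.run`, but with thousands of loci this amounts to the repeated per-locus prediction that is discouraged above.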

Gaius-Augustus / BRAKER

Re-annotate RefSeq genome annotation using Braker3 #759