Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

incorrect 5' Met prediction #283

Open dstern opened 4 years ago

dstern commented 4 years ago

I am not sure where in the BRAKER pipeline this issue is relevant, but it may be helpful to post it here. I have noticed that the 5' Met start is often not predicted at the most 5' possible Met. I attach an image of one example here, but such cases are quite common among my genes of interest. The bottom track is the BRAKER output, which predicts the gene start at a downstream Met. The top track is a manual annotation from RNA-seq data, revealing a short UTR and an upstream Met. [image]

tomasbruna commented 3 years ago

Hello,

These predictions are not necessarily incorrect: BRAKER can predict the alternative downstream M if, for example:

Some estimates state that roughly 10% of eukaryotic transcripts initiate translation from an ATG other than the first one (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2603428/).

Best, Tomas

karlgem commented 3 years ago

@dstern I’m interested in this too. Do you have evidence that it should be the most upstream M in this case? And what about eukaryotes in general? I couldn’t find much evidence to support that. Thanks.

dstern commented 3 years ago

I am studying a family of proteins that carry N-terminal signal sequences. In all cases I have found so far where an upstream ATG is missed, the upstream sequence includes the signal sequence. When I have RNA-seq data that covers the 5' UTR, the coverage always extends upstream of the first ATG. These are the two pieces of evidence I use at the moment to mark the upstream ATG as the start Met. Since this is rather tedious hand annotation, I don't have a genome-wide estimate of the frequency. I just kept noticing it in my genes of interest and thought I should ping the developers.
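
One way to flag candidate cases like these automatically is to walk upstream from the predicted start codon, one codon at a time, and look for an in-frame ATG that is not separated from it by a stop codon. The following is only a minimal Python sketch, not part of BRAKER. It assumes the spliced transcript (5' UTR plus CDS) is available as a plain string and that the 0-based position of the predicted start within it is known; the function name and parameters are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def most_upstream_inframe_atg(transcript, predicted_start):
    """Return the 0-based index of the most 5' in-frame ATG that reaches the
    predicted start without an intervening stop codon.

    If no such upstream ATG exists, predicted_start is returned unchanged.
    Assumption: `transcript` is the spliced mRNA sequence as DNA letters.
    """
    seq = transcript.upper()
    best = predicted_start
    pos = predicted_start - 3
    # Step upstream codon by codon, staying in the frame of the predicted start.
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break  # an in-frame stop ends the open reading frame
        if codon == "ATG":
            best = pos  # keep going: an even more upstream ATG may exist
        pos -= 3
    return best


if __name__ == "__main__":
    # Toy transcript: 2 nt of UTR, an upstream ATG, one codon, the predicted ATG.
    toy = "GG" + "ATG" + "GCT" + "ATG" + "AAA" + "TAA"
    print(most_upstream_inframe_atg(toy, predicted_start=8))
```

A hit from such a scan only marks a candidate. Whether the upstream ATG is the real start still depends on evidence such as the signal sequence and the RNA-seq coverage of the 5' UTR.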

tomasbruna commented 3 years ago

Thank you @dstern for the additional info and for bringing this up in general. I will keep this in mind and design an experiment using one of the well-annotated model organisms to compute how many times BRAKER predicts a protein isoform shorter than a corresponding one with upstream M in the reference annotation.
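
The core comparison in such an experiment could look roughly like the sketch below. It is a hypothetical illustration, not the actual evaluation code: it assumes that predicted and reference proteins have already been paired up (for example by overlapping loci) and counts the pairs where the predicted protein is a strictly shorter, M-initiated suffix of the reference protein. The function names are made up for this example.

```python
def is_downstream_start_isoform(predicted, reference):
    """True if `predicted` equals the C-terminal part of `reference`,
    is strictly shorter, and starts with M (i.e. a downstream start was used).

    A trailing '*' (stop symbol in some protein FASTA files) is ignored.
    """
    pred = predicted.rstrip("*")
    ref = reference.rstrip("*")
    return len(pred) < len(ref) and pred.startswith("M") and ref.endswith(pred)


def count_downstream_starts(pairs):
    """Count (predicted, reference) protein pairs with a downstream start.

    Assumption: `pairs` is an iterable of already matched sequence pairs.
    """
    return sum(1 for pred, ref in pairs if is_downstream_start_isoform(pred, ref))


if __name__ == "__main__":
    toy_pairs = [
        ("MKVLA*", "MAASMKVLA*"),      # prediction starts at a downstream M
        ("MAASMKVLA*", "MAASMKVLA*"),  # identical proteins
        ("MKVLT*", "MAASMKVLA*"),      # different C-terminus, not a suffix
    ]
    print(count_downstream_starts(toy_pairs))
```

Dividing this count by the number of paired genes would give a rough frequency, under the assumption that the reference start is the correct one.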