Busco and number of genes when use the protein support file (in Braker3)

ardy20 commented 11 months ago

Hello Developers

We used BUSCO to assess the completeness of an Australian wild rice genome before and after annotation with BRAKER3 (with RNA-seq support from leaf tissue). The genome is a high-quality, chromosome-level assembly built from HiFi data. Before annotation, the genome had a BUSCO score above 99.5%. After annotation, we ran BUSCO (Viridiplantae dataset) on the protein file produced by BRAKER3, and it reported only 65% complete single-copy genes.

We tried to improve the annotation of this wild rice by adding a Viridiplantae or rice annotated protein file on top of the RNA-seq support. This brought the BUSCO score back up to 95%, but the number of genes and CDS dropped significantly (from 52K to 32K).

My questions are:

1) Is BUSCO a good test for assessing annotation quality (using the protein file generated by BRAKER3)?
2) Why does the number of genes/CDS drop when we use a protein dataset such as Viridiplantae or annotated rice proteins?

Regards

KatharinaHoff commented 11 months ago

BUSCO is a tool to measure sensitivity with respect to a rather small set of core genes in a clade. We commonly use it. If you want to measure sensitivity with respect to a larger number of genes, try OMArk.
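To compare BUSCO results between two annotation runs, the one-line score string from BUSCO's short summary can be parsed programmatically. The following is a minimal sketch, assuming the usual `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout; the function name and the example numbers are hypothetical and not taken from this thread's actual runs.

```python
import re

# Pattern for the one-line BUSCO score string (assumed layout; newer BUSCO
# versions may append extra fields, which this pattern simply ignores).
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the BUSCO percentages (C, S, D, F, M) and the dataset size n."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError(f"no BUSCO score string found in: {line!r}")
    scores = {key: float(val) for key, val in match.groupdict().items() if key != "n"}
    scores["n"] = int(match.group("n"))
    return scores


# Made-up score strings standing in for two annotation runs.
rnaseq_only = parse_busco_summary("C:65.0%[S:60.0%,D:5.0%],F:10.0%,M:25.0%,n:425")
with_proteins = parse_busco_summary("C:95.0%[S:90.0%,D:5.0%],F:2.0%,M:3.0%,n:425")

delta = with_proteins["C"] - rnaseq_only["C"]
print(f"Complete BUSCOs changed by {delta:+.1f} percentage points")
```

Tracking the missing (`M`) and fragmented (`F`) fractions alongside `C` helps show whether genes were dropped entirely or only partially predicted.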

AUGUSTUS and GeneMark have a tendency to overpredict genes. TSEBRA tries to fix this based on evidence, but if there is too little evidence, too many transcripts may be dropped. BUSCOs often have enough evidence, which is why the BUSCO number went up when you added the plants OrthoDB partition.

I usually re-run TSEBRA enforcing the best gene set (either augustus.hints.gtf or genemark.gtf) according to BUSCO if this happens. You find instructions at https://github.com/KatharinaHoff/BRAKER-TSEBRA-Workshop/blob/main/GenomeAnnotation.ipynb .

Rice was previously reported to have ~35K genes. 32K doesn't seem so far off. You will likely have an inflated gene count if you re-run TSEBRA as suggested.
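To check gene counts such as the 52K and 32K figures mentioned in this thread, genes and transcripts can be counted directly from a GTF file. This is a rough sketch, assuming CDS lines carry `gene_id "..."` and `transcript_id "..."` attributes; the function name and the example records are hypothetical.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_genes_and_transcripts(gtf_lines):
    """Count distinct protein-coding genes and transcripts from CDS features."""
    transcripts_by_gene = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF has nine tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        gene = GENE_ID.search(fields[8])
        transcript = TRANSCRIPT_ID.search(fields[8])
        if gene and transcript:
            transcripts_by_gene[gene.group(1)].add(transcript.group(1))
    n_transcripts = sum(len(ts) for ts in transcripts_by_gene.values())
    return len(transcripts_by_gene), n_transcripts


# Tiny made-up example: two genes, one of them with two transcripts.
EXAMPLE_GTF = [
    '# comment line\n',
    'chr1\tAUGUSTUS\tCDS\t100\t200\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n',
    'chr1\tAUGUSTUS\tCDS\t300\t400\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n',
    'chr1\tAUGUSTUS\tCDS\t100\t250\t.\t+\t0\ttranscript_id "g1.t2"; gene_id "g1";\n',
    'chr1\tAUGUSTUS\texon\t100\t250\t.\t+\t.\ttranscript_id "g1.t2"; gene_id "g1";\n',
    'chr2\tGeneMark.hmm3\tCDS\t500\t900\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";\n',
]

n_genes, n_transcripts = count_genes_and_transcripts(EXAMPLE_GTF)
print(f"{n_genes} genes, {n_transcripts} transcripts")
```

For a real file, pass an open file handle (for example `open("braker.gtf")`) instead of the list. Counting genes separately from transcripts matters because alternative isoforms can make a transcript count look like an inflated gene count.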

ardy20 commented 11 months ago

Thank you for the clarification! Additionally, I'm curious about how Braker3 handles the soft- and hard-masking of genomes, a process typically accomplished through RepeatModeler followed by RepeatMasker before annotation.

KatharinaHoff commented 11 months ago

Hardmasking means nucleotides are ignored.

Augustus can initiate a gene in an unmasked region and extend into the softmasked region.

We recommend softmasking.
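To verify how a genome FASTA is masked before running BRAKER3, one can measure the fraction of lowercase (softmasked) bases and of `N` (hardmasked or gap) bases. This is a minimal sketch; the function name and the toy sequence are hypothetical, and note that `N` runs can also be assembly gaps rather than masked repeats.

```python
def masking_fractions(fasta_lines):
    """Return the fractions of softmasked (lowercase) and N bases in a FASTA."""
    total = soft = hard = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip sequence headers
        seq = line.strip()
        total += len(seq)
        # N/n is counted as hardmasked (or gap), never as softmasked.
        hard += sum(1 for base in seq if base in "Nn")
        soft += sum(1 for base in seq if base.islower() and base != "n")
    if total == 0:
        return {"soft": 0.0, "hard": 0.0}
    return {"soft": soft / total, "hard": hard / total}


# Toy example: 16 bases, 4 lowercase and 4 N.
EXAMPLE_FASTA = [">chr1\n", "ACGTacgtNNNN\n", "ACGT\n"]

fractions = masking_fractions(EXAMPLE_FASTA)
print(f"softmasked: {fractions['soft']:.1%}, N: {fractions['hard']:.1%}")
```

A softmasked genome should show a substantial lowercase fraction and few `N` bases beyond assembly gaps; a large `N` fraction suggests the genome was hardmasked.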

ardy20 commented 11 months ago

Could you please explain how to set the weights for AUGUSTUS and GeneMark in the TSEBRA configuration file?

KatharinaHoff commented 11 months ago

I usually do not alter the parameters. There's a command line option to enforce entire gene sets. The linked jupyter notebook has a command example.
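For readers who do not have the notebook at hand, the following sketch shows how such a TSEBRA call might be assembled from Python. It assumes that `tsebra.py` accepts `-k` (gene sets to keep in full), `-g` (other gene sets), `-e` (hints file), and `-o` (output); these flags and all file paths are assumptions here and should be checked against `tsebra.py --help` and the linked notebook before use.

```python
import shlex


def build_tsebra_command(keep_gtf, other_gtf, hints_gff, out_gtf, cfg=None):
    """Assemble a tsebra.py argument list that enforces one entire gene set.

    The flag names are assumptions (see the text above), not a verified API.
    """
    cmd = [
        "tsebra.py",
        "-k", keep_gtf,   # gene set whose transcripts should all be kept
        "-g", other_gtf,  # gene set filtered by evidence as usual
        "-e", hints_gff,  # extrinsic evidence (hints) file
        "-o", out_gtf,    # combined output gene set
    ]
    if cfg is not None:
        cmd += ["-c", cfg]  # optional TSEBRA configuration file
    return cmd


# Hypothetical file names; pick the set to keep according to BUSCO.
cmd = build_tsebra_command(
    keep_gtf="augustus.hints.gtf",
    other_gtf="genemark.gtf",
    hints_gff="hintsfile.gff",
    out_gtf="braker_enforced.gtf",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

As noted earlier in the thread, enforcing an entire gene set this way will likely inflate the gene count, so it is worth re-checking both BUSCO and the number of genes afterwards.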

ardy20 commented 11 months ago

Thanks!

Gaius-Augustus / BRAKER

Busco and number of genes when use the protein support file (in Braker3) #708