Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Performance (based on BUSCO) of Braker3, Stringtie, and integration of miniprot #647

Closed BitaoQiu closed 8 months ago

BitaoQiu commented 1 year ago

Dear Braker3 developers,

Here is a summary of our BUSCO results for the annotation performance of a termite genome with Braker 3 (RNAseq + insecta_odb10) and StringTie2. I also included the result of MiniBUSCO, which is based on miniprot.

| | Complete BUSCOs | Complete and single-copy BUSCOs | Complete and duplicated BUSCOs | Fragmented BUSCOs | Missing BUSCOs |
|---|---|---|---|---|---|
| Braker 3 | 1345 | 824 | 521 | 7 | 15 |
| StringTie2 | 1361 | 600 | 761 | 4 | 2 |
| MiniBUSCO (miniprot) | | 1330 | 36 | 1 | 0 |
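To make the three rows easier to compare, here is a minimal Python sketch that recomputes the totals and percentages from the counts in the table. It assumes that the four numbers in the MiniBUSCO row are single-copy, duplicated, fragmented and missing (the "Complete" cell is empty in the post), and the `summarize` helper is only an illustration, not part of BUSCO or compleasm.

```python
# Counts copied from the table: (single-copy, duplicated, fragmented, missing).
# Assumption: the MiniBUSCO row lists these four categories in this order.
counts = {
    "Braker 3": (824, 521, 7, 15),
    "StringTie2": (600, 761, 4, 2),
    "MiniBUSCO (miniprot)": (1330, 36, 1, 0),
}


def summarize(single, duplicated, fragmented, missing):
    """Derive the complete count, the lineage total and rounded percentages."""
    complete = single + duplicated
    total = complete + fragmented + missing
    return {
        "complete": complete,
        "total": total,
        "complete_pct": round(100 * complete / total, 2),
        "duplicated_pct": round(100 * duplicated / total, 2),
        "missing_pct": round(100 * missing / total, 2),
    }


summary = {name: summarize(*row) for name, row in counts.items()}
for name, s in summary.items():
    print(
        f"{name}: C={s['complete']}/{s['total']} ({s['complete_pct']}%), "
        f"D={s['duplicated_pct']}%, M={s['missing_pct']}%"
    )
```

Under that assumption all three rows sum to the same number of BUSCO groups, which is a quick sanity check that the rows were run against the same lineage dataset.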

It seems to me that StringTie performs better than Braker3, even though Braker3 combines both RNAseq and protein evidence, and MiniBUSCO does not miss any of the BUSCOs.

May I ask whether this is because Braker3 has filtered out some of the single-exon genes? And would integrating miniprot into Braker3 give better performance?

Best regards, Bitao

KatharinaHoff commented 1 year ago

Dear Bitao,

We are aware of the problem that, by dropping unsupported gene structures with TSEBRA, we may lose some true genes. It's the price that we pay for specificity.

Single-exon genes without support are only dropped by TSEBRA in large genomes (probably not the case for many insects; I'm not sure how big your genome is, and some of them are of course large due to TEs).

Another reason is that some BUSCOs are located in softmasked regions of the genome. Augustus and GeneMark will not predict them.

A possible approach that would be very easy to implement in your particular scenario is to run TSEBRA so that it enforces the StringTie gene set, and to add the BRAKER gene set and the BRAKER evidence on top. You already know that your StringTie set is amazing, but on its own it will lack the ab initio predictions and the predictions of genes with partial coverage.
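A rough sketch of what such a TSEBRA call could look like, written as a small Python helper that only assembles the command line. Everything here is an assumption to check against `tsebra.py --help`: the short options `-g` (gene sets to combine), `-k` (gene sets whose transcripts are all kept), `-e` (hint files) and `-o` (output) are quoted from memory of the TSEBRA README, the file names are placeholders, and `stringtie_cds.gtf` stands for a hypothetical StringTie assembly that has already been given CDS features by a separate step, since StringTie itself reports transcripts and exons only.

```python
import shlex


def build_tsebra_command(braker_gtf, enforced_gtf, hints_gff, out_gtf,
                         tsebra="tsebra.py"):
    """Assemble (but do not run) a TSEBRA call that enforces one gene set.

    Assumed option meanings, to be verified with `tsebra.py --help`:
      -g  gene sets that TSEBRA filters and combines
      -k  gene sets whose transcripts are all kept in the output
      -e  extrinsic evidence (hints) used for scoring
      -o  output GTF
    """
    return [
        tsebra,
        "-g", braker_gtf,
        "-k", enforced_gtf,
        "-e", hints_gff,
        "-o", out_gtf,
    ]


# Placeholder paths; adjust to your own BRAKER working directory.
cmd = build_tsebra_command(
    braker_gtf="braker/braker.gtf",
    enforced_gtf="stringtie_cds.gtf",
    hints_gff="braker/hintsfile.gff",
    out_gtf="tsebra_enforced.gtf",
)
print(shlex.join(cmd))
```

Building the argument list separately makes it easy to inspect the command before handing it to `subprocess.run(cmd, check=True)`.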

Tomas, Lars and I have been discussing adding "missing BUSCOs" automatically in BRAKER and GALBA. It's not a nice idea from the accuracy assessment point of view (it's like cheating), but a lot of users would likely appreciate not having to do that manually, and the renamed miniBUSCO is so nicely efficient... It's on our to-do list.

Best,

Katharina

ckeeling commented 1 year ago

Not really the point of this thread, but I was wondering why your (miniBUSCO) compleasm result has so few duplicated BUSCOs compared to BRAKER3 and StringTie2 @BitaoQiu? Do the first two results include splice variants in the protein predictions examined, and compleasm, working on the genome, doesn't consider this? I think that it has been discussed on the BUSCO site before that BUSCO is better at finding BUSCO genes than more general annotators (BRAKER, MAKER, etc.) are at finding the BUSCO genes, but these are better at finding all the genes.

BitaoQiu commented 1 year ago

> Not really the point of this thread, but I was wondering why your (miniBUSCO) compleasm result has so few duplicated BUSCOs compared to BRAKER3 and StringTie2 @BitaoQiu? Do the first two results include splice variants in the protein predictions examined, and compleasm, working on the genome, doesn't consider this? I think that it has been discussed on the BUSCO site before that BUSCO is better at finding BUSCO genes than more general annotators (BRAKER, MAKER, etc.) are at finding the BUSCO genes, but these are better at finding all the genes.

Yes, you are right that the first two were run in the transcriptome mode of BUSCO, whereas compleasm was run on the genome. This explains the high duplication rate: the first two sets contain splice variants.
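One common way to take splice variants out of the duplication count is to keep a single isoform per gene before running BUSCO. The sketch below keeps the longest protein per gene from FASTA text. It assumes transcript IDs of the form `<gene>.<isoform>` (for example `g1.t1` or `STRG.1.2`), so the gene ID is everything before the last dot; the function name and the toy sequences are made up for illustration.

```python
def longest_isoform_per_gene(fasta_text):
    """Return {gene_id: (transcript_id, sequence)} with the longest protein per gene.

    Assumption: the first word of each header is a transcript ID whose gene ID
    is everything before the last dot (e.g. "g1.t2" belongs to gene "g1").
    """
    best = {}
    header, chunks = None, []

    def flush():
        # Store the record collected so far if it beats the current best.
        if header is None:
            return
        seq = "".join(chunks)
        gene = header.rsplit(".", 1)[0]
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)

    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()
            header, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    flush()
    return best


# Toy input: gene g1 has two isoforms, gene g2 has one.
example = ">g1.t1\nMKT\n>g1.t2\nMKTAYL\n>g2.t1\nMA\n"
result = longest_isoform_per_gene(example)
for gene, (tx, seq) in result.items():
    print(gene, tx, len(seq))
```

Running BUSCO on the reduced set should give a duplication figure that is closer to what a genome-mode run reports, although real gene duplications will of course still be counted.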

You may be right that BUSCO is better at finding BUSCO genes. I think what compleasm shows here is that miniprot is a good tool for finding homologous genes.

Best, Bitao

BitaoQiu commented 1 year ago

@KatharinaHoff Thank you for the advice!

shelkmike commented 1 year ago

@KatharinaHoff

> Tomas, Lars and I have been discussing adding "missing BUSCOs" automatically

Please, don't do this. A very good method of annotation quality control is to compare BUSCO completeness for: 1) The genome. 2) Proteins from the annotation.

If completeness for the second is much lower than for the first, this indicates low annotation quality. Directly adding BUSCO genes to the annotation will ruin this quality control method.
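As a minimal sketch of this check, the Python snippet below compares complete-BUSCO percentages for the genome and for the annotated proteins and flags a large gap. The 5-point threshold is an arbitrary placeholder, not a recommended value. The example numbers come from the first post in this thread (1345 complete for Braker 3 proteins; 1366 complete for the genome, derived as 1330 + 36 from the MiniBUSCO row, out of 1367 groups), with the caveat that those two figures were produced by different tools.

```python
def completeness_pct(complete, total):
    """Percentage of BUSCO groups found complete (single-copy + duplicated)."""
    return 100.0 * complete / total


def annotation_gap(genome_complete, proteins_complete, total, max_gap_pct=5.0):
    """Return (gap in percentage points, whether the gap exceeds the threshold).

    max_gap_pct is a hypothetical cutoff chosen for illustration only.
    """
    gap = completeness_pct(genome_complete, total) - completeness_pct(
        proteins_complete, total
    )
    return gap, gap > max_gap_pct


# Numbers taken or derived from the first post in this thread.
gap, flagged = annotation_gap(genome_complete=1366, proteins_complete=1345, total=1367)
print(f"gap = {gap:.2f} percentage points, flagged = {flagged}")
```

If BUSCO genes found in the genome were copied straight into the annotation, the gap would shrink towards zero regardless of how well the rest of the genes were predicted, which is why the check would stop being informative.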

KatharinaHoff commented 10 months ago

I will document the implications for accuracy assessment in the README file before merging the compleasm branch to master.

KatharinaHoff commented 8 months ago

This has been presented at PAG, it's implemented and documented.