Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

low BUSCO with Braker #470

Closed Tkastylevsky closed 1 year ago

Tkastylevsky commented 2 years ago

I took the liberty of opening a new issue, since my problem and Yulong's don't seem to have the same answer after all. The two commands I used to run BRAKER1 and BRAKER2, respectively, are:

braker.pl --genome=genome/genome.fa --species=sp1 --bam=star/sp1_out/Aligned.sortedByCoord.out.bam --workingdir=out_braker --cores 22 --softmasking

braker.pl --genome=genome/genome.fa --prot_seq=prot/proteins.fasta --workingdir=res_braker --cores=48 --softmasking --epmode

I also tried TSEBRA with the default config file. It gave me a BUSCO score (64%) intermediate between those of my RNA-seq and protein runs.

yuzhenpeng commented 2 years ago

I have the same problem when I run BRAKER twice (once with RNA-seq only and once with proteins only).

But when I ran BRAKER2 just once, the BUSCO score was higher, at about 90%. My command was:

braker.pl --genome genome.fa --prot_seq prot.fa --prg gth --bam RNAseq.bam --gth2traingenes --softmasking --cores 48 --gff3

Could someone help me? Thanks.

Zhenpeng

Tkastylevsky commented 2 years ago

Hello, as a follow-up: I ran BRAKER in etpmode on the same data that I had used separately for the protein and RNA-seq runs. Here is the command I used:

braker.pl --genome=genome/genome.fa --prot_seq=prot/proteins.fasta --workingdir=res_braker_etp --cores=48 --softmasking --etpmode --bam=star/Aligned.sortedByCoord.out.bam

This time I obtained a higher BUSCO score on a specific ortholog set (77%) than in either of my previous runs or the TSEBRA merge. This is satisfactory for the analysis I intend to do, even if around 15% of BUSCOs are missed (a BUSCO run directly on my genome returned 93% complete or duplicated BUSCOs). Best, Timothee
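
The comparison above (genome-level BUSCO versus BUSCO on the predicted proteins) can be scripted. The following is a minimal sketch, assuming the one-line score string that BUSCO prints in its short summary has the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`. The function names `parse_busco_summary` and `completeness_gap` are hypothetical helpers, not part of BUSCO or BRAKER.

```python
import re

# Assumed format of the BUSCO one-line score string:
#   C:..%[S:..%,D:..%],F:..%,M:..%,n:..
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(text):
    """Return the C/S/D/F/M percentages and n from a BUSCO summary text."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO score line found")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["n"] = int(scores["n"])  # n is a count, not a percentage
    return scores


def completeness_gap(genome_scores, annotation_scores):
    """Percentage points of complete BUSCOs lost between assembly and annotation."""
    return round(genome_scores["C"] - annotation_scores["C"], 1)
```

Passing the contents of the two short summary files to `parse_busco_summary` and then calling `completeness_gap` gives the difference in percentage points, which makes runs with different inputs easier to compare side by side.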

KatharinaHoff commented 2 years ago

It is an interesting question why BUSCO finds certain BUSCOs at the genome level that BRAKER2 does not predict as proteins.

We all have different input data sets. I am looking at a small non-model genome from chlorophytes. At the genome level, I see 45 missing BUSCOs; at the BRAKER2 protein level, I see 183 missing BUSCOs. 4 BUSCOs are found only in the BRAKER2 protein set (i.e. MetaEuk did not find them at the genome level). (I am running BRAKER2 with proteins only.)
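
The kind of set comparison described above can be sketched as follows. This assumes that each BUSCO run writes a list of missing BUSCO IDs (for example a `missing_busco_list.tsv` file) with one ID per line and `#` comment lines. The helper names are hypothetical.

```python
from pathlib import Path


def read_busco_ids(path):
    """Read BUSCO IDs from a list file, skipping blank lines and '#' comments."""
    ids = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line.split("\t")[0])  # keep only the ID column
    return ids


def compare_missing(genome_missing, protein_missing):
    """Split missing BUSCO IDs by where they are missing."""
    return {
        # Missing from the assembly and from the predicted proteins.
        "missing_in_both": genome_missing & protein_missing,
        # Found on the genome but not predicted: candidates for inspection.
        "missing_only_in_proteins": protein_missing - genome_missing,
        # Found only in the predicted protein set.
        "missing_only_in_genome": genome_missing - protein_missing,
    }
```

The IDs in `missing_only_in_proteins` are the ones worth looking up in a genome browser, since the genome-level run located them but the gene prediction did not recover them.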

I have added the "missing genomic" BUSCO/MetaEuk predictions to my assembly hub for visualization. I have had a look at ~20 of the examples. I observe the following:

- AUGUSTUS is not good at predicting extremely short genes (e.g. 58 bp: not possible).
- Repeat masking may be an issue. I ran RepeatModeler2 to generate a species-specific library, and the genome is potentially "overmasked". Without evidence, AUGUSTUS cannot predict a protein-coding gene that lies entirely within a repeat-masked region.
- For some BUSCOs, we apparently fail to generate seeds for the OrthoDB protein-to-genome alignments, which is why there is no evidence from ProtHint.

With BRAKER, we are currently not trying to maximize BUSCO recovery. This is just a list of reasons why the BUSCO scores between assembly and BRAKER predicted proteins can differ.
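
To check whether a genome might be overmasked, one can measure how much of it is masked. The following is a minimal sketch, assuming the usual FASTA conventions: lowercase bases mark soft-masked sequence and `N` marks hard-masked or unknown sequence. The function name `masking_stats` is hypothetical, and no threshold for "overmasked" is implied.

```python
def masking_stats(fasta_lines):
    """Count soft-masked (lowercase) and hard-masked (N) bases in a FASTA file.

    fasta_lines is any iterable of lines, e.g. an open file handle.
    """
    total = soft = hard = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip sequence headers
        seq = line.strip()
        total += len(seq)
        for base in seq:
            if base in "Nn":
                hard += 1  # N is counted as hard-masked, whatever its case
            elif base.islower():
                soft += 1
    return {
        "total_bases": total,
        "soft_masked_fraction": soft / total if total else 0.0,
        "hard_masked_fraction": hard / total if total else 0.0,
    }
```

For example, `with open("genome/genome.fa") as handle: print(masking_stats(handle))` reports the fractions for the genome file used in the commands earlier in this thread. A large hard-masked fraction would also show whether a file that is meant to be soft-masked was in fact hard-masked.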

daniazi commented 2 years ago

Thanks for the explanation @KatharinaHoff. Does this mean one should not use a masked genome, for the same reasons? I got 92% (96% Single Copy) and 86% (15% SC) from the BUSCO analysis of the genome assembly and the de novo transcriptome (assembled with Trinity), respectively, but the BRAKER annotation (with a masked genome) gave only 28% (24% SC). So more single-copy genes were found if we compare the transcriptomes. I will test a few more rounds without a masked genome, as well as with combined input as suggested above.

Tkastylevsky commented 2 years ago

Thank you for your answer! I experienced the same thing with masking on a mammalian genome. I ran BRAKER on both the masked and the unmasked genome and gained more than 10% BUSCO on the unmasked genome. I think the step most affected by the masking was the RNA-seq mapping with STAR, which was done on a hard-masked genome. This is quite concerning with regard to repeat masking in non-model organisms.

V-JJ commented 2 years ago

Hello!

But, in principle, shouldn't the RNA-seq mapping with STAR be performed on an unmasked or soft-masked genome, as suggested in #241?

By the way, we ran into the same problem with a non-model, highly repetitive genome, and we are now trying to compare the results of the new (recommended) pipeline with TSEBRA (default config settings) against the previous pipeline (where both the RNA-seq BAM file and the proteins are used as input for BRAKER).

Regarding the quality of structural annotation @KatharinaHoff and taking into account your above comments, do you think that the BUSCO values (mainly complete and missing) are enough to decide which pipeline results in better overall quality?

Thanks in advance,

KatharinaHoff commented 2 years ago

You should always visually inspect structural genome annotation in a Genome Browser in context with evidence.

V-JJ commented 2 years ago

> You should always visually inspect structural genome annotation in a Genome Browser in context with evidence.

Understood, many thanks!

smallfishcui commented 2 years ago

Hi, has this problem been solved? I am using v2.1.6 with RNA-seq as evidence for BRAKER, but I only get 47.2% of the BUSCOs. My genome has 98.8% complete BUSCOs, and gene prediction based on homology yielded 93.9% of BUSCOs. I used earlier versions of BRAKER2 and they all worked fine. Should I use an earlier version instead?

thanks, Cui

smallfishcui commented 2 years ago

I also ran BRAKER with protein evidence plus RNA-seq evidence, and the BUSCO score did not get significantly higher: it went from 47.2% to 51.3%. Any suggestions?

thanks, Cui

smallfishcui commented 2 years ago

Hi again,

Just to let you know that my problem is solved. It wasn't about the version of BRAKER2, but about the amount of RNA-seq data. I obtained the BRAKER results above when I used only one RNA-seq sample in my analysis. After I increased the number of samples to three, the BUSCO score increased from 51.3% to 84.7%.

best, Cui
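
For readers who want to pass several RNA-seq samples in one run, here is a minimal sketch that assembles the command line. It assumes that `braker.pl` accepts several BAM files as a comma-separated list in `--bam`, which should be checked against `braker.pl --help` for the installed version. The function name and the file names in the usage note are hypothetical; the other options are the ones already used in the commands in this thread.

```python
def braker_rnaseq_command(genome, bam_files, workingdir, cores):
    """Build a braker.pl argument list for an RNA-seq run with several BAM files.

    Assumption: --bam takes a comma-separated list of files.
    """
    if not bam_files:
        raise ValueError("at least one BAM file is required")
    return [
        "braker.pl",
        f"--genome={genome}",
        f"--bam={','.join(bam_files)}",
        f"--workingdir={workingdir}",
        f"--cores={cores}",
        "--softmasking",
    ]
```

For example, `braker_rnaseq_command("genome/genome.fa", ["s1.bam", "s2.bam", "s3.bam"], "out_braker", 22)` returns a list that can be handed to `subprocess.run`. Building the list first makes it easy to print and review the command before starting a long run.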