Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

The potential repeat sequences bias of BRAKER3 #747

Open leon945945 opened 5 months ago

leon945945 commented 5 months ago

Hi, I used BRAKER3 to annotate my two phased haplotypes. I annotated each haplotype twice, once with a hard-masked genome and once with a soft-masked genome.

Results:

- hard-masked hap1: 22396
- hard-masked hap2: 23017
- soft-masked hap1: 24072
- soft-masked hap2: 27374

As we can see, the gene numbers of the two haplotypes were comparable with the hard-masked genomes, but with the soft-masked genomes hap2 was annotated with over 3000 more genes than hap1.
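To make the comparison concrete, here is a small sketch that computes the hap2 minus hap1 gap for each masking mode, plus the number of genes each haplotype gains when switching from hard to soft masking. The counts are the ones reported above; the dictionary layout and the function name `haplotype_gap` are only illustrative.

```python
# Gene counts reported in this issue (BRAKER3 predictions per haplotype).
counts = {
    "hard": {"hap1": 22396, "hap2": 23017},
    "soft": {"hap1": 24072, "hap2": 27374},
}


def haplotype_gap(masking):
    """Return (absolute, relative-to-hap1) difference hap2 - hap1."""
    hap1 = counts[masking]["hap1"]
    hap2 = counts[masking]["hap2"]
    diff = hap2 - hap1
    return diff, diff / hap1


for masking in ("hard", "soft"):
    diff, rel = haplotype_gap(masking)
    print(f"{masking}-masked: hap2 - hap1 = {diff} genes ({rel:.1%} of hap1)")
# hard-masked: hap2 - hap1 = 621 genes (2.8% of hap1)
# soft-masked: hap2 - hap1 = 3302 genes (13.7% of hap1)

# Genes gained per haplotype when switching from hard to soft masking.
gain = {hap: counts["soft"][hap] - counts["hard"][hap] for hap in ("hap1", "hap2")}
print(gain)  # {'hap1': 1676, 'hap2': 4357}
```

So the gap between the haplotypes grows from 621 to 3302 genes, and hap2 gains 4357 genes from soft masking while hap1 gains 1676.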

Do these results show that BRAKER3 has a repeat-sequence bias, i.e. that differences in repeat content between the two haplotypes make BRAKER3 perform differently on them?

KatharinaHoff commented 4 months ago

If I recall correctly, GeneMark-ETP internally computes a "repeat penalty", and I think that only works properly with softmasking. If the genome is hard-masked, the N sequences will probably simply be ignored. @alexlomsadze may correct me on that.

For AUGUSTUS, softmasking makes it possible to extend a gene structure from an unmasked region into a masked region, which is not possible when the repeats are hard-masked. So softmasking is usually an advantage compared to hard masking here. This is reflected in the numbers that you observe.
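One way to check whether the two haplotypes actually differ in repeat content is to measure how much of each assembly is masked. The sketch below is not part of BRAKER; it assumes the usual convention that soft-masked bases are lowercase and hard-masked bases are `N`, and the function name `genome_masking_fractions` is made up for illustration.

```python
def genome_masking_fractions(fasta_lines):
    """Return (soft_fraction, hard_fraction) pooled over all FASTA records.

    fasta_lines is any iterable of lines, e.g. an open file handle.
    Soft-masked = lowercase bases (except n); hard-masked = N or n.
    """
    total = soft = hard = 0
    for line in fasta_lines:
        if line.startswith(">"):  # skip record headers
            continue
        for base in line.strip():
            total += 1
            if base in "Nn":
                hard += 1
            elif base.islower():
                soft += 1
    if total == 0:
        return 0.0, 0.0
    return soft / total, hard / total


# Toy records: the same 16 bp sequence, soft-masked vs. hard-masked.
soft_masked = [">chr1", "ACGTacgt", "ACGTACGT"]
hard_masked = [">chr1", "ACGTNNNN", "ACGTACGT"]

print(genome_masking_fractions(soft_masked))  # (0.25, 0.0)
print(genome_masking_fractions(hard_masked))  # (0.0, 0.25)
```

Running this on the soft-masked FASTA of each haplotype (passing the open file handle) would show whether hap2 carries a noticeably larger masked fraction than hap1.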