GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins
Tomas Bruna, Alexandre Lomsadze, Mark Borodovsky
NAR Genomics and Bioinformatics, June 2022
Georgia Institute of Technology, Atlanta, Georgia, USA
GeneMark-EP+ is a semi-supervised eukaryotic gene prediction tool which utilizes protein hints to improve unsupervised parameter estimation and predictions.
Protein hints are generated by ProtHint, a fast protein mapping pipeline that predicts and scores introns and start and stop codons in the genome of interest, using any number of proteins at unknown evolutionary distance (we recommend using all eukaryotic proteins in OrthoDB).
Because it is semi-supervised and can incorporate proteins at any evolutionary distance, GeneMark-EP+ is well suited to predicting genes in a novel genome without a curated training set or a set of closely related proteins.
First, run ProtHint to generate protein hints (see the ProtHint repository for details on installation and usage):
prothint.py genome.masked.fasta proteins.fasta --workdir ProtHintDir
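Before moving on, it can be useful to check what kinds of hints ProtHint produced. The sketch below assumes only that prothint.gff follows the standard tab-separated GFF layout, with the feature type in the third column. The function name count_hint_types is a hypothetical helper and is not part of ProtHint.

```shell
#!/bin/sh
# count_hint_types: hypothetical helper, not part of ProtHint.
# Prints "<feature type><TAB><count>" for each feature type in a GFF file,
# assuming the standard layout with the feature type in column 3.
count_hint_types() {
    awk -F'\t' '
        /^#/ || NF < 3 { next }   # skip comment lines and malformed lines
        { n[$3]++ }               # tally each feature type
        END { for (t in n) printf "%s\t%d\n", t, n[t] }
    ' "$1" | LC_ALL=C sort
}

# Usage (path taken from the ProtHint command above):
# count_hint_types ProtHintDir/prothint.gff
```

A file with no hints of an expected type (for example, no introns) is a sign that the protein set or the masking step deserves a second look before GeneMark-EP+ is started.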
Then run GeneMark-EP+ with the hints mapped by ProtHint:
gmes_petap.pl --EP ProtHintDir/prothint.gff --evidence ProtHintDir/evidence.gff --seq genome.masked.fasta --soft_mask 1000 --verbose
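The two steps can be chained so that GeneMark-EP+ only starts if ProtHint succeeded. The following is a minimal sketch rather than an official wrapper: the function names run_step and run_genemark_ep_plus and the DRY_RUN variable are assumptions introduced for illustration, and only the command-line flags shown above are used.

```shell
#!/bin/sh
# Hypothetical wrapper around the two commands above.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_genemark_ep_plus() {
    genome=$1     # repeat-masked genome FASTA
    proteins=$2   # target protein FASTA
    workdir=$3    # ProtHint working directory

    # Step 1: map proteins and generate hints; stop here if ProtHint fails.
    run_step prothint.py "$genome" "$proteins" --workdir "$workdir" || return 1

    # Step 2: run GeneMark-EP+ with the hints produced in step 1.
    run_step gmes_petap.pl --EP "$workdir/prothint.gff" \
        --evidence "$workdir/evidence.gff" \
        --seq "$genome" --soft_mask 1000 --verbose
}

# Usage:
# run_genemark_ep_plus genome.masked.fasta proteins.fasta ProtHintDir
```

Running it once with DRY_RUN=1 prints both commands with the resolved paths, which makes it easy to confirm that the hint files passed to gmes_petap.pl come from the same working directory that ProtHint wrote to.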
Runtime of GeneMark-EP+ is linear with respect to genome size.
ProtHint runtime is linear with respect to both genome size (GeneMark-ES is executed to generate initial genome seeds) and the number of genes in the genome.
Runs were executed on a machine with 8 CPUs and 8 GB of RAM. Genomes were masked for repeats with RepeatModeler and RepeatMasker. Proteins from species within the same taxonomic genus were excluded in these experiments.
Drosophila melanogaster (134 Mb and ~14,000 genes) with OrthoDB Arthropoda target proteins:
Solanum lycopersicum (807 Mb and ~35,000 genes) with OrthoDB Plantae target proteins: