Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Genemark gmes_petap.pl fail, too few introns, one sample not others #543

Open ToriEggers opened 2 years ago

ToriEggers commented 2 years ago

Hi, I have four nematode genome samples, and I'm running BRAKER with GeneMark in epmode to annotate each with protein + genome and with RNA + genome, then combining the two with TSEBRA. Three of the genomes process perfectly fine, but on the fourth I keep running into a problem at the gmes_petap.pl step, no matter the size or evolutionary distance of the protein file I use (I've tried many, from sister species to all Metazoa). Although the protein + genome run fails on this sample, the RNA + genome run completes. When running in esmode, the protein annotation completes. I ran these samples close together in time, so there were no software updates or changes to my environment between samples.

Any idea as to why this one particular sample won't run like the others?

Error in braker.log:

                          RUNNING GENEMARK-EX

Preparing genemark_evidence file hints from manual hints... Checking whether file /home/data/jfierst/veggers/DF5033_BRAKER_odb10/genemark_hintsfile.gff contains enough hints and sufficient multiplicity information...

WARNING: The hints file(s) for GeneMark-EX contain less than 1000 introns. (In total, 6 unique introns are contained.) Genemark-EX might fail due to the low number of hints.

WARNING: The hints file(s) for GeneMark-EX contain less than 150 introns with multiplicity >= 4! (In total, 6 unique introns are contained. 0 have a multiplicity >= 4.) Possibly, you are trying to run braker.pl on data that does not provide sufficient multiplicity information. This will e.g. happen if you try to use introns generated from assembled RNA-Seq transcripts; or if you try to run braker.pl in epmode with mappings from proteins without sufficient hits per locus. Or if you use the example data set. A low number of intron hints with sufficient multiplicity may result in a crash of GeneMark-EX (it should not crash with the example data set).

Running GeneMark-EP

changing into GeneMark-EP directory /home/data/jfierst/veggers/DF5033_BRAKER_odb10/GeneMark-EP

cd /home/data/jfierst/veggers/DF5033_BRAKER_odb10/GeneMark-EP

Running gmes_petap.pl

perl /home/data/jfierst/veggers/gmes_linux_64/gmes_petap.pl --verbose --seq /home/data/jfierst/veggers/DF5033_BRAKER_odb10/genome.fa --EP /home/data/jfierst/veggers/DF5033_BRAKER_odb10/genemark_hintsfile.gff --cores=8 --gc_donor 0.001 --evidence /home/data/jfierst/veggers/DF5033_BRAKER_odb10/genemark_evidence.gff --soft_mask auto 1>/home/data/jfierst/veggers/DF5033_BRAKER_odb10/GeneMark-EP.stdout 2>/home/data/jfierst/veggers/DF5033_BRAKER_odb10/errors/GeneMark-EP.stderr

The GeneMark-EP.stderr file is empty
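
For anyone who wants to reproduce the numbers in the two warnings before launching a full run, here is a minimal Python sketch that counts unique intron hints and how many reach multiplicity >= 4. It is not BRAKER's own check. It assumes the hints file is tab-separated GFF with the feature type in column 3 and an optional `mult=N` entry in column 9, that a hint without `mult=` counts as multiplicity 1, and that repeated lines for the same intron should have their multiplicities summed. The function name `summarize_intron_hints` is made up for this example.

```python
from collections import defaultdict


def summarize_intron_hints(lines, min_mult=4):
    """Return (unique_introns, introns_with_mult_at_least_min_mult).

    Assumptions (not taken from BRAKER's source): tab-separated GFF,
    feature type in column 3, optional "mult=N" in column 9.
    """
    mult_by_intron = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 9 or fields[2].lower() != "intron":
            continue  # only intron hints are counted
        key = (fields[0], int(fields[3]), int(fields[4]), fields[6])
        mult = 1  # assumption: a hint without mult= has multiplicity 1
        for attr in fields[8].split(";"):
            name, _, value = attr.strip().partition("=")
            if name == "mult" and value.isdigit():
                mult = int(value)
        # assumption: duplicate lines for one intron add up
        mult_by_intron[key] += mult
    total = len(mult_by_intron)
    high = sum(1 for m in mult_by_intron.values() if m >= min_mult)
    return total, high
```

Passing an open file handle for genemark_hintsfile.gff to `summarize_intron_hints` returns the two counts as a tuple. If the assumptions above hold, a result in the single digits for the first value, as in the log above, points at the protein alignment step rather than at GeneMark itself.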

tiandavid commented 2 years ago

Just wanted to second this issue and add that I'm running into the same problem. For one genome, protein+genome and RNA+genome both work perfectly as input to TSEBRA, while for the other genome, RNA+genome works but protein+genome does not.

I think the issue has something to do with DIAMOND not properly computing alignments: for the sample that did not work, this step lasted less than a second according to the output and produced only 2 pairwise alignments. I'm not sure why yet!
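
A quick way to confirm an alignment count like this is to tally the rows of the DIAMOND output directly. The sketch below is only an illustration: it assumes the output is in BLAST tabular format (one alignment per line, tab-separated, subject ID in the second column), which may not match what the pipeline actually writes, and the function name `count_alignments` is hypothetical.

```python
from collections import Counter


def count_alignments(lines):
    """Return (total_alignments, Counter of alignments per subject ID).

    Assumption: BLAST-style tabular rows, tab-separated, with the
    subject ID in the second column.
    """
    per_subject = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        cols = line.split("\t")
        if len(cols) < 2:
            continue  # not a tabular alignment row
        per_subject[cols[1]] += 1
    return sum(per_subject.values()), per_subject
```

Running this on the output from a working sample and from the failing one makes the difference easy to compare; a total of 2 on the failing sample would match what I saw.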

romseg commented 1 year ago

Has anyone found a solution to this issue? I am experiencing similar problems with one particular plant genome using proteins only. Thank you.