Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Problem with GeneMark #49

Closed diriano closed 5 years ago

diriano commented 5 years ago

Dear developers,

I am running BRAKER 2.1.2 on a plant genome with around 40K scaffolds, but the job dies during the GeneMark step. GeneMark writes the following error to STDERR:

ERROR in file /Storage/progs/BRAKER-2.1.2/scripts/braker.pl at line 5307 Failed to execute: perl /Storage/progs/gm_et_linux_64_v4.38/gmes_petap/gmes_petap.pl --verbose --seq /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/genome.fa --max_intergenic 50000 --evidence /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/GeneMark-ETP/evidence.gff --et_score 10 --ET /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/genemark_hintsfile.gff --cores=1 --soft_mask 1000 1>/Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/GeneMark-ETP.stdout 2>/Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/errors/GeneMark-ETP.stderr

braker.log does not report any error. GeneMark-ETP.stdout has the following:

check before run
create directories
commit input data
data report
commit training data
training data report
prepare initial model
get GC of sequence
GC 36
build initial ET model
running step ET_A
running gm.hmm on local system
3 contigs in training
concatenate predictions: /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/GeneMark-ETP/run/ET_A_1
training level ET_A: /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/GeneMark-ETP/run/ET_A_1
From 261 loaded 232 and ignored dublications 29
exon no_match match_one match_two
Initial 7 5 0
Internal 8 11 28
Terminal 9 6 0
Single 4 0 0
CDS_no_match all short long seq_short seq_long
CDS_no_match 28 20 8 5927 13691
Intergenic all between_match seq_match
Intergenic: 18 2 2893
error, no valid sequences were found
error on call: /Storage/progs/gm_et_linux_64_v4.38/gmes_petap/make_nt_freq_mat.pl --cfg /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/GeneMark-ETP/run.cfg --section stop_TAG --format TERM_TAG

I am running BRAKER as follows:

braker.pl --etpmode --softmasking --species=$SP --genome=$ASSEMBLY --bam=${SP}.scf_gt1000bp.sorted.bam --hints=prot_hintsfile.aln2hints.gff --cores=$NSLOTS --AUGUSTUS_CONFIG_PATH=AugustusCONFIG --AUGUSTUS_BIN_PATH=/Storage/progs/Augustus-3.3.1-tag1/bin --AUGUSTUS_SCRIPTS_PATH=/Storage/progs/Augustus-3.3.1-tag1/scripts

prot_hintsfile.aln2hints.gff contains hits to a related species (same family), generated with GenomeThreader in a previous BRAKER run on the unmasked genome. The genome has been softmasked using RepeatModeler/RepeatMasker.

Any suggestion on how to carry on with genome annotation is greatly appreciated. Thanks, Diego

KatharinaHoff commented 5 years ago

Please contact the GeneMark developers about this issue. I don't think it is a BRAKER issue.

tomasbruna commented 5 years ago

Hello, GeneMark developer here. The problem might be small contig size: by default, only contigs longer than 50k are used in GeneMark-ETP training. In your genome, this filters out almost all contigs, and GeneMark fails for lack of training data.

Try setting BRAKER's min_contig option to a smaller number, for example: --min_contig=10000.

I believe this option is only available in the development version, so you will need to clone the latest version of this repository.
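Before rerunning, it may help to check how many sequences in the genome actually pass a given length cutoff (the log above reports only "3 contigs in training"). The following is a minimal sketch in Python, assuming a plain, uncompressed FASTA file. The function names `contig_lengths` and `summarize` are made up for illustration and are not part of BRAKER or GeneMark, and treating the cutoff as inclusive is an assumption.

```python
import io


def contig_lengths(handle):
    """Yield (name, length) for each record in a FASTA text stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def summarize(handle, thresholds=(50000, 10000)):
    """Map each threshold to (number of contigs, total bases) at or above it.

    Assumption: the cutoff is treated as inclusive (>=); GeneMark's exact
    comparison is not stated in this thread.
    """
    lengths = [n for _, n in contig_lengths(handle)]
    return {
        t: (sum(1 for n in lengths if n >= t), sum(n for n in lengths if n >= t))
        for t in thresholds
    }


# Tiny in-memory demo with toy thresholds; a real genome would be read from disk.
demo = io.StringIO(">scf1\nACGTACGT\n>scf2\nACG\n")
print(summarize(demo, thresholds=(5, 3)))
```

On a real assembly, something like `with open("genome.fa") as fh: print(summarize(fh))` would show how many contigs, and how much sequence, survive at 50000 versus 10000. If the count at the smaller threshold is still very low, lowering `--min_contig` alone may not give GeneMark enough training data.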

Best, Tomas

KatharinaHoff commented 5 years ago

Thank you for that comment, @tomasbruna ! I now mention this in the README "Common problems" section (https://github.com/Gaius-Augustus/BRAKER/commit/1e8ba0735f11b1f6d591acebc552b95b0e3ff381).

Closing the issue, now.

Katharina