Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Question about a "Segmentation fault" #361

Closed: chodarq closed this issue 3 years ago

chodarq commented 3 years ago

Hi. I'm running BRAKER on a non-model organism. I have several assembled and masked scaffolds, but no RNA-Seq data, proteins, or any other sequences for this species. So I decided to use a set of BUSCO proteins and ran this command:

braker.pl --genome=C_horridus_ref.fna.mask --species=Choridus --prot_seq=busco.fa --epmode --gff3 --cores=1

After a few hours, I received this output:

[Thu Apr 22 06:26:18 2021] Enqueueing pair 6054/6060 (99.9%). Est. time left: 00:00:02 (hh:mm:ss)
/home/chodar/Descargas/ProtHint-2.6.0/bin/spalnBatch.sh: line 61: 22013 Done "$binDir/../dependencies/spaln" $mode -LS -pw -S1 -O1 -l $alignmentLength "$nuc" "$prot" 2> /dev/null
22014 Segmentation fault (core dumped) | "$binDir/../dependencies/spaln_boundary_scorer" -o "${nuc}_${prot}" -w 10 -s "$binDir/../dependencies/blosum62.csv" -e $min_exon_score
[Thu Apr 22 06:27:21 2021] 6060/6060 (100%) pairs aligned
[Thu Apr 22 06:27:21 2021] Alignment of pairs finished
[Thu Apr 22 06:27:21 2021] Translating coordinates from local pair level to contig level
[Thu Apr 22 06:27:22 2021] Finished spliced alignment
[Thu Apr 22 06:27:22 2021] Flagging top chains
[Thu Apr 22 06:27:22 2021] Processing the output
[Thu Apr 22 06:27:25 2021] Output processed
[Thu Apr 22 06:27:25 2021] ProtHint finished.

And a warning:

The hints file(s) for GeneMark-EX contain less than 150 introns with multiplicity >= 4! (In total, 7330 unique introns are contained. 113 have a multiplicity >= 4.) Possibly, you are trying to run braker.pl on data that does not provide sufficient multiplicity information. This will e.g. happen if you try to use introns generated from assembled RNA-Seq transcripts; or if you try to run braker.pl in epmode with mappings from proteins without sufficient hits per locus. Or if you use the example data set. A low number of intron hints with sufficient multiplicity may result in a crash of GeneMark-EX (it should not crash with the example data set).

As far as I can see, the process is still running, but I wonder whether the segmentation fault affects the result in some way. Thanks in advance.

tomasbruna commented 3 years ago

Hello,

The segmentation fault can be safely ignored. ProtHint is designed to tolerate these faults, which occasionally occur during pairwise alignment. If there was only one such segfault message, then only 1 of the 6060 alignments was affected.
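To put a number on how many alignments were affected, one could count the segfault notices in a saved copy of the ProtHint log and compare them with the total number of pairs. The following is only a minimal sketch: the function name `segfault_fraction` is hypothetical, and it assumes that each failed pair produces exactly one shell notice and that the log contains the `N/N (100%) pairs aligned` summary line shown in the question.

```python
import re

# bash prints the notice in the shell's locale, so accept English and Spanish.
SEGFAULT = re.compile(r"Segmentation fault|Violación de segmento")
# ProtHint's final progress line, e.g. "6060/6060 (100%) pairs aligned".
ALIGNED = re.compile(r"(\d+)/(\d+) \(100%\) pairs aligned")


def segfault_fraction(log_text):
    """Return (segfault_count, total_pairs, fraction) for a ProtHint log.

    Assumption: one segfault notice corresponds to one failed pair.
    """
    segfaults = len(SEGFAULT.findall(log_text))
    match = ALIGNED.search(log_text)
    if match is None:
        raise ValueError("no 'pairs aligned' summary line found")
    total = int(match.group(2))
    return segfaults, total, segfaults / total
```

Calling `segfault_fraction(open("prothint.log").read())` on a saved log (the file name is an assumption) gives the count and the fraction; a single notice among 6060 pairs is a negligible share of the alignments.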

The warning about multiplicity is caused by using the BUSCO proteins. BRAKER2 is designed to extract information from a large number of proteins of any evolutionary distance, ideally from multiple species (and even remotely related species help). The BUSCO protein set, on the other hand, contains only a relatively small number of the most conserved proteins.
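To check how a hints file compares with the threshold named in the warning (150 introns with multiplicity >= 4), the intron hints can be tallied directly. This is a rough sketch and not BRAKER's own check: the function name `count_introns` is hypothetical, and it assumes a tab-separated GFF hints file in which each intron line is one unique intron, the multiplicity is stored as `mult=N` in the ninth column, and a missing `mult` attribute means a multiplicity of 1.

```python
import re

# Assumed attribute layout: "mult=5;pri=4;src=P" in the ninth GFF column.
MULT = re.compile(r"(?:^|;)\s*mult=(\d+)")


def count_introns(gff_lines, min_mult=4):
    """Return (total_introns, introns_with_multiplicity_at_least_min_mult)."""
    total = 0
    supported = 0
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2].lower() != "intron":
            continue  # only intron hints matter for this warning
        total += 1
        found = MULT.search(cols[8])
        mult = int(found.group(1)) if found else 1  # assumption: no mult means 1
        if mult >= min_mult:
            supported += 1
    return total, supported
```

With the numbers from the warning in the question, such a tally would report 7330 introns in total and 113 at multiplicity 4 or higher, which is below the 150 the warning asks for. A larger, more diverse protein set should raise the second number.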

Please try using a relevant section of the OrthoDB protein database; the instructions can be found at https://github.com/gatech-genemark/ProtHint#protein-database-preparation. The Vertebrata section will work best for you.

Best, Tomas

chodarq commented 3 years ago

Great! Thanks very much. Best, Christian.