Gaius-Augustus / Augustus

Genome annotation with AUGUSTUS
http://bioinf.uni-greifswald.de/webaugustus/

chromosomes longer than 2.1 GB lead to crash #353

Open MarioStanke opened 2 years ago

MarioStanke commented 2 years ago

Apparently, this is the result of a 4-byte int being unable to hold positions of 2^31 or larger. Error message:

examining piece 1..-1928087540 (-1928087540 bp)
terminate called after throwing an instance of 'std::bad_alloc'
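
The negative length in this message is consistent with a 32-bit wrap-around: -1928087540 + 2^32 = 2366879756, so a sequence of roughly 2.37 Gbp would be printed this way if its length were stored in a 32-bit signed int. That sequence length is inferred from the log, not stated in the report. A minimal C++ sketch of the truncation follows; asInt32 is a hypothetical helper for illustration and not part of AUGUSTUS.

```cpp
#include <cstdint>

// Hypothetical helper (not AUGUSTUS code): the value a length takes after
// being truncated to a 32-bit two's-complement signed integer.
// The wrap is computed by hand in 64 bits, so no narrowing cast is needed.
constexpr std::int64_t asInt32(std::uint64_t len) {
    return (len & 0xFFFFFFFFull) >= 0x80000000ull
        ? static_cast<std::int64_t>(len & 0xFFFFFFFFull) - 4294967296ll
        : static_cast<std::int64_t>(len & 0xFFFFFFFFull);
}
```

Under that assumption, asInt32(2366879756ull) evaluates to -1928087540, the length shown in the error message, while any length up to 2^31 - 1 passes through unchanged.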

MarioStanke commented 2 years ago

Apparently, this is not completely solved, at least when predictions are requested on the complete chromosome in one run (rather than using --predictionStart and --predictionEnd).

$AUGUSTUS --species=rice --softmasking=0 --protein=on --codingseq=on --progress=true --gff3=on --alternatives-from-evidence=false --alternatives-from-sampling=false --extrinsicCfgFile=$EXCFFILE $GENOME_PART

leads to a segmentation fault after ~10k minutes of compute time.

examining piece 2147286171..-2147481126 
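
In this line the start still fits into 32 bits, but the end does not: -2147481126 + 2^32 = 2147486170, which would make the piece 200000 bp long and place its end just above 2^31 - 1 = 2147483647. This reading assumes the log prints a wrapped 32-bit value; the step of 199999 below is derived from that arithmetic and is not taken from the AUGUSTUS source. A sketch of how such a sum could be guarded or widened, using hypothetical helper names:

```cpp
#include <cstdint>
#include <limits>

// Hypothetical guard (not AUGUSTUS code): true if begin + step can be
// represented in a 32-bit signed int. Assumes step >= 0. The comparison
// is rearranged so that the check itself never overflows.
constexpr bool endFitsInInt32(std::int32_t begin, std::int32_t step) {
    return begin <= std::numeric_limits<std::int32_t>::max() - step;
}

// The same sum carried out in 64 bits, where it cannot wrap for these inputs.
constexpr std::int64_t endPos64(std::int32_t begin, std::int32_t step) {
    return static_cast<std::int64_t>(begin) + step;
}
```

For the values above, endFitsInInt32(2147286171, 199999) is false and endPos64(2147286171, 199999) gives 2147486170.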

piroyon commented 2 years ago

How about changing the type of beginPos, endPos, seqlen, restlen and the return value of getNextCutEndPoint from int to long in namgene.cc?

diff namgene.cc namgene.cc.org 
536,537c536,537
<   long endPos, beginPos;
<   long seqlen = strlen(dna);
---
>   int endPos, beginPos;
>   int seqlen = strlen(dna);
972,973c972,973
< long NAMGene::getNextCutEndPoint(const char *dna, long beginPos, int maxstep, SequenceFeatureCollection& sfc){
<   long restlen = strlen(dna+beginPos);
---
> int NAMGene::getNextCutEndPoint(const char *dna, int beginPos, int maxstep, SequenceFeatureCollection& sfc){
>   int restlen = strlen(dna+beginPos);

Using long would increase the memory requirements. I haven't encountered this error, so sorry if it doesn't work.
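
One portability caveat for the suggestion above: long is 64 bits wide on LP64 platforms such as 64-bit Linux, but only 32 bits on 64-bit Windows (LLP64), so a fixed-width type like std::int64_t expresses the intent more directly. As far as the diff shows, the changed variables are locals, a parameter and a return value, so the extra memory from these alone would be a few bytes per call. The sketch below illustrates piece arithmetic in 64 bits. It is a simplification with hypothetical helper names and fixed-size pieces; the real getNextCutEndPoint also takes a SequenceFeatureCollection and is not reproduced here.

```cpp
#include <cstdint>

// Hypothetical helpers (not AUGUSTUS code). Positions are 0-based and
// pieces have a fixed maximum size, which is a simplifying assumption.

// Number of pieces needed to cover seqlen bases with at most maxstep each.
constexpr std::int64_t pieceCount(std::int64_t seqlen, std::int64_t maxstep) {
    return (seqlen <= 0 || maxstep <= 0) ? 0 : (seqlen + maxstep - 1) / maxstep;
}

// First position of the piece with the given index.
constexpr std::int64_t pieceBegin(std::int64_t index, std::int64_t maxstep) {
    return index * maxstep;
}

// Last position of the piece with the given index, clipped to the sequence end.
constexpr std::int64_t pieceEnd(std::int64_t index, std::int64_t seqlen,
                                std::int64_t maxstep) {
    return (index + 1) * maxstep < seqlen ? (index + 1) * maxstep - 1
                                          : seqlen - 1;
}
```

With an illustrative seqlen of 2366879756 and maxstep of 200000, pieceCount gives 11835, and the last piece runs from 2366800000 to 2366879755 without any position turning negative.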