ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

Error: 'std::bad_alloc' #6

Closed. SwapnilDoijad closed this issue 6 years ago.

SwapnilDoijad commented 6 years ago

fastANI ran properly with 100 genomes. However, increasing to 1000 genomes resulted in the following error.


Error details:


$ fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt

Reference = [1.fasta, 2.fasta, ......... 1000.fasta]
Query = [1.fasta, 2.fasta, ......... 1000.fasta]
Kmer size = 16
Fragment length = 3000
ANI output file = /media/network/project_Lm_all/results/43_fastANI/output.txt

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped) fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt


Hardware details


Processor | 8x Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Memory | 65858 MB
Operating System | Ubuntu 16.04.3 LTS

cjain7 commented 6 years ago

std::bad_alloc implies FastANI could not allocate memory when it needed to. For 1000 microbial genomes, I expect the memory usage to be well below 66 GB. Please answer a few follow-up questions here:

  1. What is the total size of all the genomes you have in 1000_genomes.list? I wonder if it is too big.

  2. Can you provide the memory usage of the above run? It can easily be obtained with the /usr/bin/time utility (see also the -v tip after this list):

/usr/bin/time fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt

Please make sure you are not running other memory intensive tasks on your system while doing this.

  3. Another thing you may want to try is splitting the reference list (--rl) 1000_genomes.list into two lists of 500 genomes each, to reduce memory use, and running them one by one as:

fastANI --ql 1000_genomes.list --rl 500_genomes_first.list -o output_1.txt
fastANI --ql 1000_genomes.list --rl 500_genomes_second.list -o output_2.txt
cat output_1.txt output_2.txt > output.txt
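
If your /usr/bin/time is GNU time (the usual case on Ubuntu), the -v flag prints a labeled report that includes the peak memory as "Maximum resident set size", which is easier to read than the default one-line summary:

/usr/bin/time -v fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt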

SwapnilDoijad commented 6 years ago

  1. 1000_genomes.list lists 1000 genomes, each about 3 Mb.

  2. After closing all other programs:

(A) For 1000 genomes:

$ /usr/bin/time fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt
Reference = [1.fasta, 2.fasta, ......... 1000.fasta]
Query = [1.fasta, 2.fasta, ......... 1000.fasta]
Kmer size = 16
Fragment length = 3000
ANI output file = output.txt
INFO, skch::Sketch::build, minimizers picked from reference = 305245397
INFO, skch::Sketch::index, unique minimizers = 7172419
INFO, skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 3440253) ... (529726, 1)
INFO, skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 2858 times during lookup.
INFO, skch::main, Time spent sketching the reference : 286.582 sec
INFO, skch::main, Time spent mapping fragments in query #1 : 412.583 sec
INFO, skch::main, Time spent post mapping : 20.5822 sec
Command terminated by signal 11
648.89user 7.07system 37:54.26elapsed 28%CPU (0avgtext+0avgdata 7770108maxresident)k
0inputs+0outputs (0major+979813minor)pagefaults 0swaps

(B) For 100 genomes, each 3 Mb, the run completed successfully, with:

7700.57user 1.39system 2:08:32elapsed 99%CPU (0avgtext+0avgdata 1661348maxresident)k
0inputs+3360outputs (0major+438435minor)pagefaults 0swaps

  3. Creating a bash loop to split the run was the final solution; it worked very well.
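
For reference, a minimal sketch of that split-and-loop approach (chunk size and file names are illustrative; adjust to your setup):

# split the reference list into chunks of 250 genomes each (ref_chunk_aa, ref_chunk_ab, ...)
split -l 250 1000_genomes.list ref_chunk_
# run fastANI once per chunk against the full query list, then merge the outputs
for chunk in ref_chunk_*; do
    fastANI --ql 1000_genomes.list --rl "$chunk" -o "output_${chunk}.txt"
done
cat output_ref_chunk_*.txt > output.txt
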
cjain7 commented 6 years ago

Thanks for sharing the info. I tried creating a custom dataset of 1000 E. coli genomes at my end but could not reproduce the above issue. Let me know if the data you are using is public.