ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

Error: 'std::bad_alloc' #6

Closed. SwapnilDoijad closed this issue 6 years ago.

SwapnilDoijad commented 6 years ago

fastANI ran properly with 100 genomes. However, increasing to 1000 genomes resulted in the following error.


Error details:


$ fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt

Reference = [1.fasta, 2.fasta, ......... 1000.fasta]
Query = [1.fasta, 2.fasta, ......... 1000.fasta]
Kmer size = 16
Fragment length = 3000
ANI output file = /media/network/project_Lm_all/results/43_fastANI/output.txt

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped) fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt


Hardware details


Processor | 8x Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Memory | 65858 MB
Operating System | Ubuntu 16.04.3 LTS

cjain7 commented 6 years ago

std::bad_alloc implies FastANI could not allocate memory when it needed to. For 1000 microbial genomes, I expect the memory usage to be well below 66 GB. Please answer a few follow-up questions here:

  1. What is the total size of all the genomes you have in 1000_genomes.list? I wonder if it is too big.

  2. Can you provide the memory usage of the above run? It can easily be obtained with the /usr/bin/time utility (see also the -v tip after this list):

/usr/bin/time fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt

Please make sure you are not running other memory intensive tasks on your system while doing this.

  3. Another thing you may want to try is splitting the reference list (--rl) 1000_genomes.list into two lists of 500 genomes each, to reduce memory use, and running them one by one as:

fastANI --ql 1000_genomes.list --rl 500_genomes_first.list -o output_1.txt
fastANI --ql 1000_genomes.list --rl 500_genomes_second.list -o output_2.txt
cat output_1.txt output_2.txt > output.txt
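
If your /usr/bin/time is GNU time (the usual case on Ubuntu), the -v flag prints a labeled report that includes the peak memory as "Maximum resident set size", which is easier to read than the default one-line summary:

/usr/bin/time -v fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt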

SwapnilDoijad commented 6 years ago

  1. 1000_genomes.list lists 1000 genomes, each about 3 Mb.

  2. After closing all other programs:

(A) For 1000 genomes:

$ /usr/bin/time fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt
Reference = [1.fasta, 2.fasta, ......... 1000.fasta]
Query = [1.fasta, 2.fasta, ......... 1000.fasta]
Kmer size = 16
Fragment length = 3000
ANI output file = output.txt
INFO, skch::Sketch::build, minimizers picked from reference = 305245397
INFO, skch::Sketch::index, unique minimizers = 7172419
INFO, skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 3440253) ... (529726, 1)
INFO, skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 2858 times during lookup.
INFO, skch::main, Time spent sketching the reference : 286.582 sec
INFO, skch::main, Time spent mapping fragments in query #1 : 412.583 sec
INFO, skch::main, Time spent post mapping : 20.5822 sec
Command terminated by signal 11
648.89user 7.07system 37:54.26elapsed 28%CPU (0avgtext+0avgdata 7770108maxresident)k
0inputs+0outputs (0major+979813minor)pagefaults 0swaps

(B) For 100 genomes, each 3 Mb, the run completed successfully, with:

7700.57user 1.39system 2:08:32elapsed 99%CPU (0avgtext+0avgdata 1661348maxresident)k
0inputs+3360outputs (0major+438435minor)pagefaults 0swaps

  3. Creating a bash loop to split the run was the final solution; it worked very well.
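
For reference, a minimal sketch of that split-and-loop approach (chunk size and file names are illustrative; adjust to your setup):

# split the reference list into chunks of 250 genomes each (ref_chunk_aa, ref_chunk_ab, ...)
split -l 250 1000_genomes.list ref_chunk_
# run fastANI once per chunk against the full query list, then merge the outputs
for chunk in ref_chunk_*; do
    fastANI --ql 1000_genomes.list --rl "$chunk" -o "output_${chunk}.txt"
done
cat output_ref_chunk_*.txt > output.txt
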
cjain7 commented 6 years ago

Thanks for sharing the info. I tried creating a custom dataset of 1000 E. coli genomes at my end but could not reproduce the above issue. Let me know if the data you are using is public.