ToniWestbrook / paladin

Protein Alignment and Detection Interface
MIT License
60 stars, 7 forks

[bwt_pac2bwt] Failed to allocate memory #30

Closed (ghost closed this issue 6 years ago)

ghost commented 7 years ago

Hi,

I'm trying to build a database from NCBI-NR, but it seems like bowtie(?) is failing:

$ paladin index -r3 ../nr.gz
[M::command_index] Translating protein sequence...0.00 sec
[M::command_index] Packing protein sequence... 1735.20 sec
[M::command_index] Constructing BWT for the packed sequence... [bwt_pac2bwt] Failed to allocate 76177038661 bytes at bwtindex.c line 132: Cannot allocate memory

There's 500GB of memory available, so I don't see why it couldn't allocate 76GB. Any suggestions? Thanks!

ToniWestbrook commented 7 years ago

Hello - unfortunately, the memory allocation issue you're running into is by design. BWT (Burrows-Wheeler Transform) construction requires an amount of memory equal to approximately 8x the packed reference size. I just ran an index on NR, and the packed size ends up being about 80GB, so it would need around 644GB of RAM total to index a reference this size. This is generally why we focused on using clustered references (like UniRef90). In the cases where we wanted to refine that last 10%, we declustered the hit subjects into their constituent sequences (using PALADIN plugins) and then ran a secondary alignment against those. But if you want to use very large non-clustered references, you'll need a significant amount of memory (and/or swap) to index.
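As a rough sanity check before indexing, the 8x rule of thumb above can be turned into a quick estimate. This is only a sketch: the factor of 8 is the approximate figure from this comment, the function names are hypothetical (they are not part of PALADIN), and the 80GB and 500GB values are the ones mentioned in this thread.

```python
# Approximate multiplier from the comment above: BWT construction needs
# roughly 8x the packed reference size. This is an estimate, not an exact bound.
BWT_MEMORY_FACTOR = 8

GB = 10**9  # decimal gigabytes


def estimate_bwt_memory(packed_bytes, factor=BWT_MEMORY_FACTOR):
    """Return the estimated bytes of RAM needed to build the BWT (hypothetical helper)."""
    return packed_bytes * factor


def fits_in_memory(packed_bytes, available_bytes, factor=BWT_MEMORY_FACTOR):
    """Return True if the estimated requirement fits in the available memory."""
    return estimate_bwt_memory(packed_bytes, factor) <= available_bytes


packed = 80 * GB       # approximate packed size of NR, as reported above
available = 500 * GB   # memory available on the machine in this issue

needed = estimate_bwt_memory(packed)
print(f"Estimated need: {needed / GB:.0f} GB")            # Estimated need: 640 GB
print("Fits in available memory:", fits_in_memory(packed, available))  # False
```

With these inputs the estimate comes out at about 640GB, in the same range as the ~644GB quoted above and well over the 500GB available.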

ghost commented 7 years ago

Alright, thanks for the answer! I will try it again with more memory. Unfortunately, my query with the clustered reference wasn't very successful. Fewer than 1% of nanopore reads found a hit, which is substantially less than what I get with other DNA- and protein-based aligners, so I figured I could try to expand the database first.