DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Classification with the nt database memory issues #148

Closed: Piplopp closed this issue 5 years ago

Piplopp commented 5 years ago

Hello!

I'm currently trying to use your tool with the nt database and was wondering what kind of machine I need to run it properly.

My current machine has 150 GB of RAM and 32 CPUs, but it failed to build the nt database (I was following the detailed steps from the recentrifuge GitHub pages): after 500+ hours with no information in the logs, I killed the process and downloaded your version of the nt database (~100 GB uncompressed).

Your paper recommends 128 GB for the nt database, but I still run into this error with 150 GB:

Out of memory allocating the offs[] array for the Bowtie index.
Please try again on a computer with more memory.
Error: Encountered internal Centrifuge exception (#1)
Command: /usr/local/bin/centrifuge-class --wrapper basic-0 -q -p 32 -x nt_dl -S seq1_unambiguous.out seq1_unambiguous.fq.gz
(ERR): centrifuge-class exited with value 1

I tried lowering the thread count to 1 but still got the same error. Do you know how I can solve this, or how much memory I need? Also, the nt build issue is a bit problematic, but maybe that's due to me not having enough RAM.

Thanks a lot

mourisl commented 5 years ago

Can you run centrifuge in verbose mode (-v)? There should be some information about how large the offs array would be.

Piplopp commented 5 years ago

I did; here are the results (the full log is below, but I made a "digest" version with all the relevant information about offs I could find).

Also a quick note: the -v option for verbose does not work; we need to use --verbose instead.

...
Headers:
...
    offsLen: 10082534135
    offsSz: 80660273080
...
Reading offs (10082534135 64-bit words): 09:49:50
Out of memory allocating the offs[] array  for the Bowtie index.
...

Full log

(INFO): Before arg handling:
(INFO):   Wrapper args:
[  ]
(INFO):   Binary args:
[ -q --verbose -p 32 -x nt_dl -S seq1_unambiguous.out seq1_unambiguous.fq.gz ]
(INFO): After arg handling:
(INFO):   Binary args:
[ -q --verbose -p 32 -x nt_dl -S seq1_unambiguous.out seq1_unambiguous.fq.gz ]
(INFO): Cannot find any index option (--reference-string, --ref-string or -x) in the given command line.
(INFO): /usr/local/bin/centrifuge-class --wrapper basic-0 -q --verbose -p 32 -x nt_dl -S seq1_unambiguous.out seq1_unambiguous.fq.gz
Applying preset: 'sensitive' using preset menu 'V0'
Final policy string: 'SEED=0,22;DPS=15;ROUNDS=2;IVAL=S,1,1.15'
Input bt2 file: "nt_dl"
Query inputs (DNA, FASTQ):
  seq1_unambiguous.fq.gz
Quality inputs:
Output file: "seq1_unambiguous.out"
Local endianness: little
Sanity checking: disabled
Assertions: disabled
Entered driver(): 09:42:04
Creating PatternSource: 09:42:04
Opening hit output file: 09:42:04
About to initialize fw Ebwt: 09:42:04
Trying nt_dl
  About to open input files: 09:42:04
Opening "nt_dl.1.cf"
Opening "nt_dl.2.cf"
  Finished opening input files: 09:42:04
  Reading header: 09:42:04
Headers:
    len: 161320546158
    bwtLen: 161320546159
    sz: 40330136540
    bwtSz: 40330136540
    lineRate: 7
    offRate: 4
    offMask: 0xfffffffffffffff0
    ftabChars: 14
    eftabLen: 28
    eftabSz: 224
    ftabLen: 268435457
    ftabSz: 2147483656
    offsLen: 10082534135
    offsSz: 80660273080
    lineSz: 128
    sideSz: 128
    sideBwtSz: 96
    sideBwtLen: 384
    numSides: 420105589
    numLines: 420105589
    ebwtTotLen: 53773515392
    ebwtTotSz: 53773515392
    color: 0
    reverse: 0
Reading plen (46203723): 09:42:04
Opening "nt_dl.3.cf"
Opening "nt_dl.4.cf"
  About to open input files: 09:43:05
Opening "nt_dl.1.cf"
Opening "nt_dl.2.cf"
  Finished opening input files: 09:43:05
  Reading header: 09:43:05
Reading plen (46203723): 09:43:05
Reading rstarts (777423930): 09:43:06
Reading ebwt (53773515392): 09:43:33
Reading fchr (5)
Reading ftab (268435457): 09:47:06
Reading eftab (28): 09:47:19
Reading offs (10082534135 64-bit words): 09:49:50
Out of memory allocating the offs[] array  for the Bowtie index.
Please try again on a computer with more memory.
Error: Encountered internal Centrifuge exception (#1)
Command: /usr/local/bin/centrifuge-class --wrapper basic-0 -q --verbose -p 32 -x nt_dl -S seq1_unambiguous.out seq1_unambiguous.fq.gz 
(ERR): centrifuge-class exited with value 1
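For what it's worth, the header fields in the verbose log above already let you estimate the RAM the index will need before starting a run. A minimal sketch in Python, summing the three largest byte-size fields reported for this nt index (the field values are copied from the log; treating their sum as the floor for required RAM is an assumption, since the loader also needs plen, rstarts, and per-thread buffers on top):

```python
# Rough lower bound on RAM needed to load this Centrifuge index,
# using the *Sz fields (sizes in bytes) from the --verbose header above.
header_bytes = {
    "ebwtTotSz": 53_773_515_392,  # the BWT itself
    "offsSz":    80_660_273_080,  # the offs[] array that failed to allocate
    "ftabSz":     2_147_483_656,  # the ftab lookup table
}

total_gb = sum(header_bytes.values()) / 1e9
print(f"index tables alone: ~{total_gb:.0f} GB")
# prints "index tables alone: ~137 GB"
```

Since the offs[] array is only allocated after the ~54 GB BWT is already resident, ~137 GB of tables plus process overhead comfortably exceeds 150 GB, which matches the failure at exactly this step in the log.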
bartns commented 5 years ago

Hi, last time I used it on the 3/3/2018 nt db, memory usage peaked at ~195 GB (10 threads, running with compressed fastq). The paper is from 2016, so I guess the recommendation in the paper just doesn't hold anymore...

Would be nice to have the memory usage a bit lower though ;)

Piplopp commented 5 years ago

Yep, I indeed needed around 200 GB with compressed fastq inputs.
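Given the multi-hour cost of a failed run, a pre-flight check of physical RAM can be worthwhile. A sketch (the ~200 GB threshold is taken from the observations in this thread, not from Centrifuge documentation; `SC_PHYS_PAGES` is available on Linux but is optional in POSIX):

```python
import os

# Total physical RAM in GB (Linux; SC_PHYS_PAGES is not guaranteed on all OSes).
total_ram_gb = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1e9

# ~200 GB is the peak usage reported in this thread for the nt index.
REQUIRED_GB = 200
if total_ram_gb < REQUIRED_GB:
    print(f"warning: only {total_ram_gb:.0f} GB RAM; "
          f"the nt index reportedly needs ~{REQUIRED_GB} GB")
```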