DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

centrifuge-build works, but the index can't be read into memory #214

Closed jameslz closed 2 years ago

jameslz commented 2 years ago

centrifuge-build works,

centrifuge-build -p 16 --bmax 1342177280  --conversion-table gtdb_reps.taxid --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp gtdb_reps.dustmasker.fna  gtdb

But read classification takes a very long time, and the index never seems to finish loading into memory. What is the problem?

centrifuge -x gtdb -r test.fq

jameslz commented 2 years ago

If I remove gtdb.3.cf and create an empty gtdb.3.cf with touch, the database can be loaded into memory, but the run then fails with a dump error.

mourisl commented 2 years ago

Could you please run centrifuge with the --verbose option to check at which step Centrifuge gets stuck?

jameslz commented 2 years ago

@mourisl

Final policy string: 'SEED=0,22;DPS=15;ROUNDS=2;IVAL=S,1,1.15'
Input bt2 file: "centrifuge/gtdb"
Query inputs (DNA, FASTQ):
Quality inputs:
Output file: ""
Local endianness: little
Sanity checking: disabled
Assertions: disabled
Entered driver(): 01:08:35
Creating PatternSource: 01:08:35
Opening hit output file: 01:08:35
About to initialize fw Ebwt: 01:08:35
Trying centrifuge/gtdb
About to open input files: 01:08:35
Opening "centrifuge/gtdb.1.cf"
Opening "centrifuge/gtdb.2.cf"
Finished opening input files: 01:08:35
Reading header: 01:08:35
Headers:
    len: 137422586041
    bwtLen: 137422586042
    sz: 34355646511
    bwtSz: 34355646511
    lineRate: 7
    offRate: 4
    offMask: 0xfffffffffffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 8588911628
    offsSz: 68711293024
    lineSz: 128
    sideSz: 128
    sideBwtSz: 96
    sideBwtLen: 384
    numSides: 357871318
    numLines: 357871318
    ebwtTotLen: 45807528704
    ebwtTotSz: 45807528704
    color: 0
    reverse: 0
Reading plen (38342): 01:08:35
Opening "centrifuge/gtdb.3.cf"
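Since the question is whether the index fits into memory, the header values in this log give a rough lower bound on what has to be loaded. The sketch below is only an estimate under stated assumptions: it treats ebwtTotSz, offsSz and ftabSz as byte counts (consistent with offsSz being exactly 8 times offsLen), and it ignores the .2.cf and .3.cf files and any working memory. The function name index_bytes is made up for illustration.

```python
# Header values copied from the verbose output above.
headers = {
    "ebwtTotSz": 45807528704,  # BWT data
    "offsLen": 8588911628,     # number of sampled offsets
    "offsSz": 68711293024,     # offsets array
    "ftabSz": 8388616,         # lookup table
}


def index_bytes(h):
    """Sum the three largest arrays listed in the header.

    Assumption: the *Sz fields are sizes in bytes.
    """
    return h["ebwtTotSz"] + h["offsSz"] + h["ftabSz"]


# Sanity check of the byte assumption: 8 bytes per 64-bit offset.
assert headers["offsLen"] * 8 == headers["offsSz"]

total = index_bytes(headers)
gib = total / 2**30
print(f"{total} bytes, about {gib:.1f} GiB")
```

Under these assumptions the index needs a bit over 100 GiB of RAM, so a machine with less than that would struggle regardless of how the .3.cf file is processed.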

jameslz commented 2 years ago

I rebuilt the database and got the same result: it gets stuck at Opening "centrifuge/gtdb.3.cf".

jameslz commented 2 years ago

@mourisl

The classification now finishes, but it needs about 1 h to load the .3.cf file.

Applying preset: 'sensitive' using preset menu 'V0'
Final policy string: 'SEED=0,22;DPS=15;ROUNDS=2;IVAL=S,1,1.15'
Input bt2 file: "/project/wol/gtdb"
Query inputs (DNA, FASTQ): test/A3_1.fq
Quality inputs:
Output file: "test.txt"
Local endianness: little
Sanity checking: disabled
Assertions: disabled
Entered driver(): 02:03:39
Creating PatternSource: 02:03:39
Opening hit output file: 02:03:39
About to initialize fw Ebwt: 02:03:39
Trying /project/wol/gtdb
About to open input files: 02:03:39
Opening "/project/wol/gtdb.1.cf"
Opening "/project/wol/gtdb.2.cf"
Finished opening input files: 02:03:39
Reading header: 02:03:39
Headers:
    len: 162982612310
    bwtLen: 162982612311
    sz: 40745653078
    bwtSz: 40745653078
    lineRate: 7
    offRate: 4
    offMask: 0xfffffffffffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 10186413270
    offsSz: 81491306160
    lineSz: 128
    sideSz: 128
    sideBwtSz: 96
    sideBwtLen: 384
    numSides: 424433887
    numLines: 424433887
    ebwtTotLen: 54327537536
    ebwtTotSz: 54327537536
    color: 0
    reverse: 0
Reading plen (47894): 02:03:39
Opening "/project/wol/gtdb.3.cf"
Opening "/project/wol/gtdb.4.cf"
About to open input files: 02:55:30
Opening "/project/wol/gtdb.1.cf"
Opening "/project/wol/gtdb.2.cf"
Finished opening input files: 02:55:30
Reading header: 02:55:30
Reading plen (47894): 02:55:30
Reading rstarts (40316139): 02:55:30
Reading ebwt (54327537536): 02:55:30
Reading fchr (5)
Reading ftab (1048577): 02:57:03
Reading eftab (20): 02:57:04
Reading offs (10186413270 64-bit words): 02:57:04
Headers:
    len: 162982612310
    bwtLen: 162982612311
    sz: 40745653078
    bwtSz: 40745653078
    lineRate: 7
    offRate: 4
    offMask: 0xfffffffffffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 10186413270
    offsSz: 81491306160
    lineSz: 128
    sideSz: 128
    sideBwtSz: 96
    sideBwtLen: 384
    numSides: 424433887
    numLines: 424433887
    ebwtTotLen: 54327537536
    ebwtTotSz: 54327537536
    color: 0
    reverse: 0
creating patternsource for 1-th input: 02:57:50
Dispatching to search driver: 02:57:50
report file centrifuge_report.tsv
Number of iterations in EM algorithm: 69
Probability diff. (P - P_prev) in the last iteration: 3.01961e-11
Calculating abundance: 00:00:09
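To see where the time goes in a verbose log like the one above, the gaps between consecutive timestamps can be computed mechanically. The following is a minimal sketch, not part of Centrifuge: the helper step_durations and the regular expression are illustrative, and the sample lines are a subset copied from the log. It assumes each timestamped line ends in HH:MM:SS and charges the gap to the earlier line.

```python
import re
from datetime import datetime, timedelta

# A few timestamped lines copied from the verbose log above.
LOG = """\
Reading plen (47894): 02:03:39
About to open input files: 02:55:30
Reading ebwt (54327537536): 02:55:30
Reading ftab (1048577): 02:57:03
Reading offs (10186413270 64-bit words): 02:57:04
creating patternsource for 1-th input: 02:57:50
"""

# "<label>: HH:MM:SS" at the end of a line.
STAMP = re.compile(r"^(?P<label>.+?):\s*(?P<time>\d{2}:\d{2}:\d{2})\s*$")


def step_durations(log_text):
    """Return (label, seconds until the next timestamped line) pairs."""
    events = []
    for line in log_text.splitlines():
        m = STAMP.match(line)
        if m:
            when = datetime.strptime(m.group("time"), "%H:%M:%S")
            events.append((m.group("label"), when))
    durations = []
    for (label, start), (_, end) in zip(events, events[1:]):
        delta = end - start
        if delta < timedelta(0):  # the log crossed midnight
            delta += timedelta(days=1)
        durations.append((label, int(delta.total_seconds())))
    return durations


durations = step_durations(LOG)
slowest = max(durations, key=lambda d: d[1])
print(slowest)  # ('Reading plen (47894)', 3111)
```

The 3111 s (about 52 minutes) gap follows the "Reading plen" line, which is exactly where gtdb.3.cf and gtdb.4.cf are opened. Reading the 54 GB ebwt array, by comparison, takes 93 s.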

mourisl commented 2 years ago

How many species are there in your taxonomy tree? There could be an efficiency issue when processing the .3.cf file, which contains the taxonomy information. Thank you.

jameslz commented 2 years ago

We use the GTDB representative genomes: 47,894 genomes, 47,894 species and 47,894 sequences.

mourisl commented 2 years ago

I just updated the method for processing the 3.cf file. Could you git pull, recompile centrifuge and give it a try? You don't need to rebuild the index. Thank you.

jameslz commented 2 years ago

It works, thank you. The classification now needs just a few minutes to finish.

Applying preset: 'sensitive' using preset menu 'V0'
Final policy string: 'SEED=0,22;DPS=15;ROUNDS=2;IVAL=S,1,1.15'
Input bt2 file: "/project/wol/centrifuge/gtdb"
Query inputs (DNA, FASTQ): A3_1.fq
Quality inputs:
Output file: "test.txt"
Local endianness: little
Sanity checking: disabled
Assertions: disabled
Entered driver(): 18:37:10
Creating PatternSource: 18:37:10
Opening hit output file: 18:37:10
About to initialize fw Ebwt: 18:37:10
Trying /project/wol/centrifuge/gtdb
About to open input files: 18:37:10
Opening "/project/wol/centrifuge/gtdb.1.cf"
Opening "/project/wol/centrifuge/gtdb.2.cf"
Finished opening input files: 18:37:10
Reading header: 18:37:10
Headers:
    len: 162982612310
    bwtLen: 162982612311
    sz: 40745653078
    bwtSz: 40745653078
    lineRate: 7
    offRate: 4
    offMask: 0xfffffffffffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 10186413270
    offsSz: 81491306160
    lineSz: 128
    sideSz: 128
    sideBwtSz: 96
    sideBwtLen: 384
    numSides: 424433887
    numLines: 424433887
    ebwtTotLen: 54327537536
    ebwtTotSz: 54327537536
    color: 0
    reverse: 0
Reading plen (47894): 18:37:10
Opening "/project/wol/centrifuge/gtdb.3.cf"
Opening "/project/wol/centrifuge/gtdb.4.cf"
About to open input files: 18:37:10
Opening "/project/wol/centrifuge/gtdb.1.cf"
Opening "/project/wol/centrifuge/gtdb.2.cf"
Finished opening input files: 18:37:10
Reading header: 18:37:10
Reading plen (47894): 18:37:10
Reading rstarts (40316139): 18:37:10
Reading ebwt (54327537536): 18:37:10
Reading fchr (5)
Reading ftab (1048577): 18:38:51
Reading eftab (20): 18:38:51
Reading offs (10186413270 64-bit words): 18:38:51
Headers:
    len: 162982612310
    bwtLen: 162982612311
    sz: 40745653078
    bwtSz: 40745653078
    lineRate: 7
    offRate: 4
    offMask: 0xfffffffffffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 10186413270
    offsSz: 81491306160
    lineSz: 128
    sideSz: 128
    sideBwtSz: 96
    sideBwtLen: 384
    numSides: 424433887
    numLines: 424433887
    ebwtTotLen: 54327537536
    ebwtTotSz: 54327537536
    color: 0
    reverse: 0
creating patternsource for 1-th input: 18:39:03
Dispatching to search driver: 18:39:03
report file centrifuge_report.tsv
Number of iterations in EM algorithm: 69
Probability diff. (P - P_prev) in the last iteration: 3.01961e-11
Calculating abundance: 00:00:05