Closed jameslz closed 2 years ago
If I remove gtdb.3.cf and touch an empty gtdb.3.cf, the database can be loaded into memory, but Centrifuge then fails with a dump error.
Could you please run centrifuge with the --verbose option to check at which step Centrifuge gets stuck?
@mourisl
Final policy string: 'SEED=0,22;DPS=15;ROUNDS=2;IVAL=S,1,1.15'
Input bt2 file: "centrifuge/gtdb"
Query inputs (DNA, FASTQ):
Quality inputs:
Output file: ""
Local endianness: little
Sanity checking: disabled
Assertions: disabled
Entered driver(): 01:08:35
Creating PatternSource: 01:08:35
Opening hit output file: 01:08:35
About to initialize fw Ebwt: 01:08:35
Trying centrifuge/gtdb
About to open input files: 01:08:35
Opening "centrifuge/gtdb.1.cf"
Opening "centrifuge/gtdb.2.cf"
Finished opening input files: 01:08:35
Reading header: 01:08:35
Headers:
    len: 137422586041
    bwtLen: 137422586042
    sz: 34355646511
    bwtSz: 34355646511
    lineRate: 7
    offRate: 4
    offMask: 0xfffffffffffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 8588911628
    offsSz: 68711293024
    lineSz: 128
    sideSz: 128
    sideBwtSz: 96
    sideBwtLen: 384
    numSides: 357871318
    numLines: 357871318
    ebwtTotLen: 45807528704
    ebwtTotSz: 45807528704
    color: 0
    reverse: 0
Reading plen (38342): 01:08:35
Opening "centrifuge/gtdb.3.cf"
I rebuilt the database and the result is the same: it gets stuck at "Opening "centrifuge/gtdb.3.cf"".
@mourisl
The classification now finishes, but it takes about 1 hour to load the 3.cf file.
Applying preset: 'sensitive' using preset menu 'V0'
Final policy string: 'SEED=0,22;DPS=15;ROUNDS=2;IVAL=S,1,1.15'
Input bt2 file: "/project/wol/gtdb"
Query inputs (DNA, FASTQ):
  test/A3_1.fq
Quality inputs:
Output file: "test.txt"
Local endianness: little
Sanity checking: disabled
Assertions: disabled
Entered driver(): 02:03:39
Creating PatternSource: 02:03:39
Opening hit output file: 02:03:39
About to initialize fw Ebwt: 02:03:39
Trying /project/wol/gtdb
About to open input files: 02:03:39
Opening "/project/wol/gtdb.1.cf"
Opening "/project/wol/gtdb.2.cf"
Finished opening input files: 02:03:39
Reading header: 02:03:39
Headers:
    len: 162982612310
    bwtLen: 162982612311
    sz: 40745653078
    bwtSz: 40745653078
    lineRate: 7
    offRate: 4
    offMask: 0xfffffffffffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 10186413270
    offsSz: 81491306160
    lineSz: 128
    sideSz: 128
    sideBwtSz: 96
    sideBwtLen: 384
    numSides: 424433887
    numLines: 424433887
    ebwtTotLen: 54327537536
    ebwtTotSz: 54327537536
    color: 0
    reverse: 0
Reading plen (47894): 02:03:39
Opening "/project/wol/gtdb.3.cf"
Opening "/project/wol/gtdb.4.cf"
About to open input files: 02:55:30
Opening "/project/wol/gtdb.1.cf"
Opening "/project/wol/gtdb.2.cf"
Finished opening input files: 02:55:30
Reading header: 02:55:30
Reading plen (47894): 02:55:30
Reading rstarts (40316139): 02:55:30
Reading ebwt (54327537536): 02:55:30
Reading fchr (5)
Reading ftab (1048577): 02:57:03
Reading eftab (20): 02:57:04
Reading offs (10186413270 64-bit words): 02:57:04
Headers:
    len: 162982612310
    bwtLen: 162982612311
    sz: 40745653078
    bwtSz: 40745653078
    lineRate: 7
    offRate: 4
    offMask: 0xfffffffffffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 10186413270
    offsSz: 81491306160
    lineSz: 128
    sideSz: 128
    sideBwtSz: 96
    sideBwtLen: 384
    numSides: 424433887
    numLines: 424433887
    ebwtTotLen: 54327537536
    ebwtTotSz: 54327537536
    color: 0
    reverse: 0
creating patternsource for 1-th input: 02:57:50
Dispatching to search driver: 02:57:50
report file centrifuge_report.tsv
Number of iterations in EM algorithm: 69
Probability diff. (P - P_prev) in the last iteration: 3.01961e-11
Calculating abundance: 00:00:09
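For anyone reading a --verbose log like this one, here is a rough Python sketch for finding the slow step: it pairs up consecutive timestamped lines and sorts the gaps. The helper name `step_gaps` and the `SAMPLE` excerpt are made up for illustration and are not part of Centrifuge. It assumes one log message per line, with the wall-clock time at the end of the line as HH:MM:SS.

```python
import re
from datetime import datetime, timedelta

# Matches verbose lines that end in a wall-clock time, e.g. 'Reading header: 02:03:39'.
STEP = re.compile(r"^(?P<label>.+?):\s+(?P<clock>\d{2}:\d{2}:\d{2})\s*$")

# 'Calculating abundance: 00:00:09' reports a duration, not a clock time, so skip it.
NOT_A_CLOCK = ("Calculating abundance",)


def step_gaps(log_text):
    """Return (seconds, earlier step, later step) tuples, largest gap first."""
    stamped = []
    for line in log_text.splitlines():
        m = STEP.match(line.strip())
        if m and not m.group("label").startswith(NOT_A_CLOCK):
            when = datetime.strptime(m.group("clock"), "%H:%M:%S")
            stamped.append((m.group("label"), when))
    gaps = []
    for (prev_label, prev_t), (label, t) in zip(stamped, stamped[1:]):
        delta = t - prev_t
        if delta < timedelta(0):  # the run crossed midnight
            delta += timedelta(days=1)
        gaps.append((int(delta.total_seconds()), prev_label, label))
    return sorted(gaps, reverse=True)


# Hypothetical excerpt: a few lines copied from the log above.
SAMPLE = """Reading header: 02:03:39
Reading plen (47894): 02:03:39
Opening "/project/wol/gtdb.3.cf"
Opening "/project/wol/gtdb.4.cf"
About to open input files: 02:55:30
Reading ebwt (54327537536): 02:55:30
Reading ftab (1048577): 02:57:03
"""

for seconds, before, after in step_gaps(SAMPLE):
    print(f"{seconds:>6d} s  between '{before}' and '{after}'")
```

On this excerpt the largest gap is 3111 s (about 52 minutes), between "Reading plen (47894)" and the next "About to open input files". That is the window in which the .3.cf and .4.cf files are opened. Reading the ebwt itself takes only about a minute and a half.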
How many species are there in your taxonomy tree? There could be an efficiency issue when processing the .3.cf file, which contains the taxonomy information. Thank you.
We use the GTDB representative genomes: 47,894 genomes, 47,894 species and 47,894 sequences.
I just updated the method for processing the 3.cf file. Could you git pull, recompile centrifuge and give it a try? You don't need to rebuild the index. Thank you.
It works, thank you. The classification now finishes in just a few minutes.
Applying preset: 'sensitive' using preset menu 'V0'
Final policy string: 'SEED=0,22;DPS=15;ROUNDS=2;IVAL=S,1,1.15'
Input bt2 file: "/project/wol/centrifuge/gtdb"
Query inputs (DNA, FASTQ):
  A3_1.fq
Quality inputs:
Output file: "test.txt"
Local endianness: little
Sanity checking: disabled
Assertions: disabled
Entered driver(): 18:37:10
Creating PatternSource: 18:37:10
Opening hit output file: 18:37:10
About to initialize fw Ebwt: 18:37:10
Trying /project/wol/centrifuge/gtdb
About to open input files: 18:37:10
Opening "/project/wol/centrifuge/gtdb.1.cf"
Opening "/project/wol/centrifuge/gtdb.2.cf"
Finished opening input files: 18:37:10
Reading header: 18:37:10
Headers:
    len: 162982612310
    bwtLen: 162982612311
    sz: 40745653078
    bwtSz: 40745653078
    lineRate: 7
    offRate: 4
    offMask: 0xfffffffffffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 10186413270
    offsSz: 81491306160
    lineSz: 128
    sideSz: 128
    sideBwtSz: 96
    sideBwtLen: 384
    numSides: 424433887
    numLines: 424433887
    ebwtTotLen: 54327537536
    ebwtTotSz: 54327537536
    color: 0
    reverse: 0
Reading plen (47894): 18:37:10
Opening "/project/wol/centrifuge/gtdb.3.cf"
Opening "/project/wol/centrifuge/gtdb.4.cf"
About to open input files: 18:37:10
Opening "/project/wol/centrifuge/gtdb.1.cf"
Opening "/project/wol/centrifuge/gtdb.2.cf"
Finished opening input files: 18:37:10
Reading header: 18:37:10
Reading plen (47894): 18:37:10
Reading rstarts (40316139): 18:37:10
Reading ebwt (54327537536): 18:37:10
Reading fchr (5)
Reading ftab (1048577): 18:38:51
Reading eftab (20): 18:38:51
Reading offs (10186413270 64-bit words): 18:38:51
Headers:
    len: 162982612310
    bwtLen: 162982612311
    sz: 40745653078
    bwtSz: 40745653078
    lineRate: 7
    offRate: 4
    offMask: 0xfffffffffffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 160
    ftabLen: 1048577
    ftabSz: 8388616
    offsLen: 10186413270
    offsSz: 81491306160
    lineSz: 128
    sideSz: 128
    sideBwtSz: 96
    sideBwtLen: 384
    numSides: 424433887
    numLines: 424433887
    ebwtTotLen: 54327537536
    ebwtTotSz: 54327537536
    color: 0
    reverse: 0
creating patternsource for 1-th input: 18:39:03
Dispatching to search driver: 18:39:03
report file centrifuge_report.tsv
Number of iterations in EM algorithm: 69
Probability diff. (P - P_prev) in the last iteration: 3.01961e-11
Calculating abundance: 00:00:05
centrifuge-build works, but read classification takes a very long time and the index cannot be loaded into memory. What is the problem?