Closed: nick-youngblut closed this issue 2 years ago.
As a test of reproducibility, I killed the krakenuniq-build job at the end of the above post (at Getting database0.kdb into memory (347.204 GB) ...), and instead tried krakenuniq-build --work-on-disk again to make sure that it would generate the same output as above:
Kraken build set to minimize RAM usage.
Found 10000 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Creating taxDB (step 5 of 6)...
taxDB construction finished. [2.846s]
Building KrakenUniq LCA database (step 6 of 6)...
...however, krakenuniq-build --work-on-disk instead produced the following output:
Found jellyfish v1.1.12
Kraken build set to minimize RAM usage.
Found 10000 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Skipping step 5, taxDB exists.
Building KrakenUniq LCA database (step 6 of 6)...
Reading taxonomy index from taxDB. Done.
You need to operate in RAM (flag -M) to use output to a different file (flag -o)
xargs: cat: terminated by signal 13
I get the same error with krakenuniq=0.6 when starting a krakenuniq-build job on a new library (using --work-on-disk):
Found jellyfish v1.1.12
Kraken build set to minimize RAM usage.
Finding all library files
Found 500 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using jellyfish
Hash size not specified, using '1637986465'
K-mer set created. [8m10.272s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: Getting database into memory ...Loaded database with 1623560677 keys with k of 31 [val_len 4, key_len 8].
Loaded database with 1623560677 keys with k of 31 [val_len 4, key_len 8].
db_sort: Sorting ...db_sort: Sorting complete - writing database to disk ...
K-mer set sorted. [22m46.406s]
Creating seqID to taxID map (step 4 of 6)..
61039 sequences mapped to taxa. [3.395s]
Creating taxDB (step 5 of 6)...
Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 401815 taxa
taxDB construction finished. [3.468s]
Building KrakenUniq LCA database (step 6 of 6)...
Reading taxonomy index from taxDB. Done.
You need to operate in RAM (flag -M) to use output to a different file (flag -o)
xargs: cat: terminated by signal 13
Please check the --work-on-disk option in the latest release, v0.7.3; it should work properly now.
With v0.7.3, I'm still getting the error described at https://github.com/fbreitwieser/krakenuniq/issues/52. My build directory includes:
database-build.log
database.jdb
database0.kdb
database_0
database_1
library/
library-files.txt
seqid2taxid-plus.map
seqid2taxid.map
taxDB
taxonomy/
What is your command? I would like to reproduce the error.
On Thu, Jun 23, 2022 at 8:48 AM Nick Youngblut wrote:
I get the following error when using --work-on-disk with v0.7.3:
Kraken build set to minimize RAM usage.
Found 500 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Skipping step 5, taxDB exists.
Building KrakenUniq LCA database (step 6 of 6)...
Reading taxonomy index from taxDB. Done.
Loaded database with 1623560677 keys with k of 31 [val_len 4, key_len 8].
set_lcas: unable to open database.idx: No such file or directory
xargs: cat: terminated by signal 13
No such error occurs if I don't use --work-on-disk:
Kraken build set to minimize disk writes.
Found 500 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Skipping step 5, taxDB exists.
Building KrakenUniq LCA database (step 6 of 6)...
Reading taxonomy index from taxDB. Done.
Getting database0.kdb into memory (18.145 GB) ...
A simple ./krakenuniq-build --kmer-len 31 --build --threads 12 --db $DB, with $DB denoting the database base directory path.
Using --rebuild does not help (just checked again).
The command I have been using to test was:
krakenuniq-build --db . --threads 32 --work-on-disk
I have library and taxonomy folders in the current dir. I will test with library and taxonomy in another folder.
I tried krakenuniq-build --db . --threads 32 --work-on-disk in the appropriate directory, but I still got the same error.
Maybe it's due to how I'm adding genomes to the library? My simple helper script for that:
#!/usr/bin/env python
from __future__ import print_function
import os
import re
import gzip
import bz2
import argparse
import logging

# logging
logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.DEBUG)

# argparse
class CustomFormatter(argparse.ArgumentDefaultsHelpFormatter,
                      argparse.RawDescriptionHelpFormatter):
    pass

desc = 'Adding genome to krakenuniq database'
epi = """DESCRIPTION:
Write output files to db_dir:
* renamed genome fasta (all special characters removed from names)
* krakenuniq map file
"""
parser = argparse.ArgumentParser(description=desc, epilog=epi,
                                 formatter_class=CustomFormatter)
parser.add_argument('fasta_file', type=str,
                    help='Input genome fasta file')
parser.add_argument('taxid', type=str,
                    help='Taxonomy ID for the genome')
parser.add_argument('sample', type=str,
                    help='Genome name')
parser.add_argument('db_dir', type=str,
                    help='Output database location (e.g., ku_db/library/)')
parser.add_argument('--version', action='version', version='0.0.1')

# functions
def _open(infile, mode='rb'):
    """
    Opening of input, regardless of compression
    """
    if infile.endswith('.bz2'):
        return bz2.open(infile, mode)
    elif infile.endswith('.gz'):
        return gzip.open(infile, mode)
    else:
        return open(infile)

def copy_genome(infile, outdir, sample):
    """Copy the genome fasta into the library dir, sanitizing sequence headers"""
    outfile = os.path.join(outdir, sample + '.fna')
    regex = re.compile(r'[^>A-Za-z0-9-\n]')
    # compressed inputs are read as bytes and must be decoded
    compressed = infile.endswith(('.gz', '.bz2'))
    contigs = list()
    with _open(infile) as inF, open(outfile, 'w') as outF:
        for line in inF:
            if compressed:
                line = line.decode('utf-8')
            # seq header
            if line.startswith('>'):
                line = regex.sub('_', line)
                contigs.append(line.lstrip('>').rstrip())
            # writing to output directory
            outF.write(line)
    logging.info(f'File written: {outfile}')
    # return
    return contigs

def write_map(contigs, outdir, sample, taxid):
    """Write the per-genome seqID-to-taxID map file"""
    outfile = os.path.join(outdir, sample + '.map')
    with open(outfile, 'w') as outF:
        for contig in contigs:
            outF.write('\t'.join([contig, taxid, sample]) + '\n')
    logging.info(f'File written: {outfile}')

## main interface function
def main(args):
    if not os.path.isdir(args.db_dir):
        os.makedirs(args.db_dir)
    contigs = copy_genome(args.fasta_file, args.db_dir, args.sample)
    write_map(contigs, args.db_dir, args.sample, args.taxid)

## script main
if __name__ == '__main__':
    args = parser.parse_args()
    main(args)
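For reference, this is roughly how I call the helper for each genome and what it leaves in the library directory (the script name, genome file, and taxid below are just placeholders):
# hypothetical invocation; adjust the script path, genome file, taxid, and name
python add_genome_to_library.py genome1.fna.gz 562 genome1 ku_db/library/
# -> ku_db/library/genome1.fna  (fasta with sanitized headers)
# -> ku_db/library/genome1.map  (one tab-delimited line per contig: <contig_id> <TAB> 562 <TAB> genome1)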
It is possible. The command worked fine for me just now, see below.
$ krakenuniq-build --db DBDIR --threads 32 --work-on-disk
Kraken build set to minimize RAM usage.
Finding all library files
Found 1 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using /ccb/sw/bin/jellyfish-install/bin/jellyfish
Hash size not specified, using '2575692630'
K-mer set created. [13m43.538s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: Getting database into memory ...Loaded database with 2505641687 keys with k of 31 [val_len 4, key_len 8].
Loaded database with 2505641687 keys with k of 31 [val_len 4, key_len 8].
db_sort: Sorting ...db_sort: Sorting complete - writing database to disk ...
K-mer set sorted. [48m52.013s]
Creating seqID to taxID map (step 4 of 6)..
705 sequences mapped to taxa. [0.059s]
Creating taxDB (step 5 of 6)...
Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 2426193 taxa
taxDB construction finished. [1m4.789s]
Building KrakenUniq LCA database (step 6 of 6)...
Reading taxonomy index from taxDB. Done.
Loaded database with 2505641687 keys with k of 31 [val_len 4, key_len 8].
Reading sequence ID to taxonomy ID mapping ... got 705 mappings.
Finished processing 705 sequences (skipping 0 empty sequences, and 0 sequences with no taxonomy mapping)
Writing kmer counts to database.kdb.counts...
LCA database created. [28m27.253s]
Creating database summary report database.report.tsv ...
/ccb/sw/bin/classify -d ././database.kdb -i ././database.idx -t 32 -r database.report.tsv -a ././taxDB -p 12
Database ././database.kdb
Loaded database with 2505641687 keys with k of 31 [val_len 4, key_len 8].
Reading taxonomy index from ././taxDB. Done.
705 sequences (3298.43 Mbp) processed in 153.354s (0.3 Kseq/m, 1290.51 Mbp/m).
705 sequences classified (100.00%)
0 sequences unclassified (0.00%)
Writing report file to database.report.tsv ..
Reading genome sizes from ././database.kdb.counts ... done
Setting values in the taxonomy tree ... done
Printing classification report ... done
Report finished in 0.006 seconds.
Finishing up ...Database construction complete. [Total: 1h36m33.683s]
You can delete all files but database.{kdb,idx} and taxDB now, if you want
Here are the contents of DBDIR:
$ ls DBDIR/
DBDIR/database0.kdb       DBDIR/database.idx   DBDIR/database.kdb         DBDIR/database.kraken.tsv  DBDIR/library-files.txt  DBDIR/taxDB
DBDIR/database-build.log  DBDIR/database.jdb   DBDIR/database.kdb.counts  DBDIR/database.report.tsv  DBDIR/seqid2taxid.map

DBDIR/library:
vertebrate_mammalian

DBDIR/taxonomy:
citations.dmp  database-build.log  delnodes.dmp  division.dmp  gc.prt  gencode.dmp  merged.dmp  names.dmp  nodes.dmp  readme.txt  taxdump.tar.gz
I tried creating a new krakenuniq library, and now I'm getting the following:
krakenuniq-build --kmer-len 31 --build --threads 12 --db $DB
Kraken build set to minimize disk writes.
Found 10 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using /tmp/global2/nyoungblut/code/dev/Struo2/bin/scripts/krakenuniq/jellyfish-install/bin/jellyfish
Hash size not specified, using '32573424'
/tmp/global2/nyoungblut/code/dev/Struo2/bin/scripts/krakenuniq/jellyfish-install/bin/jellyfish: error while loading shared libraries: libjellyfish-1.1.so.1: cannot open shared object file: No such file or directory
I installed krakenuniq v0.7.3 via:
git clone https://github.com/fbreitwieser/krakenuniq
cd krakenuniq
./install_krakenuniq /PATH/TO/INSTALL_DIR
...since that version isn't on bioconda yet
Did jellyfish compile and install properly? Can you check if /tmp/global2/nyoungblut/code/dev/Struo2/bin/scripts/krakenuniq/jellyfish-install/bin/jellyfish works? If you have jellyfish1 installed elsewhere, you can specify its path with the appropriate option to build.
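For example, something like this (with the path adjusted to your install) should show whether the binary can resolve its shared libraries:
ldd /tmp/global2/nyoungblut/code/dev/Struo2/bin/scripts/krakenuniq/jellyfish-install/bin/jellyfish
# any "not found" entry (e.g. libjellyfish-1.1.so.1) is a library the loader cannot locate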
There may be a problem with your environment. A simple:
export LD_LIBRARY_PATH=/tmp/global2/nyoungblut/code/dev/Struo2/bin/scripts/krakenuniq/jellyfish-install/lib/
should fix it, but in general it should not be necessary.
Yeah, the path was just messed up.
The run worked:
Kraken build set to minimize RAM usage.
Found 10 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Skipping step 5, taxDB exists.
Skipping step 6, LCAs already set.
Database construction complete. [Total: 0.014s]
You can delete all files but database.{kdb,idx} and taxDB now, if you want
...but the set_lcas: unable to open database.idx: No such file or directory error is generated if you try to re-build the database after building (or attempting to build) it once.
Thank you for reporting this bug -- it must have been there for a while. I fixed it; please go to your krakenuniq folder, git pull, and reinstall.
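In other words, roughly the following (re-using the same install destination as before; paths are placeholders):
cd krakenuniq   # the existing clone from the install step above
git pull
./install_krakenuniq /PATH/TO/INSTALL_DIR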
Yep, that fixed the issue. Thanks @alekseyzimin for all of your help!
krakenuniq-build died due to an out-of-memory error. I then tried running krakenuniq-build --work-on-disk, and the job took ~5 seconds; however, the job never generated the database.kdb output file. If I instead don't use --work-on-disk, krakenuniq-build seems to actually work on producing the database.kdb output. I'm using krakenuniq=0.6 due to https://github.com/fbreitwieser/krakenuniq/issues/95