fbreitwieser / krakenuniq

🐙 KrakenUniq: Metagenomics classifier with unique k-mer counting for more specific results
GNU General Public License v3.0

usage of work-on-disk #82

Open braffes opened 3 years ago

braffes commented 3 years ago

Hi,

I see there is an option --work-on-disk to use less RAM, but when I tried to use this option, the software gave me this message:

srun --mem=300G --cpus-per-task=10  krakenuniq-build --db DBDIR  --work-on-disk  --verbose --threads 10                                                                                                                                                              
Found jellyfish v1.1.12
Kraken build set to minimize RAM usage.
Found 1 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using jellyfish
Hash size not specified, using '2574908690'
K-mer set created. [5m10.872s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: Getting database into memory ...Loaded database with 2505421358 keys with k of 31 [val_len 4, key_len 8].
Loaded database with 2505421358 keys with k of 31 [val_len 4, key_len 8].
db_sort: Sorting ...db_sort: Sorting complete - writing database to disk ...
K-mer set sorted. [36m23.975s]
Creating seqID to taxID map (step 4 of 6)..
1278 sequences mapped to taxa. [0.013s]
Creating taxDB (step 5 of 6)...
Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 2361119 taxa
taxDB construction finished. [2m59.077s]
Building  KrakenUniq LCA database (step 6 of 6)...
Reading taxonomy index from taxDB. Done.
You need to operate in RAM (flag -M) to use output to a different file (flag -o)
xargs: cat: terminated by signal 13

As shown in the log, it says not to use the flag -o, but I am not using it. Is this normal?

The version of krakenuniq is 0.5.8. I am using Slurm as a scheduler and the operating system is CentOS 8.
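
For reference, here is the same job as a Slurm batch script (a sketch only; the #SBATCH directives mirror the srun resources above, and no additional options are assumed):

#!/bin/bash
#SBATCH --mem=300G
#SBATCH --cpus-per-task=10
# Same build invocation as above, submitted with sbatch instead of srun
krakenuniq-build --db DBDIR --work-on-disk --verbose --threads 10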

Thanks for your attention,

Brice

nick-youngblut commented 2 years ago

@braffes did you ever find a fix to this issue?

braffes commented 2 years ago

I removed the option --work-on-disk. I did not look into how to implement a fix for this issue, sorry.
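
For anyone hitting the same error, that workaround is simply the original command with the flag dropped (same database directory and thread count as above):

srun --mem=300G --cpus-per-task=10 krakenuniq-build --db DBDIR --verbose --threads 10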

nick-youngblut commented 2 years ago

Thanks @braffes for the quick reply! It's a bummer that --work-on-disk currently doesn't work; this limits one's ability to create large KrakenUniq databases.

salzberg commented 2 years ago

Nick, we are in the process of storing several very large (up to 390 GB) KrakenUniq databases on Amazon, so you can simply download them rather than having to build them. They're already up there, but we need to check them first, and then we'll put a link on the KrakenUniq GitHub site. We'll have the links here as well: https://benlangmead.github.io/aws-indexes/
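
Once the links are posted, using a prebuilt database typically looks roughly like the sketch below. The URL is a placeholder, not a real link; take the actual file name from https://benlangmead.github.io/aws-indexes/ and check the tarball layout, which may differ:

# Hypothetical URL for illustration only -- substitute the real link from the index page
wget https://example.com/krakenuniq-db.tar.gz
mkdir -p DBDIR
tar -xzf krakenuniq-db.tar.gz -C DBDIR
# Classify reads against the downloaded database; read-level output goes to stdout
krakenuniq --db DBDIR --threads 10 --report-file report.tsv reads.fa > classified.tsv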

nick-youngblut commented 2 years ago

Do you have a KrakenUniq reference database for all reference species in GTDB release 207? That is what I'm currently working on.

salzberg commented 2 years ago

No, not that one; you'll have to create it. (I'm not sure what GTDB is.)

nick-youngblut commented 2 years ago

The GTDB is a newer, sane taxonomy for bacteria and archaea, in which the taxa are defined directly from the genome phylogeny: https://gtdb.ecogenomic.org/

I'm going to need --work-on-disk to create the database. I ran out of memory on a node with 1 TB of memory when running krakenuniq-build.
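
In the meantime, the kraken-style size-capping options may help keep the build within memory. This is a sketch under the assumption that this version of krakenuniq-build exposes --max-db-size and --jellyfish-hash-size (the "no database reduction requested" message at step 2 of the log above suggests a reduction option exists); please confirm the flag names and units with krakenuniq-build --help first:

# Assumed flags -- verify with krakenuniq-build --help before relying on them.
# --max-db-size caps the final database size (GB, following the kraken-build convention);
# --jellyfish-hash-size pre-sets the step-1 hash size instead of letting it auto-size.
krakenuniq-build --db DBDIR --threads 10 --verbose --max-db-size 400 --jellyfish-hash-size 6400M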

nick-youngblut commented 2 years ago

@salzberg any progress on fixing the --work-on-disk issue? I'd be happy to help, if possible.