fbreitwieser / krakenuniq

🐙 KrakenUniq: Metagenomics classifier with unique k-mer counting for more specific results
GNU General Public License v3.0

Large logs #63

Open richardmleggett opened 4 years ago

richardmleggett commented 4 years ago

Hi,

First off, many thanks to the authors for a fantastic tool.

I have an issue running krakenuniq-build with the --max-db-size option. I'm getting an enormous number of messages along the lines of: "kmer found in sequence w/ taxid 1002994464 that is not in database"

I'm running in an HPC environment, and so far the log is 3.5 TB and growing.

Presumably these messages are related to kmers that are being removed to shrink the database?

Is there a way to turn the messages off? I'm running out of space.

Also, at what point in the process are these occurring? It looks like the database is built, so is this just a check at the end?

Many thanks, Richard

jvolkening commented 4 years ago

I also receive those messages when using --max-db-size. KrakenUniq removes kmers from the database during shrinking, but then warns that they are missing during the set_lca stage. They are not errors per se and don't affect the final database, but they are annoying, create huge logs, and prevent the user from seeing other useful status messages. The issue is mentioned in #44, but that was over a year ago.

Locally, I have dealt with this temporarily by patching the build_db.sh script to remove the -v (verbose) parameter from all calls to set_lca. This silences the warning in question as well as one other warning about skipping sequences with missing tax IDs. It does not require recompiling anything, so I can apply it to a conda installation. Obviously this is a hack, and a longer-term solution is needed. I would submit a PR removing that warning from set_lcas.cpp, but it doesn't look like there has been any activity here since the last release a year ago.
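For anyone who wants to script the same workaround, here is a rough Python sketch of the edit described above. It is not the exact patch: the helper name strip_verbose is made up for illustration, and it assumes that each set_lca call in build_db.sh sits on a single line with -v as a standalone, whitespace-separated flag. A call split across lines with backslash continuations would need handling by hand. Back up the script and diff the result before using it.

```python
import re

# Matches any command name beginning with "set_lca" (assumption about how
# the build script spells the call; check your copy of build_db.sh).
_SET_LCA_CALL = re.compile(r"\bset_lca")

# Matches a standalone "-v" flag preceded by a space or tab and followed by
# whitespace or end of line. Flags such as "-vx" are left untouched.
_VERBOSE_FLAG = re.compile(r"[ \t]-v(?=\s|$)")


def strip_verbose(script_text: str) -> str:
    """Return script_text with -v removed from non-comment set_lca lines."""
    patched = []
    for line in script_text.splitlines(keepends=True):
        is_comment = line.lstrip().startswith("#")
        if not is_comment and _SET_LCA_CALL.search(line):
            line = _VERBOSE_FLAG.sub("", line)
        patched.append(line)
    return "".join(patched)


# Hypothetical usage; the path depends on your installation:
#
#   from pathlib import Path
#   script = Path("/path/to/build_db.sh")
#   script.write_text(strip_verbose(script.read_text()))
```

Only lines that invoke set_lca are touched, so -v flags passed to other commands in the script stay as they are.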

richardmleggett commented 4 years ago

Thanks, that's really useful.