Open suhrig opened 3 years ago
Thanks for the repro details. I'm unable to reproduce this locally, but I suspect that's because NCBI has added many additional SARS-CoV2 genomes since I built my VIRUSBreakend database. It'll take me some time to rebuild the database so that I can reproduce the problem (you may have noticed that NCBI downloads are unreliable).
In the meantime, I've made a reference database available that doesn't have this issue: https://melbourne.figshare.com/articles/dataset/virusbreakenddb_20210401_tar_gz/14782299
Thank you so much for providing a prebuilt database. I can confirm the issue does not happen with your build. I consider this solved and we can close the ticket, unless you want to keep it open to further investigate the issue with a newer build of the database.
I'll keep it open until I fix the underlying issue to work with the latest NCBI neighbour genomes.
Hi Daniel,
I have a sample where Kraken2 reports a hit allegedly originating from SARS-CoV2. It's probably a false positive, but that is not the point. The point is that gridss.kraken.ExtractBestViralReference runs out of memory whenever a hit from SARS-CoV2 is given as input. I assume this is because there are a ton of sequences from this virus in the database, since everyone sequences this virus nowadays, and enumerating all the kmers from all of these sequences probably needs quite some RAM (and also CPU time).
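To illustrate the suspicion above, here is a toy sketch of how the set of distinct k-mers keeps growing as more near-identical genomes are added. It is not GRIDSS code and makes no claim about how ExtractBestViralReference works internally; the function names, genome length, k, and mutation count are all made up for illustration.

```python
import random


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mutate(seq, n_mutations, rng):
    """Return a copy of seq with n_mutations single-base substitutions."""
    chars = list(seq)
    # Pick distinct positions so exactly n_mutations bases change.
    for pos in rng.sample(range(len(chars)), n_mutations):
        chars[pos] = rng.choice([b for b in "ACGT" if b != chars[pos]])
    return "".join(chars)


def distinct_kmers_by_genome_count(n_genomes, genome_len=2000, k=16,
                                   n_mutations=5, seed=0):
    """Track the number of distinct k-mers seen after each added genome.

    Every genome is a lightly mutated copy of one random reference,
    which is a stand-in (assumption) for many closely related viral
    assemblies in a database.
    """
    rng = random.Random(seed)
    reference = "".join(rng.choice("ACGT") for _ in range(genome_len))
    seen = set()
    sizes = []
    for _ in range(n_genomes):
        seen |= kmers(mutate(reference, n_mutations, rng), k)
        sizes.append(len(seen))
    return sizes


if __name__ == "__main__":
    sizes = distinct_kmers_by_genome_count(50)
    print("distinct k-mers after 1 genome:  ", sizes[0])
    print("distinct k-mers after 50 genomes:", sizes[-1])
```

Even though every genome differs from the reference by only a handful of bases, each substitution contributes up to k new k-mers, so the in-memory set never stops growing as genomes are added. Whether this is what actually exhausts the heap here is only my assumption.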
Here is a minimal test case to reproduce:
Make a file summary_taxa.tsv with the following content:
Then, run the following command:
After a few minutes, the command terminates with:
I had to increase the Java heap space to 50 GB to make it work. It took 4 hours to run, but did succeed in the end.
Regards, Sebastian