Gibbons-Lab / medi

Metagenomic Estimation of Dietary Intake and Content.

build_kraken `add_existing` fails at least for bacteria group #4

Open cschu opened 8 months ago

cschu commented 8 months ago

When running `build_kraken`, the `add_existing` process fails with messages such as:

```
rsync_from_ncbi.pl: unexpected FTP path (new server?) for https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/762/265/GCF_000762265.1_ASM76226v1
```

This seems to be a known, yet unsolved, issue with kraken2; see

https://github.com/DerrickWood/kraken2/issues/518

and

https://github.com/DerrickWood/kraken2/issues/797

Patching line 46 in rsync_from_ncbi.pl doesn't do the trick, nor does running with the latest kraken2 version (2.1.3). I haven't yet tried downgrading to earlier versions.

This is of course more of an issue for the kraken2 author(s), but it currently blocks the kraken2 database installation for MEDI.

cdiener commented 8 months ago

That's true; I think I also patched kraken2. Have you tried this fix? Also make sure to rerun the full database build, since it's a bit unclear what will happen if you run `add_existing` on a cached build. I could try the new k2 script from kraken2 instead, but it isn't really maintained either.

cschu commented 8 months ago

That fix also didn't do it, but using k2 seems to work (although I will only be able to confirm once viral and bacteria have finished downloading).

cdiener commented 8 months ago

Good to know. Maybe we should switch to it then. Is the k2 command installed in the conda version by default?

cschu commented 8 months ago

I don't think it is. In my container installation (which uses conda), k2 (and k2mask) are located in the libexec folder, so I softlinked them to a location that's in the PATH.

For k2mask to run in this setup, I also had to modify LD_LIBRARY_PATH so that it includes the /path/to/conda/lib folder.
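
For reference, the workaround boils down to something like this (a sketch assuming the environment root is `$CONDA_PREFIX` and that both scripts sit directly in `libexec`; the exact layout may differ between builds):

```bash
# Sketch of the workaround above; assumes the conda environment root is
# $CONDA_PREFIX and that k2/k2mask sit directly in libexec.
ln -s "$CONDA_PREFIX/libexec/k2" "$CONDA_PREFIX/bin/k2"
ln -s "$CONDA_PREFIX/libexec/k2mask" "$CONDA_PREFIX/bin/k2mask"

# k2mask additionally needs the conda libs at runtime in this setup
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:${LD_LIBRARY_PATH:-}"
```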

cdiener commented 8 months ago

Okay, that's fine. We could also just add a step to the pipeline that finds k2.

cschu commented 8 months ago

Or have the user set it via params.
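
Something like this hypothetical sketch could combine both ideas; `params.k2_bin` is an invented parameter name, not an existing MEDI option:

```nextflow
// Hypothetical sketch: locate k2 once, allowing a user override via params.
// `params.k2_bin` is an invented name, not an existing MEDI option.
params.k2_bin = ""

process find_k2 {
    output:
    env K2  // resolved path to the k2 script, for downstream processes

    script:
    """
    # prefer the user-supplied location, fall back to the PATH
    K2="${params.k2_bin}"
    [ -n "\$K2" ] || K2="\$(command -v k2)"
    [ -n "\$K2" ] || { echo "k2 not found" >&2; exit 1; }
    """
}
```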

cschu commented 8 months ago

So, everything has downloaded fine. Unfortunately, the kraken-build process failed (upcoming issue/PR), and thanks to nextflow's stupid caching behaviour (despite having `cache = "lenient"` in my run.config) it started to download everything again. At that point I just ran the remaining commands manually. However, I have the feeling that this didn't go correctly, as the self_classify step reported every input sequence as unclassified, which I assume is not expected.

cdiener commented 7 months ago

Yeah, sorry. It's a bit tricky with Kraken, and I haven't found a good way to cache only the files changed by the addition steps. As you figured out yourself, you can just build again, but if you restart the pipeline it will download everything again, and that can break the db.

How large is your hash.k2d? It should be around 500 GB; otherwise the new download probably reset the db.

cschu commented 7 months ago

Yeah, I get that -- it's not completely straightforward. Given that the nextflow cache can be super error-prone, I would probably lean toward a self-contained installation script (python/bash) if I had to do it myself. That would allow much better control over certain steps.

Re the hash.k2d -- that explains a lot now...

Your `--max-db-size ${Integer.parseInt(params.max_db_size)*Integer(1e9)}` caused a nextflow error for me (with the latest 23.something version, both the `Integer.parseInt` and the `Integer(1e9)` parts), and then I accidentally replaced it with 1000000 instead of 1000000000 (so the hash is 500MB... urgh). Time to rebuild again.
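
For the record, a corrected expression could look like the sketch below (the actual fix in the repo may differ); the point is to use long arithmetic, since e.g. 500 * 1e9 does not fit into an `Integer`:

```nextflow
// Sketch of a corrected expression (the actual fix in the repo may differ):
// long arithmetic avoids the Integer overflow for databases above ~2 GB.
def max_db_bytes = (params.max_db_size as long) * 1_000_000_000L

// used in the process script as: --max-db-size ${max_db_bytes}
```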

cdiener commented 7 months ago

Okay, I'll get on this. I will keep future changes in a separate branch until they are tested to avoid those issues.

I agree that this is a hard problem for caching. The nextflow cache is pretty good; it's more the algorithm itself. A workaround would be to recommend not using the resume option, or to bundle the additions. I wouldn't go with a bash script, because it wouldn't resolve the root issue (ending up with a broken DB if intermediate steps fail) and you would lose the resource management and monitoring.

Though you could probably also make the cache work by figuring out exactly which files are being changed and only tracking those. This would, for instance, ensure that a failed kraken2-build run won't invalidate the fasta file addition.

cdiener commented 7 months ago

The integer parsing should be fixed now.

cschu commented 7 months ago

> Okay, I'll get on this. I will keep future changes in a separate branch until they are tested to avoid those issues.

Makes perfect sense.

> The nextflow cache is pretty good; it's more the algorithm itself.

No matter which part of it -- from experience, it doesn't do what it should in too many cases, even with the lenient option set. In the past 3 years I have witnessed literally tens of thousands of computations of various sizes being wasted due to resume failures. (Nevertheless, nf is still my go-to workflow manager, as Snakemake is worse and all the other options are not really options...)

I don't have concrete evidence for this, but some of my recent experiments (aside from trying to run MEDI) give me reason to think that using a directory with changing content as process input (such as the db directory, which is modified and passed around between processes in the MEDI installation workflows) may cause problems with the cache as well.

> A workaround would be to recommend not using the resume option, or to bundle the additions.

Yeah, that is a bit of a stretch, though. Each failure would mean having to start from scratch; people would give up on the installation quite quickly.

> I wouldn't go with a bash script, because it wouldn't resolve the root issue (ending up with a broken DB if intermediate steps fail) and you would lose the resource management and monitoring.

The resource management and monitoring wouldn't matter so much for the installation routine, though. I think that with a bit of dedicated error handling (assuming that kraken2-build or k2 fail gracefully), one could even get the individual download (`--add-to-library`/`--download-library`) steps to work.
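
A rough sketch of that idea (the library names are just examples; the `kraken2-build` flags shown are the standard ones):

```bash
# Retry each library download a few times and fail loudly rather than
# continuing with a half-populated db. Library names are examples.
set -euo pipefail
for lib in archaea bacteria viral; do
    for attempt in 1 2 3; do
        if kraken2-build --download-library "$lib" --db db; then
            break
        elif [ "$attempt" -eq 3 ]; then
            echo "downloading $lib failed after $attempt attempts" >&2
            exit 1
        fi
    done
done
```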

cdiener commented 7 months ago

> I don't have concrete evidence for this, but some of my recent experiments (aside from trying to run MEDI) give me reason to think that using a directory with changing content as process input (such as the db directory, which is modified and passed around between processes in the MEDI installation workflows) may cause problems with the cache as well.

Yes, that is how it works. Basically, nextflow runs a stat on the input, and if it changes, it considers the cache invalidated. That kind of makes sense, because Nextflow can't know which changes are okay and which ones are not for maintaining the cache. I probably just need to dig more into what is actually modified by `--add-to-library`/`--download-library` and only cache those parts, which should resolve the issue with the build.
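
As an illustrative sketch of that approach (not the actual MEDI code; which `.k2d` files a given step really needs is an assumption):

```nextflow
// Illustrative sketch (not MEDI code): stage only the files a step actually
// reads, instead of the whole mutable db directory, so that unrelated writes
// to that directory do not invalidate the cache for this step.
process self_classify {
    input:
    path "db/hash.k2d"   // the standard kraken2 database files ...
    path "db/opts.k2d"
    path "db/taxo.k2d"   // ... staged into a minimal db/ directory
    path seqs

    output:
    path "classified.tsv"

    script:
    """
    kraken2 --db db --output classified.tsv ${seqs}
    """
}
```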

> The resource management and monitoring wouldn't matter so much for the installation routine, though. I think that with a bit of dedicated error handling (assuming that kraken2-build or k2 fail gracefully), one could even get the individual download (`--add-to-library`/`--download-library`) steps to work.

I'm not sure I understand what you mean by installation routine. Do you mean the Kraken2 installation?

cschu commented 7 months ago

> I'm not sure I understand what you mean by installation routine. Do you mean the Kraken2 installation?

I mean `build_kraken.nf`, i.e. part 2 of the MEDI "database installation routine" :)