BIMSBbioinfo / pigx_sars-cov-2

PiGx SARS-CoV-2 wastewater sequencing pipeline
GNU General Public License v3.0

`build` workflow fails due to missing databases #141

Closed jonasfreimuth closed 1 year ago

jonasfreimuth commented 2 years ago

As the verbose logs indicate, the build command (log) currently fails because the databases cannot be found on the runner.

Proposed solution

Add Snakemake rules for database downloads. These should be optional and controlled via the settings file, but active by default.
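A minimal sketch of the "optional, but active by default" part, written as a plain Python helper of the kind a Snakefile could call. The settings keys (`databases`, `download`, `paths`) and the function name are hypothetical and not taken from the pipeline's actual settings file.

```python
from pathlib import Path


def databases_to_download(settings, base_dir="."):
    """Return the names of databases that are missing and may be downloaded.

    `settings` is assumed to be the parsed settings file (a dict). All key
    names used here are hypothetical.
    """
    db_settings = settings.get("databases", {})

    # Downloads are active unless the settings file explicitly disables them.
    if not db_settings.get("download", True):
        return []

    missing = []
    for name, rel_path in db_settings.get("paths", {}).items():
        # Only databases that do not exist yet need a download rule to run.
        if not (Path(base_dir) / rel_path).exists():
            missing.append(name)
    return sorted(missing)
```

With this shape, a user who already has the databases in place, or who sets `download: false`, never triggers a download.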

rekado commented 2 years ago

We could also cheat and download the databases in the GitHub Action (and cache them). Pretend the databases exist.

That whole reproducibility thing strikes again: when we depend on a URL that is not guaranteed to be an immutable pointer to an immutable resource, we're essentially opening up a reproducibility leak. Controlling the download via the settings file is necessary, but I'm not sure if it's sufficient.

We also need to ensure that the databases are not downloaded into read-only directories (= where the pipeline might be installed).
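One way to guard against this, sketched under assumptions: try the configured database directory first and fall back to another location if it cannot be created or written to. The function name and the idea of a fallback directory are illustrative, not part of the pipeline.

```python
import os
from pathlib import Path


def resolve_database_dir(preferred, fallback):
    """Return the first of the two directories that can be created and written to.

    Hypothetical helper: `preferred` would be the configured database path,
    `fallback` something like a directory under the current working directory.
    """
    for candidate in (Path(preferred), Path(fallback)):
        try:
            # Fails with an OSError subclass on read-only or invalid locations.
            candidate.mkdir(parents=True, exist_ok=True)
        except OSError:
            continue
        # The directory may already exist but still be read-only for this user.
        if os.access(candidate, os.W_OK | os.X_OK):
            return candidate
    raise PermissionError(
        f"Neither {preferred} nor {fallback} is a writable database directory"
    )
```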

jonasfreimuth commented 2 years ago

Well, we need to have the databases on the runner; the reproducibility thing is a general problem. To have the databases on the runner, we need to download them at some point and then cache them (I agree on that). I think the cheating option might be a good idea for the short term, but for the long term it would be better to be able to specify everything via the settings file.

For reproducibility, the only thing we can do, short of archiving all the databases ourselves, is to record when which version of each database was downloaded from where. I already have a draft of what this could look like here, but with individual rules it should be possible to implement this with more robustness...
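As an illustration only (this is not the draft linked above), such a record could be a small JSON manifest with one entry per download. The field names and the function name are assumptions; the SHA-256 checksum is an addition that would let a later run detect that the content behind a URL has changed.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def record_download(db_file, source_url, version, manifest_path):
    """Append a provenance entry for a downloaded database to a JSON manifest.

    Hypothetical helper; assumed to be called right after the download finishes.
    """
    db_file = Path(db_file)

    # Hash the file in chunks so large databases do not have to fit in memory.
    digest = hashlib.sha256()
    with db_file.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)

    entry = {
        "file": db_file.name,
        "source_url": source_url,
        "version": version,
        # Time of recording, in UTC; approximates the download time.
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
        "sha256": digest.hexdigest(),
    }

    manifest_path = Path(manifest_path)
    entries = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    entries.append(entry)
    manifest_path.write_text(json.dumps(entries, indent=2))
    return entry
```

With one download rule per database, each rule could call a helper like this once and emit its own manifest entry.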

As for the location of the databases: by default, they should be created where the pipeline is executed, shouldn't they? Right now, that is already where the default database paths point.