CDCgov / datasets-sars-cov-2

Benchmark datasets for WGS analysis of SARS-CoV-2. (https://peerj.com/articles/13821/)
Apache License 2.0

bioconda plans? #1

Closed kapsakcj closed 2 years ago

kapsakcj commented 3 years ago

I'm curious if there are plans to add this tool to bioconda?

It would greatly benefit the community if one could easily install the dependencies via conda or use the docker image that is auto-generated via the biocontainers project.

I'd be happy to write up a Dockerfile, but it would probably reach more users if it were available on bioconda.

lskatz commented 3 years ago

No plans at this time for Conda, since the Perl modules are pretty easy to install. However, does defining the software dependencies help with installation? I'm trying to understand the reasoning for packaging it in Conda.

Would a container also hold the actual data? That would be pretty nice.

kapsakcj commented 2 years ago

Yes, I think it would help to spell out the Perl modules and the minimum required versions.

The reasoning is just for convenience. Ideally it would be a one-line command to install this tool & have instant access to these datasets.

# alternatively use mamba instead of conda
$ conda create --name datasets-sars-cov-2 datasets-sars-cov-2
$ conda activate datasets-sars-cov-2
(datasets-sars-cov-2) $ GenFSGopher.pl [...]

> Would a container also hold the actual data? That would be pretty nice.

No, I don't think it would be wise to have a container chock full of fastq files which may amount to multiple GBs of data. I haven't downloaded all this data yet, so I'm not sure how large it actually is.

lskatz commented 2 years ago

OK, so maybe if I package this all correctly as an actual Perl package, it will spell out the dependencies. Installation would follow the usual perl Makefile.PL && make && make install sequence, with the dependencies clearly listed in Makefile.PL. Would that address it?

If edirect is a pain point, would it help to have a separate project to make an edirect container?

pvanheus commented 2 years ago

The installation procedure is really quite complex. I agree with @kapsakcj that a conda-based install makes sense.

What is make used for besides running the Makefile that installs Spreadsheet::XLSX and Spreadsheet::ParseExcel? And is the built-from-source version of sratoolkit needed or can the sra-tools from bioconda be used? Finally, Perl 5.12.0 is a very old and rather odd dependency - is this specific version required?
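For anyone wanting to see whether the two Perl modules named above are already available before running make, one option is to ask perl to load each of them. This is only a rough sketch and not part of the repository: the helper name have_perl_module is hypothetical, and it says nothing about which module versions are required.

```shell
# Hypothetical helper (not from the repo): succeeds if perl can load the module.
have_perl_module() {
  perl -M"$1" -e 1 >/dev/null 2>&1
}

# Report the status of the two modules mentioned in this thread.
for module in Spreadsheet::XLSX Spreadsheet::ParseExcel; do
  if have_perl_module "$module"; then
    echo "$module: found"
  else
    echo "$module: missing"
  fi
done
```

A "missing" line only means that the perl on PATH could not load the module (or that perl itself is absent); it does not tell you how the Makefile would install it.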

pvanheus commented 2 years ago

Ok I figured out what you're doing with make - everything!

I am writing a bioconda package to install this repo and its associated script, but in the meantime, the conda packages that you need are: perl, make, entrez-direct, sra-tools, wget, and coreutils.
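As a stopgap, the package list above could be turned into a dedicated environment roughly as follows. This is a sketch under assumptions: the environment name is a placeholder, the channel order follows bioconda's usual recommendation, the helper missing_tools is hypothetical, and the executable names checked at the end are guesses at what those packages put on PATH. The conda command is printed rather than executed so it can be reviewed first.

```shell
# Placeholder environment name; the package list is taken from the comment above.
ENV_NAME="sars-cov-2-datasets-deps"
PACKAGES="perl make entrez-direct sra-tools wget coreutils"

# Print the command instead of running it, so it can be reviewed first.
echo "conda create --name $ENV_NAME -c conda-forge -c bioconda $PACKAGES"

# Hypothetical helper: print the executables from the argument list
# that are not found on PATH, separated by spaces.
missing_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="${missing:+$missing }$tool"
    fi
  done
  printf '%s\n' "$missing"
}

# Executable names below are assumptions about what the packages provide.
echo "missing: $(missing_tools perl make esearch efetch fastq-dump wget sha256sum)"
```

Running the check again after activating the environment should show an empty "missing:" line if the assumed executables are the right ones.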

pvanheus commented 2 years ago

PR for bioconda is here: https://github.com/bioconda/bioconda-recipes/pull/30577

rpetit3 commented 2 years ago

This has been merged into bioconda

lskatz commented 2 years ago

You guys are amazing

kapsakcj commented 2 years ago

It would be good to have some documentation in the main README on how to set up the (bio)conda environment and how to run the GenFSGopher.pl script via the Docker container (biocontainer).

When I get a chance, I'll start a PR with instructions on docker usage.
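Until such instructions exist, a Docker invocation might look something like the sketch below. Everything specific here is an assumption: the helper name genfsgopher_docker_cmd is hypothetical, the image reference is a placeholder (the real name and tag would need to be looked up on quay.io/biocontainers), and mounting the current directory at /data is just one convention. The script arguments are left elided, as in the conda example earlier in the thread.

```shell
# Hypothetical helper: print a docker command that would run GenFSGopher.pl
# inside a container, with the current directory mounted at /data.
# $1 is the image reference; any remaining arguments are passed to the script.
genfsgopher_docker_cmd() {
  image="$1"
  shift
  printf '%s\n' "docker run --rm -v \"\$PWD\":/data -w /data $image GenFSGopher.pl $*"
}

# IMAGE is a placeholder, not a real image name or tag.
IMAGE="quay.io/biocontainers/PACKAGE:TAG"
genfsgopher_docker_cmd "$IMAGE" "[...]"
```

The helper only prints the command; copy it into a shell once the image name and the script arguments have been filled in.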

cc @pvanheus