Open tbooth opened 1 year ago
Conda is a common and useful tool, but it is simply invoked, not introduced. Explain what it is (a Python distribution with virtual environment isolation), how it helps (simplifies dependency management), and how to use it.
I'm not sure the setup page is the place for a Conda tutorial, but I've linked to Ep 10 which has this info.
The instructions as written appear to update an existing environment, not create a new one.
I've emphasised that conda env update
really does create the environment. I agree that many people would assume this command would only update an existing environment. Conda has many quirks.
The environment is named "snakemake_dash". Why?
Good point. A relic of where the material originated. I've renamed it to `snakemake_capentry' and modified/removed some other references to the DaSH project.
The conda_env.yaml file contains a whole lot of specific packages. Consider filtering this to specify just those packages you would install manually: snakemake, fastqc, kallisto, etc. Let conda fill in the full dependency graph.
I did this and made files/conda_env_min.yaml, but I believe that, on Linux at least, using the full manifest is more likely to work and get the desired environment. My experience of Conda is it tends to assume that upgrades of dependent packages are always compatible, but in fact breaking changes are common. The tbb
dependency mentioned in ep 10 is a case in point.
From @cmeesters:
the environment file is frighteningly complex: too many dependencies are defined explicitly - conda (and its implementations) are defined to resolve dependencies automatically. Please restrict yourself to the necessary applications.
versions are pinned for Snakemake and other dependencies, rather than defining minimum versions >=. The way it is, becoming outdated is only a question of time.
new users should be introduced into creating separate environments, a conda environment can be created with the required files in one command
As Snakemake recommends mamba, mamba should be mentioned, too.
the environment file is frighteningly complex: too many dependencies are defined explicitly - conda (and its implementations) are defined to resolve dependencies automatically. Please restrict yourself to the necessary applications.
This is addressed in my comment above. It's not a stylistic choice, but a practical one borne out of experience. Having a frozen Conda environment avoids random breakage as available dependencies change - as long as the packages are on the server it will always work. If the file looks "frightening" it's not my fault - this is just what the files created by conda env export
always look like.
versions are pinned for Snakemake and other dependencies, rather than defining minimum versions >=. The way it is, becoming outdated is only a question of time.
Pinning the Snakemake version is a deliberate choice. Snakemake has been known to change command syntax even outside of major version changes (see https://github.com/snakemake/snakemake/issues/283) and last time I allowed a Snakemake upgrade (earlier this year) I had to go through and make several changes to the course text. We absolutely cannot allow conda to install the latest version of Snakemake - it has to be done as part of manual lesson maintenance, and tested.
As Snakemake recommends mamba, mamba should be mentioned, too.
My understanding is that the explicit mamba
command is obsolete, so if the Snakemake documentation is still recommending running mamba
I think this is what should be updated.
This is already stated in the callout: "Setting solver: libmamba
is nowadays preferred to explicitly running the mamba command."
new users should be introduced into creating separate environments, a conda environment can be created with the required files in one command
This option is addressed in episode 10, which I already link to from the setup page. I don't wan't to start a Conda tutorial within the setup page - the idea here is to provide commands that will work reliably, not present all possible options.
Additionally from @cmeesters
the directory for conda/mamba/micromamba does not need to be created, deviating from the standard setup is not necessary.
I've removed the mkdir
command. I guess we can reasonably assume the user does not have a file named installer.sh
already in their current directory. If they do, these instructions now break.
Ah, but conda env export
is specific to recreate one environment known to be working. This assumption might or might not hold.
Occasionally, scientific software sees bug fixes. When you then need to update one application, and you are facing too many pinned libraries, conda will cowardly refuse to comply as you narrowed the options of the solver. Consider the setup for the "snakemake-tutorial-data" - it only defines the necessary application / libraries. Notice the fallback to pip for pysaml? That is because the library dependencies in conda are temporarily broken.
In your specific case, the opposite happened: I had to downgrade kallisto because on my AMD Ryzen, the current version always crashed with "illegal instruction".
So for better maintainability, I recommend boiling the environment specs to the minimum.
However, if you wish to have this enormous environment file, I will not argue further. After all, the community of developers for this review ought to decide, not a particularly grumpy reviewer. ;-)
The pinning of Snakemake is a different matter: If you strive for long-term maintainability with this course, you should consider keeping up to date with the main tool you are teaching.
As far as the minimal environment specs are concerned, they are here:
https://carpentries-incubator.github.io/snakemake-novice-bioinformatics/files/conda_env_min.yaml
This file isn't new, but its not currently linked in the setup page because what I did was create a new environment from this file and then freeze it to make conda_env.yaml
. If you use the minimal file directly, you'll get something a bit different as package versions change over time, and you'll quite possibly find that things are then out of step with the examples in the episodes.
I don't see that using a minimal environment would help with your Kallisto problem. In fact, the fix here is to re-generate the frozen environment using the more compatible version of Kallisto. Then it will work for everyone. Can you let me know what version works for you on Ryzen?
If you strive for long-term maintainability with this course, you should consider keeping up to date with the main tool you are teaching.
I think we're in full agreement about this. In maintaining the course I have an ongoing job to test the material with the newest Snakemake and make changes as necessary. The process is:
conda_env_min.yaml
conda_env.yaml
But this isn't covered in the setup.md
because it's not the tutor's job, and certainly not the learners job. It's the course maintainers job. So I'm going to add this explanation into CONTRIBUTING.md
and put a link in setup.md
for anyone who is curious to see.
I have fixed the frozen Conda environment to use a more compatible version of Kallisto. Fortunately I was able to do this without changing any of the other software versions, aside from the hdf5 library.
from @tkphd
Conda is a common and useful tool, but it is simply invoked, not introduced. Explain what it is (a Python distribution with virtual environment isolation), how it helps (simplifies dependency management), and how to use it.
The instructions as written appear to update an existing environment, not create a new one. The environment is named "snakemake_dash". Why? The conda_env.yaml file contains a whole lot of specific packages. Consider filtering this to specify just those packages you would install manually: snakemake, fastqc, kallisto, etc. Let conda fill in the full dependency graph.