NBISweden / aMeta

Ancient microbiome snakemake workflow
MIT License

envmodules configuration sections should follow conda environment files #49

Closed percyfal closed 2 years ago

percyfal commented 2 years ago

Currently, there is a separate envmodules key for every rule, but a given conda environment file is shared between rules. Since both keywords solve the same problem, envmodules should be shared between rules, following the conda environment sharing. For instance, envs/malt.yaml is shared between four rules, each of which has a separate envmodules config.

clami66 commented 2 years ago

Agreed, this would be nice. I am assuming the conda environment organization reflects some kind of functional modularity, i.e. the environments are grouped by the analyses being run?

percyfal commented 2 years ago

The conda environment is meant to provide the minimum set of (conda) dependencies to run a rule. For some rules, not everything in an environment file is needed, but the differences are so small, and conda envs so costly to install, that the fewer conda env files, the better. Loading an extra environment module or two has little overhead, so we might as well provide the same grouping.
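As a rough sketch of the proposed grouping, envmodules could be keyed by conda environment file instead of by rule, so that every rule sharing an environment file also shares its module list. Everything named here is an assumption for illustration: the rule names, the module names, and the `envs/other.yaml` entry are placeholders, not the workflow's actual configuration.

```python
# Hypothetical layout: one envmodules list per conda environment file.
# Module names are placeholders, not real module names on any cluster.
ENVMODULES_BY_ENV = {
    "envs/malt.yaml": ["placeholder-tools", "placeholder-malt"],
    "envs/other.yaml": ["placeholder-other"],
}

# Hypothetical rule names; the point is that four rules share one env file.
RULE_TO_ENV = {
    "malt_step_1": "envs/malt.yaml",
    "malt_step_2": "envs/malt.yaml",
    "malt_step_3": "envs/malt.yaml",
    "malt_step_4": "envs/malt.yaml",
    "other_step": "envs/other.yaml",
}


def envmodules_for(rule):
    """Return the module list for a rule, looked up via its conda env file."""
    env_file = RULE_TO_ENV[rule]
    # Copy the list so callers cannot mutate the shared configuration.
    return list(ENVMODULES_BY_ENV.get(env_file, []))
```

With this layout, changing the module list for `envs/malt.yaml` in one place affects all four rules, mirroring how the conda environment file is shared.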

clami66 commented 2 years ago

Right, do we also want to move envmodules.yaml someplace else? I was just looking at the docs and it might make sense to move the section to the runtime config in .profile

clami66 commented 2 years ago

though I don't know if I understood correctly how profiles are meant to be used...

percyfal commented 2 years ago

At their simplest, profiles are a directory with a config.yaml which maps snakemake options as key: value entries. So a config file with

restart-times: 2
max-jobs-per-second: 1

translates to the snakemake command line

snakemake --restart-times 2 --max-jobs-per-second 1

Many of the options can be fine-tuned for specific environments, but you don't want to retype them every time, since most likely you won't change them anyway.
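To make the mapping concrete, here is a minimal sketch of how key/value entries could be turned into command-line arguments. This is an illustration only, not Snakemake's actual implementation; the function name `profile_to_args` and the handling of booleans and lists are assumptions.

```python
def profile_to_args(options):
    """Turn profile key/value pairs into command-line style arguments.

    Illustrative only: this mimics the idea of a profile, not Snakemake's
    own option parsing.
    """
    args = []
    for key, value in options.items():
        flag = f"--{key}"
        if value is True:
            # Boolean switch: emit the flag without a value.
            args.append(flag)
        elif value is False or value is None:
            # Disabled or unset option: emit nothing.
            continue
        elif isinstance(value, (list, tuple)):
            # Multi-valued option: flag followed by each value.
            args.append(flag)
            args.extend(str(v) for v in value)
        else:
            args.extend([flag, str(value)])
    return args


# A dict standing in for the parsed contents of a profile's config.yaml.
profile = {"restart-times": 2, "max-jobs-per-second": 1, "use-conda": True}
command = " ".join(["snakemake", *profile_to_args(profile)])
print(command)
# prints: snakemake --restart-times 2 --max-jobs-per-second 1 --use-conda
```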

When it comes to submitting jobs to the cluster, there are three snakemake options that matter: --cluster, which is a command to submit jobs (a shell script that wraps sbatch, a python script, ...); --cluster-status, which polls the slurm controller for job status; and --jobscript, which provides a custom jobscript for submission that actually wraps the snakemake command. You don't need these per se, but they do provide some additional level of control. The SnakemakeProfiles slurm cookiecutter template provides these scripts, along with templates to set up config.yaml and other files.
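For a feel of what a --cluster-status script does, here is a minimal sketch, not the cookiecutter's actual script. Snakemake calls the status script with the job id as its argument and expects one of `success`, `failed` or `running` on stdout. The grouping of Slurm states below and the helper names are assumptions made for illustration.

```python
import subprocess

# Slurm job states treated as terminal failures in this sketch (an assumption;
# a real script may want to treat some of these differently).
FAILED_STATES = {
    "BOOT_FAIL", "CANCELLED", "DEADLINE", "FAILED",
    "NODE_FAIL", "OUT_OF_MEMORY", "PREEMPTED", "TIMEOUT",
}


def snakemake_status(slurm_state):
    """Map a Slurm state string to the status word Snakemake expects."""
    words = slurm_state.split()
    if not words:
        # No accounting record yet: assume the job is still on its way.
        return "running"
    # sacct may report e.g. "CANCELLED by 1000"; keep only the state itself.
    state = words[0].rstrip("+").upper()
    if state == "COMPLETED":
        return "success"
    if state in FAILED_STATES:
        return "failed"
    return "running"


def query_slurm_state(jobid):
    """Ask sacct for the state of a job (requires a Slurm installation)."""
    result = subprocess.run(
        ["sacct", "-j", str(jobid), "--format=State", "--noheader", "--parsable2"],
        capture_output=True, text=True, check=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""
```

A status script built on this would print `snakemake_status(query_slurm_state(sys.argv[1]))` and exit.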

Our example config.yaml does not provide the custom scripts mentioned above; rather, I included it to show how to configure rule-specific resources (default-resources, set-threads, etc.).

Since it is likely that one would want to use the cookiecutter, I would on the one hand advise against putting envmodules in that directory as it is prone to be overwritten. OTOH they do fit together. For now I would suggest sticking with envmodules in config, and maybe add support for an environment variable such that it could be placed in e.g. .config/snakemake/envmodules.yaml or similar?

clami66 commented 2 years ago

Yes, I think I had misunderstood how the profiles should be used.

> For now I would suggest sticking with envmodules in config, and maybe add support for an environment variable such that it could be placed in e.g. .config/snakemake/envmodules.yaml or similar?

Do you mean a .config/ folder in the installation directory? That or any other "global" location that would be suitable (maybe workflow/envs/?)

percyfal commented 2 years ago

No, I meant the "regular" config directory would be the default location, if no other parameters have been set. What do we think will be the use-case scenario? You have an analysis folder (separate from the repo) in which there is a config directory with config.yaml and samples.tsv, and the envmodules.yaml file residing in some directory accessible to all analyses. This doesn't necessarily have to be in the repo. I'm open to any suggestions, but hopefully it will become clearer when we test multiple projects.
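A minimal sketch of that lookup, assuming an environment variable takes precedence and the analysis folder's config directory is the fallback. The variable name `AMETA_ENVMODULES` is hypothetical; nothing in the workflow defines it.

```python
import os
from pathlib import Path

# Hypothetical variable name, chosen only for this illustration.
ENV_VAR = "AMETA_ENVMODULES"


def envmodules_path(workdir=".", environ=None):
    """Return where envmodules.yaml would be read from.

    The environment variable wins if set; otherwise fall back to the
    "regular" config directory inside the analysis folder.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(ENV_VAR)
    if override:
        return Path(override).expanduser()
    return Path(workdir) / "config" / "envmodules.yaml"
```

Passing `environ` explicitly keeps the function easy to test; in normal use it reads `os.environ`, so a shared file could be selected once per shell session and reused across analysis folders.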

Following up on the environment variable discussion, this is what the help for the snakemake --profile flag says:

 --profile PROFILE     Name of profile to use for configuring Snakemake.
                        Snakemake will search for a corresponding folder in
                        /etc/xdg/xdg-lxqt/snakemake and
                        /home/peru/.config/snakemake. Alternatively, this can
                        be an absolute or relative path. The profile folder
                        has to contain a file 'config.yaml'. This file can be
                        used to set default values for command line options in
                        YAML format. For example, '--cluster qsub' becomes
                        'cluster: qsub' in the YAML file. Profiles can be
                        obtained from https://github.com/snakemake-profiles.
                        The profile can also be set via the environment
                        variable $SNAKEMAKE_PROFILE. [env var:
                        SNAKEMAKE_PROFILE] (default: None)

So there is already a use case where one puts config in a .config directory. I guess this could also be the current working directory, but would it be confusing to have both a .config and a config directory to keep track of?
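To illustrate the lookup described in the help text, here is a sketch that treats the profile as a path first and then tries a site-wide and a per-user directory. It assumes plain XDG conventions and the search order listed in the help text; Snakemake's real lookup may differ in order and in how it derives the directories. The function names are made up for this example.

```python
import os
import tempfile
from pathlib import Path


def default_search_dirs(environ=None):
    """Site-wide and per-user snakemake config dirs, following XDG conventions."""
    environ = os.environ if environ is None else environ
    site = environ.get("XDG_CONFIG_DIRS", "/etc/xdg").split(":")
    user = environ.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return [Path(d) / "snakemake" for d in site if d] + [Path(user) / "snakemake"]


def find_profile(profile, search_dirs):
    """Return the directory holding the profile's config.yaml, or None."""
    # The profile may be an absolute or relative path, so try it directly first.
    candidates = [Path(profile)] + [Path(d) / profile for d in search_dirs]
    for candidate in candidates:
        if (candidate / "config.yaml").is_file():
            return candidate
    return None


# Small demonstration in a temporary directory, so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    user_dir = Path(tmp) / "user" / "snakemake"
    site_dir = Path(tmp) / "site" / "snakemake"
    (user_dir / "slurm").mkdir(parents=True)
    (user_dir / "slurm" / "config.yaml").write_text("restart-times: 2\n")
    found = find_profile("slurm", [site_dir, user_dir])
    found_relative = found.relative_to(tmp)
```

The same pattern would apply if envmodules.yaml were looked up under `.config/snakemake`: a named location is tried in a fixed list of directories, and an explicit path bypasses the search.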

percyfal commented 2 years ago

Closed via #50