MetaboHUB-MetaToul-FluxoMet / RTMet

RTMet is a data workflow to process FIA-MS data coming from a bioreactor, find metabolites and fluxes, and send a feedback command to the system.
https://rtmet.readthedocs.io
GNU General Public License v3.0

Have conda environments directly in workflow source/runs directories #52

Open elliotfontaine opened 1 month ago

elliotfontaine commented 1 month ago

Ten simple rules and a template for creating workflows-as-applications, Rule 7:

"When creating software environments, many workflow managers will save these within subfolders in the working directory. This facilitates reproducibility by keeping everything within a single subdirectory. As a result, every new analysis will generate a whole new set of environment files, which can be wasteful, especially if there are limits on the number of files and folders that can be created on a HPC cluster. Likewise, users may want to specify the installation locations, especially if databases or environments will take up a considerable amount of disk space. Having centralised locations for your environments and databases and allowing these locations to be customised by the user can alleviate this issue. Caution is advised, however, as specifying a location outside of the working or installation directories may have unforeseen consequences, such as files being moved or deleted. As such, many users will prefer to keep environments and databases in the working directory. In our templates, we use the installation directory of the command line tool as the default for both conda environments and databases as this represents the safest centralised location for these files, but users can specify the working directory if they prefer."

Right now, the workflow uses system-wide/user-wide conda environments that are installed before the workflow is ever run. This saves disk space, but the environments are subject to external modifications.

We could do a one-time installation of the conda environments in the workflow source directory, and they would then be copied into each run directory. A good compromise on disk space would be to have the runs look for environments in the source directory instead of copying them, so the environment state would be shared with other runs but not with other applications/users on the server.
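
To make the compromise concrete, here is a rough sketch of how a run could locate an environment inside the source directory. It is only an illustration, not how RTMet currently works: the `envs/` subfolder, the function names and the `example-env` name are all made up for this example. The only things taken from conda itself are the `conda env create --prefix ... --file ...` and `conda run --prefix ...` invocations, and the fact that an installed environment contains a `conda-meta` folder. The functions only build the commands, they do not execute them.

```python
from pathlib import Path

# Assumption: environments live in <source_dir>/envs/<env_name>/ (hypothetical layout).
ENVS_SUBDIR = "envs"


def env_prefix(source_dir, env_name):
    """Return where an environment would live inside the workflow source directory."""
    return Path(source_dir) / ENVS_SUBDIR / env_name


def is_installed(prefix):
    """An installed conda environment contains a 'conda-meta' folder."""
    return (Path(prefix) / "conda-meta").is_dir()


def install_command(prefix, env_file):
    """Build the one-time installation command (returned, not executed)."""
    return ["conda", "env", "create", "--prefix", str(prefix), "--file", str(env_file)]


def run_command(prefix, command):
    """Build a command that runs `command` inside the shared environment."""
    return ["conda", "run", "--prefix", str(prefix), *command]
```

A run would then call something like `env_prefix(source_dir, "example-env")`, install once if `is_installed` returns `False`, and wrap each tool call with `run_command`. Since every run resolves the same prefix, nothing is copied into the run directories, and the environments stay out of reach of other applications/users on the server.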