MiDataInt / mdi-pipelines-framework

Stage 1 pipelines framework for the Michigan Data Interface
MIT License

Singularity containers to maximize portability and launch ease #17

Closed: wilsonte-umich closed this issue 2 years ago

wilsonte-umich commented 2 years ago

The challenges

Portability

The MDI uses conda environments to give Stage 1 pipelines rigorously defined, version-controlled software dependencies. However, that is sometimes not enough to guarantee portability from system to system.

In particular, the Linux version and installed system libraries – which live at a lower level than installed programs (but higher than the Linux kernel) – can create inconsistencies in execution. Also, developers sometimes forget to enumerate – or simply aren't aware of – important software on their system required to make their pipeline work.

Launch ease

A lesser but sometimes irritating limitation is that, because the MDI insists on conda for environment consistency, end users must run a somewhat slow and sometimes confusing process to create the conda environments required to run a pipeline. This might limit adoption by some users.

The solution = containers

To quote Docker, a "container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another." Importantly, that packaged software includes the Linux version and system libraries mentioned above.

Thus, pipelines with portability challenges can usually be made to run consistently within the even more controlled environment of a properly configured container. Furthermore, the 'conda create' command can already have been run while building such a container, making pipelines immediately usable.

Singularity vs. Docker

The best-known container platform is Docker, but it requires elevated system privileges to run containers and so is uniformly disallowed on the HPC servers generally required by Stage 1 pipelines. In contrast, Singularity is a container platform that addresses this issue and so is specifically intended for HPC use. Accordingly, the MDI will use Singularity.

Task 1 - add the option to run pipelines in a container

MDI pipeline definitions should be enhanced to include:

1) a pipeline.yml section called 'singularity' for developers to declare whether container encapsulation is:

Thus, pipeline.yml might look like:

singularity:
  type: available
  url: oras://ghcr.io/user/container

Then the pipelines framework, mostly 'execute.pl', should be modified to run the pipeline using the indicated container, built from 'singularity.def', when required or requested. These few changes would let developers tell users clearly whether a container is offered and where to find it. A template of singularity.def with helpful comments will guide developers less familiar with container definitions.
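As a rough illustration of the "when required or requested" logic, here is a minimal sketch. It is written in Python for readability, not in the framework's Perl, and is not the actual 'execute.pl' code. The type value 'required', the function names, the image file name 'container.sif' and the command 'bash pipeline.sh' are all assumptions made for the example; only the 'available' value and the 'url' key come from the pipeline.yml example above. 'singularity exec' is the standard Singularity subcommand for running a command inside an image.

```python
import shlex


def use_container(section, user_requested):
    """Decide whether a pipeline run should happen inside a container.

    `section` is the parsed 'singularity' mapping from pipeline.yml, or None
    when the pipeline declares no container support. The value 'required' is
    a hypothetical counterpart to the 'available' value shown above.
    """
    if not section:
        return False
    kind = section.get("type")
    if kind == "required":
        return True  # developer insists on the container
    if kind == "available":
        return bool(user_requested)  # user opts in
    return False


def container_command(image_path, pipeline_command):
    """Wrap a pipeline command so that it runs via 'singularity exec'."""
    return ["singularity", "exec", image_path] + list(pipeline_command)


section = {"type": "available", "url": "oras://ghcr.io/user/container"}
if use_container(section, user_requested=True):
    # 'container.sif' and 'bash pipeline.sh' are placeholder names.
    cmd = container_command("container.sif", ["bash", "pipeline.sh"])
    print(shlex.join(cmd))
```

Keeping the decision separate from the command construction means the same check could serve both a developer-mandated container and a user opt-in.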

Task 2 - create a repo/AWS helper image for creating pipeline container images

Singularity does require elevated privileges to build container images, so on many HPC servers only administrators can do such building. Fortunately, containers can be built on another computer and then downloaded to a user's HPC server.

Thus, the MDI should create a repository called 'mdi-container-builder' to help developers build containers on their own Linux computer, or, even better, to quickly launch an AWS instance from a public AMI with mdi-container-builder and Singularity already installed. Developers could use a single, simple command to git clone and then singularity build a complete container for a specific MDI tools suite.
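The build-elsewhere, download-to-HPC workflow could look roughly like the sketch below, which only assembles and prints the commands and does not run them. It is not the actual mdi-container-builder code. The suite repository URL, the image name 'container.sif', and the assumption that 'singularity.def' sits at the top level of the cloned repository are hypothetical. The 'singularity build', 'singularity push' and 'singularity pull' subcommands are standard Singularity CLI.

```python
import shlex


def builder_commands(suite_repo_url, registry_url, image="container.sif",
                     def_file="singularity.def"):
    """Commands a developer might run on a machine where they have root."""
    # Directory that 'git clone' creates, e.g. '.../my-suite.git' -> 'my-suite'.
    repo_dir = suite_repo_url.rstrip("/").rsplit("/", 1)[-1]
    if repo_dir.endswith(".git"):
        repo_dir = repo_dir[: -len(".git")]
    return [
        ["git", "clone", suite_repo_url],
        # Building from a definition file generally needs elevated privileges.
        ["sudo", "singularity", "build", image, f"{repo_dir}/{def_file}"],
        # Publish the image so that HPC users can download it.
        ["singularity", "push", image, registry_url],
    ]


def user_commands(registry_url, image="container.sif"):
    """Command an end user might run on the HPC server; no root needed."""
    return [["singularity", "pull", image, registry_url]]


registry = "oras://ghcr.io/user/container"
# The repository URL below is a placeholder, not a real tools suite.
for cmd in builder_commands("https://github.com/user/my-suite.git", registry):
    print(shlex.join(cmd))
for cmd in user_commands(registry):
    print(shlex.join(cmd))
```

Only the build step needs root, which is why it can be moved to a developer's own machine or an AWS instance while users simply pull the finished image.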

Additional benefits of containers

The same issues enumerated above can arise not only for pipelines originally written in the MDI, but also for pipelines written elsewhere that a developer would like to offer through the MDI. With enhanced container support, it should be straightforward for MDI developers to offer tools suites optimized for running third-party pipelines (as opposed to constructing a new pipeline that calls programs directly).

wilsonte-umich commented 2 years ago

This issue required careful integration with version control, which is critical because containers are created per pipeline and tagged with pipeline versions. Both Singularity container support and pipeline version control are now available, with minor logistical differences from the initial vision in the issue above.

See the associated release tag for a summary:

Here is the link to the container-builder repo to quickly set up AWS instances for building MDI containers: