This repository hosts a small Snakemake pipeline that enables (mostly) automatic builds of reference containers.
Reference containers package related sets of reference data files that are used in computational analyses
(e.g., a reference genome plus index files) in a minimal yet self-sufficient and self-documenting Singularity container.
When looking for solutions to ship large-ish datasets using Singularity containers, one may stumble upon the work of Rioux and colleagues, which is somewhat similar in spirit:
arXiv 2020: Deploying large fixed file datasets with SquashFS and Singularity
Reference implementation on github
There is a Snakemake environment defined in workflow/envs/run_*.yaml. Since this pipeline is assumed to be executed on a machine where the user is root (the most straightforward way to build containers), and retrieving data from cloud hosters usually requires some login or client configuration, this repo cannot provide an out-of-the-box solution for all possible download sources.
As a rule of thumb, if the download works "live" in the shell, then it should also work as part of this pipeline.
Additionally, the following binaries must be available in your $PATH besides the download utilities:

- git
- singularity
- aws: install via sudo apt-get install awscli (tested on Ubuntu 20.04; installs AWS version aws-cli/1.22.34 Python/3.10.4 Linux/5.15.0-47-generic botocore/1.23.34)
- gcloud: use snap for automated updates: snap install google-cloud-sdk --classic (source: cloud.google.com/sdk/docs/downloads-snap)
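As a quick sanity check before running the pipeline, the presence of the required binaries can be verified with a few lines of Python; shutil.which performs the same lookup as the shell's $PATH search. The tool list below is a sketch mirroring the requirements above (aws and gcloud are only needed if you download from the respective cloud hosters):

```python
import shutil

# Tools the pipeline expects in $PATH; aws/gcloud only matter if you
# actually download from AWS or Google Cloud.
REQUIRED_TOOLS = ["git", "singularity", "aws", "gcloud"]

def missing_tools(tools):
    """Return the subset of tools that cannot be found in $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Example: missing_tools(REQUIRED_TOOLS) lists whatever is absent on this machine.
```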
Each reference container has the same internal structure, supports three special commands and, of course, can be inspected using singularity inspect or singularity run-help.
All data are located under /payload inside the container. Each data file can have up to two symlinks created under /payload to enable aliasing of files. For example, the original reference file Homo_Sapiens_assembly38_noalt.fasta may be aliased (symlinked) by just genome.fasta to make working with the reference files easier, especially when using the file names in analysis pipelines.
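The aliasing mechanism is plain symlinking. The sketch below illustrates it using a temporary directory in place of /payload; the file names are taken from the example above, the file content is made up for illustration:

```python
import pathlib
import tempfile

# Stand-in for /payload inside the container.
payload = pathlib.Path(tempfile.mkdtemp())

# The original reference file (content is illustrative only).
original = payload / "Homo_Sapiens_assembly38_noalt.fasta"
original.write_text(">chr1\nACGT\n")

# The alias: a symlink next to the original file.
alias = payload / "genome.fasta"
alias.symlink_to(original)

# Reading via the alias yields the original content,
# and resolving the symlink leads back to the original file.
assert alias.read_text() == original.read_text()
assert alias.resolve() == original.resolve()
```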
./CONTAINER.sif manifest
prints the MANIFEST to stdout (from its location /payload/MANIFEST.tsv).

./CONTAINER.sif readme
prints the README to stdout (from its location /payload/README.txt). Note that a README in the container is optional.

./CONTAINER.sif get REF_FILE_NAME_OR_ALIAS [DESTINATION]
copies the reference file to the current working directory if DESTINATION is omitted, or to DESTINATION otherwise. This command can be used to copy the necessary references to the current analysis directory. Caveat: the container path /payload must be omitted, and the file name must include the file extension.
Note that all of the above commands are just shorthands for singularity run CONTAINER.sif COMMAND. Additionally, since the Singularity container is fully functional, it supports all other common operations (if the required binary is available in the container). For example, to get the uncompressed version of a reference file, one could run the following command:
singularity exec CONTAINER.sif gzip -d -c /payload/REF_FILE_NAME_OR_ALIAS.gz > REF_FILE_NAME_OR_ALIAS
The manifest is a tab-separated text table with a header row. The table columns are as follows:
Referring to a specific file by name or by one of the aliases is equivalent. During the build process of a container, it is checked that no two files specify an identical alias, but note that file names or aliases can be identical between containers. Reference files can be downloaded as part of an archive or in compressed form and be decompressed before copying into the container. Hence, the file name given as the source path may be slightly different from the file name in the container (e.g., having the file extension fasta.gz instead of just fasta).
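Because the exact column set is defined by the MANIFEST.tsv inside a given container, a parser can simply take the columns from the header row. The sketch below does that with csv.DictReader and also mirrors the build-time duplicate-alias check described above; the column names file_name and alias in the sample data are assumptions for illustration, not the guaranteed manifest schema:

```python
import csv
import io

def read_manifest(tsv_text):
    """Parse a MANIFEST.tsv body; column names are taken from the header row."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

def duplicate_aliases(rows, alias_column="alias"):
    """Return aliases declared by more than one file (the build-time check)."""
    seen, duplicates = set(), set()
    for row in rows:
        alias = row.get(alias_column, "")
        if alias and alias in seen:
            duplicates.add(alias)
        seen.add(alias)
    return sorted(duplicates)

# Hypothetical two-column manifest for illustration:
sample = "file_name\talias\nHomo_Sapiens_assembly38_noalt.fasta\tgenome.fasta\n"
rows = read_manifest(sample)
```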
For complete information, please refer to the Singularity documentation:
sylabs.io/guides/3.5/user-guide/build_env.html
Since all reference files will be copied to a temporary location during the build process, the default /tmp/XXX folder can easily run out of space depending on the user's specific system configuration. Cache and temp folders can be configured by setting the environment variables SINGULARITY_CACHEDIR and SINGULARITY_TMPDIR. Passing these variables to the root environment for the build process can be achieved by setting the -E option for sudo:
sudo -E singularity build ...
However, if the root and user cache and temp locations are set to the same folder, then user-level operations, e.g., singularity exec, that attempt to use the cache may run into permission errors. A simple workaround is to set a shell alias for the Singularity build command that specifies separate cache and temp folders on a storage location with sufficient space even for large container builds:
alias buildsif='sudo SINGULARITY_CACHEDIR=/local/large_volume/singularity_build SINGULARITY_TMPDIR=/local/large_volume/singularity_build singularity build'
The requirements to use reference containers in Snakemake workflows are as follows:

- singularity must be available in your $PATH; if Singularity is provided as an env module (e.g., on HPCs), the name of the module can be specified by setting the option singularity_env_module in the Snakemake configuration (by default, the name is set to Singularity)
- the pandas, pytables and hdf5 packages must be installed
- reference files extracted from the containers are placed under references/ in the Snakemake working directory
- derived reference files should be placed under references_derived/, to avoid rule ambiguity

If the above requirements are met, add the following code snippet at the top of your main Snakefile (assuming that you are following standard layout recommendations and your main Snakefile is located in the workflow/ subfolder of your repository):
import pathlib

# path of the refcon include file inside the ref-container repository
refcon_module = pathlib.Path("ref-container/workflow/rules/commons/005_refcon.smk")

refcon_repo_path = config.get("reference_container_store", None)
if refcon_repo_path is None:
    # default: the ref-container repo sits next to this workflow repository
    refcon_repo_path = pathlib.Path(workflow.basedir).parent.parent
else:
    refcon_repo_path = pathlib.Path(refcon_repo_path)
assert refcon_repo_path.is_dir()

refcon_include_module = refcon_repo_path / refcon_module

include: refcon_include_module

[rest of the Snakefile]
The above enables you to either specify the top-level path where you cloned the ref-container repository as part of your Snakemake configuration, or to simply put the ref-container repository next to your workflow repository as follows:
$ ls
ref-container/
your-pipeline/
In your Snakemake configuration, you need to set the folder name where the reference containers are stored...
reference_container_store: PATH_TO_THE_CONTAINER_FOLDER
...and list the containers to use:
reference_container_store: PATH_TO_THE_CONTAINER_FOLDER
reference_container_names:
- ref_container1
- ref_container2
- ref_container3
The reference container module included above will automatically retrieve requested reference files from the containers, or raise an error if a file cannot be found or is not unambiguously identifiable.
To document which files have been used in your workflow, you can copy/archive the manifest files of the containers
that are cached in your pipeline working directory under cache/refcon/
at the end of your analysis run.
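Archiving the cached manifests can be scripted in a few lines. In the sketch below, the cache/refcon/ location comes from the text above, while the destination folder and the *.tsv file pattern are assumptions for illustration; adjust them to match the actual file names in your working directory:

```python
import pathlib
import shutil

def archive_manifests(cache_dir, dest_dir, pattern="*.tsv"):
    """Copy cached container manifests into an archive folder.

    The file pattern is an assumption; adapt it to the actual
    names of the cached manifest files under cache/refcon/.
    """
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for manifest in sorted(pathlib.Path(cache_dir).glob(pattern)):
        shutil.copy2(manifest, dest / manifest.name)
        copied.append(manifest.name)
    return copied

# Usage (paths illustrative):
# archive_manifests("cache/refcon", "results/provenance")
```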