core-unit-bioinformatics / reference-container

Build repository for reference container
MIT License

General deployment strategy for reference data files #11

Open ptrebert opened 1 year ago

ptrebert commented 1 year ago

We need to develop a general strategy to deploy reference data files - or potentially also database dumps - on any compute infrastructure.

Requirements:

  1. reference data are strictly "read-only"; any deployment strategy must honor that and should ideally prevent accidental changes
  2. the strategy must not assume that the target infrastructure has general internet access
  3. integration into standard workflows should be possible but must not be hard-wired. Third-party users must be able to make use of the workflows in their own environment (coming with its own set of restrictions)
  4. it must be trivial to move all reference datasets between different infrastructures
  5. the whole setup must provide a minimal amount of self-documentation, specifying the exact data sources and the version of the deployed reference data package
  6. the deployment strategy should enable bundling reference data files that are commonly used in conjunction
  7. the deployment strategy must be workflow language-agnostic

Although the current Snakemake pipeline implementing the creation of reference containers needs refactoring to simplify usage, the reference containers themselves fulfill all of the above requirements.

The question: the major point to discuss is how to find a solution that enables transparent, standardized integration into various existing Nextflow workflows. This might entail choosing an altogether different approach.

sci-kai commented 1 year ago

The nf-core workflows have a common, mostly standardized way of handling reference genome data: https://nf-co.re/docs/usage/reference_genomes This includes the pre-built reference assemblies from Illumina "iGenomes" that are available to Nextflow through an S3 bucket. However, besides these pre-built reference assemblies, the pipelines also use workflow-specific resources. For example, Sarek needs a downloaded VEP cache file for VCF annotation, and the rnafusion workflow needs other specialized references for its tools.
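
For reference, this is roughly how an nf-core pipeline is pointed at one of the pre-built iGenomes references (parameter names as in the linked usage docs; the pipeline, sample sheet and profile are just examples):

```bash
# select a pre-built iGenomes reference set by its key
nextflow run nf-core/rnaseq \
    --input samplesheet.csv \
    --genome GRCh38 \
    -profile singularity
```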

The iGenomes reference sets and these special resources can be packaged into their own reference containers. A strategy could be to copy the reference files from the containers into the local analysis directories prior to running the Nextflow workflows, e.g., with the ./CONTAINER.sif get REF_FILE_NAME_OR_ALIAS [DESTINATION] command.
The local reference files can be deleted after the analysis to save disk space.
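
As a rough sketch (container name, reference aliases, file names and pipeline parameters are placeholders):

```bash
# stage references out of the reference container into the analysis directory
./REFCONTAINER.sif get GRCh38_genome_fasta ./refs/
./REFCONTAINER.sif get GRCh38_gencode_gtf ./refs/

# run the workflow against the local copies
nextflow run nf-core/rnaseq \
    --input samplesheet.csv \
    --fasta ./refs/GRCh38.primary_assembly.genome.fa \
    --gtf ./refs/gencode.annotation.gtf \
    -profile singularity

# free the disk space again after the analysis
rm -r ./refs/
```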

Another option would be to implement these copy and delete steps as a separate module that is integrated into the Nextflow workflows. This may require more implementation effort, but could automate the installation process. For example, the rnafusion workflow supports a separate step via the "--build_reference" flag to automatically download reference files prior to running the workflow (https://nf-co.re/rnafusion/2.1.0/usage#1--download-and-build-references). Unpacking and configuring the reference container might work in a similar way.
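
A minimal sketch of what such a staging module could look like (Nextflow DSL2; the "get" interface is the one mentioned above, the container path and aliases are placeholders):

```groovy
// hypothetical staging module: one task per requested reference alias
process STAGE_REFERENCE {
    input:
    val ref_alias

    output:
    path "refs/*"

    script:
    """
    mkdir -p refs
    /path/to/REFCONTAINER.sif get ${ref_alias} refs/
    """
}

workflow {
    STAGE_REFERENCE(Channel.of('GRCh38_fasta', 'GRCh38_gencode_gtf'))
}
```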

svenbioinf commented 1 year ago

So like Kai, I also differentiate between 1) common references (that are used by multiple pipelines, like genome files and transcriptome files) and 2) specialized references (specific to a single pipeline, like a VEP cache or VCF files).

What I don't like is the part about "copying the required files into local directories and deleting them afterwards". Can't we let the workflows access those common references at a central location, without copying to and deleting from a local location?

ptrebert commented 1 year ago

So like Kai, I also differentiate between

1. common references (that are used by multiple pipelines like genome files, transcriptome files), and

2. specialized references (specific for a single pipeline like VEP cache, VCF)

There is no such thing as a specialized reference when it's downloaded from an external resource and used as-is. For example, using VEP came up several times in the past couple of months in the lab. It's just quite cumbersome to deploy offline, but commonly requested as part of the analysis. I don't see a compelling reason why VEP would be used only in exactly one workflow? That level of abstraction seems desirable, but potentially quite hard to achieve in a practical setting.

A genuinely special (as in: not expected to be useful in other contexts) reference file should be derived as part of the workflow itself. For example, if my workflow requires that the gene model (say, GENCODE) only consists of pseudo-genes and nothing else, filtering the downloaded GENCODE files to just contain pseudo-genes should happen as part of the workflow. (Note that GENCODE provides dedicated files, e.g., for "basic" gene annotation, or only protein-coding transcript sequences; quite likely that many people just need that and nothing else.)
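
To illustrate with a rough sketch (the GENCODE release, file names and the gene_type pattern are placeholders; check the release you actually use):

```bash
# derive the "special" reference inside the workflow itself:
# reduce a downloaded GENCODE GTF to pseudo-gene records only
zcat gencode.v44.annotation.gtf.gz \
    | awk -F '\t' '$0 ~ /^#/ || $9 ~ /gene_type "[^"]*pseudogene"/' \
    | gzip > gencode.v44.pseudogenes.gtf.gz
```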

What I don't like is the part about "copying the required files into local directories and deleting them afterwards". Can't we let the workflows access those common references at a central location, without copying to and deleting from a local location?

What would be your proposed solution to achieve that?

svenbioinf commented 1 year ago

So like Kai, I also differentiate between

1. common references (that are used by multiple pipelines like genome files, transcriptome files), and

2. specialized references (specific for a single pipeline like VEP cache, VCF)

There is no such thing as a specialized reference when it's downloaded from an external resource and used as-is. For example, using VEP came up several times in the past couple of months in the lab. It's just quite cumbersome to deploy offline, but commonly requested as part of the analysis. I don't see a compelling reason why VEP would be used only in exactly one workflow? That level of abstraction seems desirable, but potentially quite hard to achieve in a practical setting.

Alright, so no more workflow-specific references in local folders, because as far as I know that is how we use Nextflow currently.

And concerning my solution: Kai and I talked about that briefly this week, and the idea came up to work inside a CUBI Singularity container that holds all references and then start Nextflow from there, which in turn starts its own Singularity containers. However, I am not sure whether that works - a container within a container. So far I have only been using Singularity containers, not writing my own.

sci-kai commented 1 year ago

So like Kai, I also differentiate between

1. common references (that are used by multiple pipelines like genome files, transcriptome files), and

2. specialized references (specific for a single pipeline like VEP cache, VCF)

There is no such thing as a specialized reference when it's downloaded from an external resource and used as-is. For example, using VEP came up several times in the past couple of months in the lab. It's just quite cumbersome to deploy offline, but commonly requested as part of the analysis. I don't see a compelling reason why VEP would be used only in exactly one workflow? That level of abstraction seems desirable, but potentially quite hard to achieve in a practical setting.

A genuinely special (as in: not expected to be useful in other contexts) reference file should be derived as part of the workflow itself. For example, if my workflow requires that the gene model (say, GENCODE) only consists of pseudo-genes and nothing else, filtering the downloaded GENCODE files to just contain pseudo-genes should happen as part of the workflow. (Note that GENCODE provides dedicated files, e.g., for "basic" gene annotation, or only protein-coding transcript sequences; quite likely that many people just need that and nothing else.)

That is right, these resources are not workflow-specific. I rather wanted to say that specific nf-core workflows require additional resources, while all of them accept the pre-built iGenomes sets. The proper distinction is between pre-built reference sets (e.g., the iGenomes database or VEP caches) and single resources and databases (like COSMIC, gnomAD, etc.). When developing a new workflow from scratch, it is easier to control and update database versions by integrating single resources through such containers. However, the pre-built sets are easier to deploy and are already integrated into the nf-core workflows. Hence, I would suggest also maintaining Singularity containers with mirrors of these sets, as they are not easily replaceable without changing the base code of these workflows.

What I don't like is the part about "copying the required files into local directories and deleting them afterwards". Can't we let the workflows access those common references at a central location, without copying to and deleting from a local location?

Nextflow can work with Singularity containers and actually needs to download a number of containers with the required software tools to run properly (also another topic: whether we should maintain a copy of these software containers to deploy the workflows offline). So it should be possible to implement direct access to the reference container, but I don't know yet how to code that, and it probably involves changing the base code of the workflows (which may hamper integration of future updates from nf-core).
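
One hedged idea for the "direct access" part that would not touch the workflow code itself: mount a central, read-only reference location into every task container via the Nextflow Singularity options and point the pipeline parameters there (host path, mount point and file name are placeholders):

```groovy
// nextflow.config sketch: read-only access to a central reference location
singularity {
    enabled    = true
    autoMounts = true
    runOptions = '--bind /data/references:/references:ro'
}

params {
    // point the pipeline parameters at the mounted location, e.g.:
    fasta = '/references/GRCh38/GRCh38.primary_assembly.genome.fa'
}
```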

ptrebert commented 1 year ago

It seems I need to see a hands-on example on using these iGenome sets and also some info about what these files/sets are, how they are identified and deployed etc.

If these are already properly packaged in some way and we could maintain a local (offline mirror) resource with those datasets as basic starting point for the nf-core workflows, then that seems to check enough of the above points to be marked as done.

For the other reference files that are not derived (i.e., they are downloaded from an online resource and placed into the working directory of the workflow), a simple solution would be to define some metadata information per workflow (in the CUBI fork) and add a mini setup workflow* (sketched below) that takes care of copying the files into place from the local (offline) resource. This would keep upstream compatibility and should fall into the domain of "automate the boring stuff". Kai, didn't you say you already do that, just in the form of a bash script at the moment?

*Realized as a Nextflow workflow, this would be a very simple way to gain some experience with writing Nextflow workflows.
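
A rough sketch of what that mini setup workflow could look like (the metadata format, a small CSV with name and source path columns, and all paths are assumptions):

```groovy
// copy per-workflow reference files from the local (offline) mirror into place
params.ref_sheet  = 'references.csv'   // columns: name,source_path
params.target_dir = 'references'

process COPY_REFERENCE {
    publishDir params.target_dir, mode: 'copy'

    input:
    tuple val(name), path(src)

    output:
    path "staged/*"

    script:
    """
    mkdir -p staged
    cp -L ${src} staged/
    """
}

workflow {
    Channel.fromPath(params.ref_sheet)
        | splitCsv(header: true)
        | map { row -> tuple(row.name, file(row.source_path)) }
        | COPY_REFERENCE
}
```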

edit: following the link that Kai posted, I am greeted with a "warning" that the annotations in many of the iGenomes sets are completely outdated. Who is maintaining these iGenomes sets? It reads like no one is in charge.

svenbioinf commented 1 year ago

So Phil Ewels took a copy of the iGenomes, uploaded them to an AWS S3 bucket, and added additional files like indexes for STAR, all done with a time-limited fund. It is 5 TB in size. As a suggestion until we have a better solution: why don't we just make a copy of the relevant species from the AWS S3 bucket and store them in a shared folder on the HPC? nf-core pipelines can then be manually pointed to that folder for reference files.
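
Roughly like this (bucket layout as documented by nf-core, the local path is a placeholder; if I read the docs correctly, the pipelines expose an --igenomes_base parameter for exactly this):

```bash
# mirror only the relevant species/build into a shared folder on the HPC
aws s3 sync --no-sign-request \
    s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/ \
    /shared/references/igenomes/Homo_sapiens/GATK/GRCh38/

# point an nf-core pipeline at the local mirror instead of the S3 bucket
nextflow run nf-core/sarek \
    --input samplesheet.csv \
    --genome GATK.GRCh38 \
    --igenomes_base /shared/references/igenomes \
    -profile singularity
```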

ptrebert commented 1 year ago

all done with a time-limited fund.

That means the S3 resource is vanishing at some point?

So, taken as a whole, the iGenomes resource seems like a good candidate for local mirroring with an enforced read-only state. That way, all groups could access the reference data on the local infrastructure, and a central authority (= Research IT) would manage/regulate it. I'll talk to them about this use case.
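
For a plain shared folder, something as simple as this would at least prevent accidental changes (a read-only export or filesystem snapshot managed by Research IT would be stricter; the path is a placeholder):

```bash
# strip write permissions from the local iGenomes mirror for everyone
chmod -R a-w /shared/references/igenomes
```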

svenbioinf commented 1 year ago

I don't see an immediate licence file. iGenomes is maintained by Illumina, so I guess their usual regulations apply there as well - all guesswork on my part, however.

On Phil"s github he says: "AWS has agreed to host up to 8TB data for AWS-iGenomes dataset until at least 28th October 2022. The resource has been renewed once so far and I hope that it will continue to be renewed for the forseeable future."

Yes, I think it might be a good idea to approach Research IT for a local read-only mirror.