esteinig / cerebro

Metagenomic diagnostics stack for low abundance sample types and clinical reporting
GNU General Public License v3.0

Production execution mode #4

Open esteinig opened 9 months ago

esteinig commented 9 months ago

Summary

Semi-automated production runtime for accreditation; this issue tracks the aims and progress needed to complete the production feature on feat/production.

Aims

Stack deployment

File input and storage

Production pipeline

User improvements

Testing modules

Documentation

Steps

Full-stack setup of the Cerebro production environment for continuous operations. For setting up parallel test or development environments, see details in the documentation.

Requirements

Stack setup and verification

Production setup and verification

The production directory and sub-directories are set up on the system - you can read more about the types of production environments currently supported in the documentation.

Here we set up the RUNTIME directory where all workflows are executed, and the INPUT directory where wet-lab staff or laboratory data transfer can deposit the reads and sample sheet to trigger a workflow execution.

# Local paths for runtime and data input
export CEREBRO_BASE_PROD=/data/cerebro/prod
export CEREBRO_INPUT_PROD=/samba/project/cerebro/prod

# Setup the runtime directory where workflows are executed
cerebro production setup-base --directory $CEREBRO_BASE_PROD

# Setup the input directory with a specific team and database upload configuration
cerebro production setup-input --directory $CEREBRO_INPUT_PROD --configuration production --team-name VIDRL --database-name "META-GP Production"

Multiple runtime and input folders can be set up for testing, development or validation configurations. Workflow execution and outputs are configured with specific production variables that ensure results are deposited into the correct team, database and collection.
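As an illustration, a parallel validation environment could be provisioned with the same subcommands shown above, assuming a validation configuration is supported; the paths and names below are placeholders, not prescribed values.

# Hypothetical parallel validation environment (placeholder paths and names)
export CEREBRO_BASE_VAL=/data/cerebro/validation
export CEREBRO_INPUT_VAL=/samba/project/cerebro/validation

# Same setup-base and setup-input subcommands as the production example above
cerebro production setup-base --directory $CEREBRO_BASE_VAL
cerebro production setup-input --directory $CEREBRO_INPUT_VAL --configuration validation --team-name VIDRL --database-name "META-GP Validation"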

Workflow setup and testing

The workflow is set up for production and the production integration tests are run.

# Check workflow help menu as sanity check
nextflow run esteinig/cerebro -r 1.0.0-nata.1 --help

# Provision the accreditation database with Cipher
nextflow run esteinig/cerebro -r 1.0.0-nata.1 -profile mamba -entry cipher --revision 1.0.0-nata.1 --outdir cipher/

# Obtain the access token for the API
export CEREBRO_API_URL="http://api.cerebro.localhost"
export CEREBRO_API_TOKEN=$(cerebro api login -u $CEREBRO_USERNAME -p $CEREBRO_PASSWORD)

# Run workflow integration tests for setup and central nervous system infections
nextflow run esteinig/cerebro -r 1.0.0-nata.1 -profile mamba,ciqa-setup@v1,ciqa-cns@v1

Sample sheet for wet-lab

The current sample sheet is focused on dry-lab operation. We need a user-safe sample sheet template that registers the library identifiers, minimal sample metadata, wet-lab comments, aneuploidy consent, and links to the files in the same input directory.

Initial template: https://github.com/esteinig/cerebro/blob/feat/production/templates/production/SampleSheet.xlsx
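For illustration, the columns might look like the sketch below; the header names and values are hypothetical, not the actual headers of the Excel template.

# Hypothetical sample sheet columns (CSV rendering of the Excel template)
library_id,sample_id,sample_type,wetlab_comment,aneuploidy_consent,fastq_r1,fastq_r2
LIB-001,SAMPLE-A,CSF,low input,yes,LIB-001_R1.fastq.gz,LIB-001_R2.fastq.gz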

Automated watcher and input checks

The sample sheet and FASTQ files (demultiplexed, UMI-removed) are watched and validated in the input folder. Depending on the input configuration file, the watcher runs the production stream and uploads to the specified team-database-collection at the conclusion of the run; different input configuration files (folders) can be watched by different production, test or validation watchers, with outputs deposited into the appropriate database section. The watcher triggers a run of the Nextflow pipeline and sends notifications to Slack.
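A minimal sketch of the watch-and-trigger loop is shown below, assuming inotify-tools is available and a Slack incoming webhook is configured; the pipeline parameters (--production, --sample_sheet) and the SLACK_WEBHOOK_URL variable are assumptions for illustration, since the actual watcher is part of the Cerebro production tooling.

# Watcher sketch (assumptions: inotify-tools installed, hypothetical
# --production/--sample_sheet pipeline parameters, SLACK_WEBHOOK_URL set)
inotifywait -m -e close_write --format '%w%f' "$CEREBRO_INPUT_PROD" | while read -r path; do
    case "$path" in
        *SampleSheet.xlsx)
            # Trigger the production run once the sample sheet is deposited
            nextflow run esteinig/cerebro -r 1.0.0-nata.1 -profile mamba \
                --production --sample_sheet "$path" --outdir "$CEREBRO_BASE_PROD"
            # Notify Slack via an incoming webhook
            curl -s -X POST -H 'Content-type: application/json' \
                --data '{"text":"Cerebro production run triggered"}' "$SLACK_WEBHOOK_URL"
            ;;
    esac
done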

When the pipeline starts, sample identifiers are checked against the team-database-collection to ensure they are unique; the run is registered with the database and samples await confirmation of completion. If a sample identifier already exists in the database, the run fails.
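As a sketch, the uniqueness check could be expressed as an API query like the one below, reusing the token obtained earlier; the endpoint path and query parameters are assumptions, not the documented Cerebro API.

# Hypothetical API query for a duplicate sample identifier (endpoint and
# parameters are illustrative only)
curl -s -H "Authorization: Bearer $CEREBRO_API_TOKEN" \
    "$CEREBRO_API_URL/samples?team=VIDRL&db=META-GP%20Production&id=SAMPLE-A"
# A non-empty result indicates the identifier already exists and the run fails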

Post-workflow sample checks

When the pipeline completes, sample identifiers are collected and validated against the sample identifiers registered for this run. Each module (quality control, classification) is checked for completion in each sample. If a sample did not complete a module for any reason, it is marked accordingly in the database.
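A sketch of this check is shown below; only cerebro api login appears above, so the status subcommand, its flags and the run identifier here are placeholders for illustration.

# Hypothetical per-module completion check (subcommand, flags and run
# identifier are placeholders)
for module in quality-control classification; do
    cerebro api status --run RUN-001 --module $module \
        || echo "Module $module incomplete for one or more samples in RUN-001"
done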

Post-workflow data compilation and upload

After completion, outputs are aggregated into the database models and uploaded into the specified collection via the API.
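A sketch of the upload step, assuming a JSON-over-HTTP API; the endpoint, query parameters and payload file are placeholders, not the documented Cerebro API.

# Hypothetical upload of aggregated database models (endpoint and payload
# are placeholders)
curl -s -X POST \
    -H "Authorization: Bearer $CEREBRO_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data @aggregated_models.json \
    "$CEREBRO_API_URL/cerebro?team=VIDRL&db=META-GP%20Production&collection=run-001"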

Progress

esteinig commented 9 months ago

Some tricky development notes for the multi-stack subdomain deployment: