NAIC Accelerated Genomics is a scalable and reproducible suite of GPU-enabled Next Generation Sequencing (NGS) analysis workflows/pipelines. It is implemented using the Clara Parabricks software, which is developed by NVIDIA. The main purpose of this suite is to accelerate NGS analysis using GPU platforms. Additionally, the suite provides CPU-based pipelines for benchmarking and quality control (QC) purposes. This documentation provides a user guide for installing and using the NAIC Accelerated Genomics suite.
├── LICENSE
├── README.md <- The top-level README for developers using this project
├── data <- Data directory
│ ├── external_testdata <- Test-data from third-party sources
│ ├── internal_testdata <- Intermediate Test-data
│ └── processed <- The final, canonical data sets for modeling
│
├── tools <- Bioinformatics tools used in NVIDIA Parabricks
│
├── references <- User guide, manuals, and all other explanatory materials
├── workflows <- Collection of scalable and reproducible workflows/pipelines that automate complex NGS raw-data processing tasks
│
├── notebooks <- Jupyter notebooks with detailed analysis (NOTE: NOT YET IMPLEMENTED)
│
└── reports <- Generated analysis as HTML, PDF, LaTeX, etc. (NOTE: NOT YET IMPLEMENTED)
  └── figures <- Generated graphics and figures to be used in reporting
NAIC Accelerated Genomics is a suite of complex NGS analysis pipelines that can be used to process and analyze NGS data. The suite provides GPU-based NGS pipelines for accelerated analysis, as well as CPU-based pipelines for benchmarking and quality control (QC) purposes. The following pipelines are available in the suite:
Clone the NAIC Accelerated Genomics GitHub repository:
git clone git@github.com:NAICNO/accelerated_genomics.git
NGS analysis workflows are available in accelerated_genomics/workflows.
For detailed instructions on how to use the different pipelines in the NAIC Accelerated Genomics suite, please refer to the respective pipeline documentation provided in the Implementation details section.
NAIC Accelerated Genomics has the following dependencies that need to be installed before running the pipelines:
wget -qO- https://get.nextflow.io | bash
Java: NextFlow recommends Java version 11 up to 20 for executing and managing tasks. Please ensure that you have one of these Java versions installed. Note that the NGS analysis processes use the Java version available in the corresponding containers.
Docker and/or Singularity: NAIC Accelerated Genomics ensures reproducibility through self-contained process execution and strict software version control. Docker or Singularity can be used for this purpose. The tools page provides information on the Docker and Singularity images used in the pipelines. If you run the pipelines via Singularity, set the singularityDir variable in both conf/singularity.conf and conf/slurm.conf to the path of the directory containing the Singularity images.
def singularityDir = '<path to directory containing singularity images>' // conf/singularity.conf & conf/slurm.conf
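The Java and Singularity steps above can be checked and applied with a small shell helper. What follows is a minimal sketch under stated assumptions, not part of the suite: the function names java_major, java_supported and set_singularity_dir are hypothetical, the 11 to 20 range is taken from the Java note above, and the config files are assumed to contain a line of the exact form def singularityDir = '...'. The image directory must not contain the characters | or &, because they are special to the sed expression used here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; none of these names come from the NAIC repository.

# Print the major Java version for strings such as "17.0.9" or "1.8.0_392".
java_major() {
    version="$1"
    major="${version%%.*}"
    # Legacy scheme: "1.8.0_392" denotes Java 8.
    if [ "$major" = "1" ]; then
        major="${version#1.}"
        major="${major%%.*}"
    fi
    # Drop any non-numeric suffix such as "-ea".
    major="${major%%[!0-9]*}"
    printf '%s\n' "$major"
}

# Succeed when the major version lies in the 11..20 range stated above.
java_supported() {
    checked_major="$(java_major "$1")"
    [ -n "$checked_major" ] && [ "$checked_major" -ge 11 ] && [ "$checked_major" -le 20 ]
}

# Rewrite the quoted value on the singularityDir line of each given config file.
# Assumes the line looks like: def singularityDir = '...'
set_singularity_dir() {
    image_dir="$1"
    shift
    for conf_file in "$@"; do
        tmp_file="$(mktemp)" || return 1
        # '|' is the sed delimiter so that slashes in the path need no escaping.
        sed "s|^\([[:space:]]*def singularityDir = \)'[^']*'|\1'${image_dir}'|" \
            "$conf_file" > "$tmp_file" || return 1
        # Copy the content back so the original file permissions are kept.
        cat "$tmp_file" > "$conf_file" && rm -f "$tmp_file"
    done
}

# Report on the local Java installation, if one is on the PATH.
if command -v java >/dev/null 2>&1; then
    installed="$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')"
    if java_supported "$installed"; then
        echo "Java $installed is within the recommended range"
    else
        echo "Java $installed is outside the recommended range (11 to 20)"
    fi
fi

# Example call (the image path is a placeholder):
# set_singularity_dir /path/to/singularity_images conf/singularity.conf conf/slurm.conf
```

Running the example call from the accelerated_genomics directory would update both config files in one step; inspect the files afterwards, since a line that deviates from the assumed form is left unchanged.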
The software and hardware requirements for deploying the accelerated NGS pipelines, as outlined in the Parabricks v4.0.0 guide:
Minimum System Specifications per GPU Configuration
Software Requirements
For various analyses, the NGS pipelines utilize specific reference datasets. Detailed information is available in the references readme page of the documentation.
The following pipelines can be executed using a set of NextFlow commands:
These pipelines can be run on different execution platforms, which are selectable via the NextFlow -profile command-line option:
-profile singularity: for executing with the Singularity platform
-profile docker: for running on the Docker platform
-profile slurm: for use with the SLURM executor
To conduct germline sequence analysis using GPUs, execute the following command:
./nextflow run \
germline_pipeline.nf \
-profile <PROFILE> \
--fastq_folder <"path to the directory with raw sequence data"> \
--genome_folder <"path to the directory with reference data"> \
--genome_json <JSON listing reference files> \
--processor GPU \
--target_regions <"path to the target region file"> \
-with-report \
-with-trace \
-resume
For a comprehensive overview, refer to the Germline pipeline page, which includes a workflow diagram and description of processes. For details on NVIDIA's Clara Parabricks, consult the Parabricks readme page.
To initiate the CPU-based germline sequence analysis, use the command below with the --processor CPU parameter:
./nextflow run \
germline_pipeline.nf \
-profile <PROFILE> \
--fastq_folder <"path to the directory with raw sequence data"> \
--genome_folder <"path to the directory with reference data"> \
--genome_json <JSON listing reference files> \
--processor CPU \
--target_regions <"path to the target region file"> \
-with-report \
-with-trace \
-resume
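The GPU and CPU germline commands above differ only in the --processor value, so both can be driven from one wrapper. The following is a sketch rather than a script shipped with the suite: the function name run_germline, the DRY_RUN switch, the environment variable names and every default path are hypothetical placeholders. By default it only prints the command it would run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the germline command shown above.
# All default paths are placeholders (assumptions), not files shipped with the suite.
PROFILE="${PROFILE:-singularity}"                      # singularity, docker or slurm
FASTQ_DIR="${FASTQ_DIR:-/data/fastq}"
GENOME_DIR="${GENOME_DIR:-/data/reference}"
GENOME_JSON="${GENOME_JSON:-reference_data.json}"
TARGET_REGIONS="${TARGET_REGIONS:-/data/targets.bed}"

run_germline() {
    processor="$1"
    # Only the two values documented above are accepted.
    case "$processor" in
        GPU|CPU) ;;
        *) echo "processor must be GPU or CPU, got: $processor" >&2; return 1 ;;
    esac
    # Build the argument list as positional parameters so quoting is preserved.
    set -- ./nextflow run germline_pipeline.nf \
        -profile "$PROFILE" \
        --fastq_folder "$FASTQ_DIR" \
        --genome_folder "$GENOME_DIR" \
        --genome_json "$GENOME_JSON" \
        --processor "$processor" \
        --target_regions "$TARGET_REGIONS" \
        -with-report \
        -with-trace \
        -resume
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: print the command instead of executing it.
        printf '%s ' "$@"
        printf '\n'
    else
        "$@"
    fi
}

run_germline GPU
run_germline CPU
```

Setting DRY_RUN=0 in the environment makes the wrapper execute the commands instead of printing them; keeping the dry run as the default avoids launching a long analysis by accident.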
Perform quality control assessments on NGS data using the following command:
./nextflow run qc_cpu.nf \
-profile <PROFILE> \
--fastq_folder <"path to the directory with raw sequence data"> \
--sample_name <name of the sample> \
--target_regions <"path to the target region file"> \
--bam_path <"path to the alignment file - BAM file"> \
--bai_path <"path to the alignment-index file - BAI file"> \
--genome_folder <"path to the directory with reference data"> \
--genome_json <JSON listing reference files> \
-with-report \
-with-trace \
-resume
For more details, refer to the QC pipeline page, which includes a workflow chart and a description of the included processes.
To evaluate outcomes from both GPU and CPU germline pipelines, execute the following:
./nextflow run vc_eval.nf \
-profile <PROFILE> \
--sample_name NA12878_35x \
--VC_GPU <VCF from GPU pipeline> \
--VC_CPU <VCF from CPU pipeline> \
--genome_folder <REFERENCE> \
--genome_json reference_data.json \
--target_regions <TARGET REGIONS> \
--diff_map_regions <DIFFICULT TO MAP REGIONS provided by precisionFDA> \
--functional_regions <FUNCTIONAL REGIONS provided by precisionFDA> \
-with-trace
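Because the evaluation command takes several input files, it can help to confirm that each one exists before launching NextFlow. The helper below is a generic sketch, assuming nothing about the suite itself: require_files is a hypothetical name, and the file names in the example comment are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper: report every path that is missing or empty.
require_files() {
    missing=0
    for input_path in "$@"; do
        # -s is true only for an existing path with a size greater than zero.
        if [ ! -s "$input_path" ]; then
            echo "missing or empty: $input_path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (placeholder file names); run the vc_eval.nf command only if this succeeds:
# require_files gpu_calls.vcf.gz cpu_calls.vcf.gz reference_data.json
```

The function checks all paths before returning, so a single run lists every missing input rather than stopping at the first one.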