NYU-Molecular-Pathology / NGS580-nf

Target exome sequencing analysis for NYU NGS580 gene panel
GNU General Public License v3.0
10 stars 6 forks source link
docker exome exome-sequencing-analysis nextflow pipeline singularity

NGS580-nf

Target exome analysis for 580 gene panel (NGS580)

NOTE: Details listed here may change during development

Overview

This pipeline is designed to run targeted exome analysis on Illumina Next-Gen sequencing genomic data, in support of the NGS580 cancer diagnostic panel for NYU's Molecular Pathology Department.

This pipeline starts from paired-end fastq data (.fastq.gz), and is meant to accompany the output from the Illumina demultiplexing pipeline listed here: https://github.com/NYU-Molecular-Pathology/demux-nf.

The NGS580-nf analysis workflow includes read trimming, QC, alignment, variant calling, annotation, and reporting, along with many other steps.

Contents

Some key pipeline components included in this repository:

Pipeline Items

Some key components that are created during setup, configuration, and execution of the pipeline:

Setup

This repository should first be cloned from GitHub:

git clone --recursive https://github.com/NYU-Molecular-Pathology/NGS580-nf.git
cd NGS580-nf

Reference Data

Nextflow pipelines have been included for downloading required reference data, including ANNOVAR reference databases. You can run them with the following command:

make setup

HapMap Pool .bam

A negative control HapMap pool .bam file can be prepared using the following command:

make hapmap-pool

This file is typically built from multiple HapMap samples previously aligned by this pipeline. For demonstration purposes, you can provide any .bam and .bai files.

The HapMap Pool files to be used in the pipeline should be set under the HapMapBam and HapMapBai keys of config.json.

CNV Pool

A control normal sample .cnn file for CNV calling can be prepared using the following command:

make cnv-pool

This file is typically built from .bam files of specially chosen normal tissue sequencing samples previously aligned by this pipeline. For demonstration purposes, you can create the .cnn file from any desired .bam file. Note that the targets .bed file used to create the .cnn file must match the targets used in the rest of the pipeline.

The .cnn file to be used in the pipeline should be set under the CNVPool key in config.json.

Containers

The containers directory contains instructions and recipes for building the Docker and Singularity containers used in the pipeline.

Docker is typically used for local container development, while Singularity containers are used on the NYU Big Purple HPC cluster. The current pipeline configuration for Big Purple uses .simg files stored in a common location on the file system.

Usage

The pipeline is designed to start from demultiplexed paired end .fastq.gz files, with sample ID, tumor ID, and matched normal ID associations defined for each set of R1 and R2 .fastq file using a file samples.analysis.tsv (example included at example/samples.analysis.tsv).

Deployment

The easiset way to use the pipeline is to "deploy" a new instance of it based on output from the demultiplexing pipeline demux-nf. This will automatically propagate configurations and information from the demultiplexing output.

The pipeline can also deploy a new, pre-configured copy of itself using the included deploy recipe:

make deploy PRODDIR=/path/to/NGS580_analyses RUNID=Name_for_analysis FASTQDIR=/path/to/fastq_files

Create Config

A file config.json is required to hold settings for the pipeline. It should be created using the built-in methods:

make config RUNID=my_run_ID FASTQDIR=/path/to/fastqs

or

make config RUNID=my_run_ID FASTQDIRS='/path/to/fastqs1 /path/to/fastqs2'

or

cp .config.json config.json

and then simply edit the new config.json and update the items to match your pipeline settings.

Once created, the config.json file can be updated manually as needed. The template and default values can be viewed in the included .config.json file.

Create Samplesheet

A samplesheet file samples.analysis.tsv is required in order to define the input samples and their associated .fastq files (example included at example/samples.analysis.tsv). Create a samplesheet, based on the config file, using the built-in methods:

make samplesheet

Sample Pairs (Optional)

The NGS580-nf pipeline has special processing for tumor-normal pairs. These pairs should be defined in the samples.analysis.tsv file, by listing the matched Normal sample for each applicable sample.

In order to update samples.analysis.tsv automatically with these sample pairs, an extra samplesheet can be provided with the tumor-normal pairs.

Create a samples.tumor.normal.csv samplesheet (example included at example/samples.tumor.normal.csv) with the tumor-normal groupings for your samples, and update the original samplesheet with it by running the following script:

python update-samplesheets.py --tumor-normal-sheet samples.tumor.normal.csv

If a demultiplexing samplesheet with extra tumor-normal pairs information was supplied (see example: example/demux-SampleSheet.csv), then it can be used to update the samplesheet with pairs information with the following recipe:

make pairs PAIRS_SHEET=demux-SampleSheet.csv PAIRS_MODE=demux

Run

The pipeline includes an auto-run functionality that attempts to determine the best configuration to use for NYU phoenix and Big Purple HPC clusters:

make run

This will run the pipeline in the current session.

In order to run the pipeline in the background as a job on NYU's Big Purple HPC, you should instead use the submit recipe:

make submit SUBQ=fn_medium

Where SUBQ is the name of the SLURM queue you wish to use.

Refer to the Makefile for more run options.

Due to the scale of the pipeline, a "local" run option is not currently configured, but can be set up easily based on the details shown in the Makefile and nextflow.config.

Extra Parameters

You can supply extra parameters for Nextflow by using the EP variable included in the Makefile, like this:

make run EP='--runID 180320_NB501073_0037_AH55F3BGX5

Demo

A demo dataset can be loaded using the following command:

make demo

This will:

You can then proceed to run the analysis with the commands described above.

More Functionality

Extra functions included in the Makefile for pipeline management include:

make clean

Removes all Nextflow output except for the most recent run. Use make clean-all to remove all pipeline outputs.

make record PRE=some_prefix_

"Records" copies of the most recent pipeline run's output logs, configuration, Nextflow reports, etc.. Useful for recording analyses that failed or had errors in order to debug. Include the optional argument TASK to specify a Nextflow work directory to include in the records (example: make record PRE=error_something_broke_ TASK=e9/d9ff34).

make kill

Attempts to cleanly shut down a pipeline running on a remote host e.g. inside a SLURM HPC compute job. Note that you can also use scancel to halt the parent Nextflow pipeline job as well.

make fix-permissions, make fix-group

Attempts to fix usergroup and permissions issues that may arise on shared systems with multiple users. Be sure to use the extra argument USERGROUP=somegroup to specify the usergroup to update to.

make finalize-work-rm

Examines the trace.txt output from the most recent completed pipeline run in order to determine while subdirectories in the Nextflow work dir are no longer needed, and then deletes them. Can delete multiple subdirs in parallel when run with make finalize-work-rm -j 20 e.g. specifying to delete 20 at a time, etc.

Software

Developed under Centos 6, RHEL 7, macOS 10.12