StellarPGx

StellarPGx: Calling star alleles in highly polymorphic pharmacogenes by leveraging genome graph-based variant detection.

CYP450 genes supported: CYP2D6, CYP2A6, CYP2B6, CYP2C19, CYP2C9, CYP2C8, CYP3A4, CYP3A5, CYP1A1, CYP1A2, CYP2E1, and CYP4F2

Other pharmacogenes supported: CYPOR (POR), NAT1, NAT2, GSTM1, GSTT1, SLCO1B1, NUDT15, TPMT, and UGT1A1

StellarPGx is built using Nextflow, a workflow management system that facilitates parallelisation, scalability, reproducibility and portability of pipelines via Docker and Singularity technologies.

Please always use the latest version of StellarPGx.

At present, StellarPGx only supports short-read high-coverage whole genome sequence data as input. Enhancements to include support for exome, ADME gene panel and/or long-read WGS are ongoing.

Maintainer: David Twesigomwe (twesigomwedavid@gmail.com)

Getting started

The following are required to run the StellarPGx pipeline:

  1. Prerequisite software

Singularity (v3.1.x or higher) is highly recommended, especially for running the pipeline in an HPC environment running Linux. Docker Desktop is recommended for MacOS users who intend to run or test the pipeline on a local machine. If you only use your Mac to connect to a Linux cluster, proceed with Singularity (the default) on the cluster.

  2. Whole genome sequence (WGS) data

    • Indexed BAM/CRAM files

  3. Reference genome

    • hg19, b37, or hg38

Note: For a full description of the differences among reference genomes, please see the documentation by the GATK team. For this pipeline, if your GRCh37 reference genome has contigs that start with 'chr' (i.e. chr1, chr2, ..., chrX, chrM, ...), use the hg19 option. Use the b37 option if the contigs in your GRCh37 reference genome do not have the 'chr' prefix (i.e. 1, 2, ..., X, MT). For GRCh38, the hg38 option is sufficient.
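As a rough aid for choosing the --build value, the sketch below inspects a FASTA index (.fai) file and uses the name and length of chromosome 1 (248,956,422 bp in GRCh38; 249,250,621 bp in GRCh37). The function name suggest_build and the reliance on a .fai file are assumptions made for illustration; this helper is not part of StellarPGx.

```shell
# Hypothetical helper (not part of StellarPGx): suggest a --build value
# from a FASTA index. Column 1 of a .fai file is the contig name and
# column 2 is the contig length in bases.
suggest_build() {
    awk '
        $1 == "chr1" || $1 == "1" {
            found = 1
            if ($2 == 248956422) {
                print "hg38"                            # GRCh38 chromosome 1
            } else if ($2 == 249250621) {
                print ($1 == "chr1" ? "hg19" : "b37")   # GRCh37 chromosome 1
            } else {
                print "unknown"
            }
            exit
        }
        END { if (!found) print "unknown" }
    ' "$1"
}

# Example (the path is a placeholder):
# suggest_build /path/to/reference/genome.fasta.fai
```

If the result is "unknown", fall back to checking the contig names by hand as described in the note above.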

Installation

Nextflow:

Install Nextflow by running the following command (Skip if you have Nextflow installed already):

curl -fsSL get.nextflow.io | bash

Move the nextflow launcher (installed in your current directory) to a directory in your $PATH, e.g. $HOME/bin:

mv nextflow $HOME/bin

(See the official Nextflow documentation for full details.)

Singularity or Docker:

For Singularity installation, please refer to the official Singularity documentation. Ensure that your Singularity installation allows user-defined binds; this is set by your system administrator (see the Singularity config file documentation).

For Docker installation, please refer to the official Docker documentation.

StellarPGx repository:

Clone the StellarPGx repository by running the following command:

git clone https://github.com/SBIMB/StellarPGx.git && cd StellarPGx
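Once the installation steps above are done, a quick check that the tools are actually on your PATH can save a failed run later. The sketch below is a generic shell helper; the name missing_tools is an assumption and the helper is not shipped with StellarPGx.

```shell
# Print the name of every listed tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: check the prerequisites for the default (Singularity) setup.
# An empty result means everything was found.
# missing_tools nextflow singularity git
```

For a Docker-based setup, replace singularity with docker in the example call.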

Running StellarPGx on the provided test dataset(s) - using Singularity (default)

The following steps assume that:

  i. StellarPGx is your current working directory
  ii. Nextflow and Singularity are already installed

Step 1 - Parameters

The parameters for Singularity are set as default, so there is no need to change anything.

Step 2 - Execution of the pipeline

For execution on a local machine or single cluster node:

nextflow run main.nf -profile standard,test

For execution on a SLURM scheduler:

nextflow run main.nf -profile slurm,test

Note:

If you get the error "Failed to submit process to grid scheduler for execution", ask your system administrator for the appropriate queue name on your cluster and set the process.queue value in the slurm stanza (see the nextflow.config file) to it. The default is bash; other examples include defq, Main, etc.
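If SLURM's sinfo command is available on your cluster, you can also list the partition (queue) names yourself before asking. The sketch below only tidies the output of sinfo, which marks the default partition with a trailing '*'; the helper name partition_names is an assumption for illustration.

```shell
# Hypothetical helper: clean up partition names as printed by
# `sinfo -h -o '%P'`, which marks the default partition with a trailing '*'.
partition_names() {
    tr -d '*' | sort -u
}

# Example, on a cluster where SLURM's sinfo is available:
# sinfo -h -o '%P' | partition_names
```

Any of the printed names is a candidate for process.queue, although which partitions you are allowed to use still depends on your cluster's policy.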

Step 3 - Expected output

The expected output file (SIM001_2d6.alleles) for the test dataset SIM001.bam will be found in the ./results/cyp2d6/alleles directory. It should contain the following:

--------------------------------------------

CYP2D6 Star Allele Calling with StellarPGx

--------------------------------------------

Initially computed CN = 2

Core variants:
42126611~C>G~1/1;42127608~C>T~0/1;42127941~G>A~1/1;42129132~C>T~0/1;42129770~G>A~0/1

Candidate alleles:
['17.v1_29.v1']

Result:
*17/*29

Activity score:
1.0

Metaboliser status:
Intermediate metaboliser (IM)
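To check a single field without opening the file, a small awk wrapper is enough. This is a minimal sketch that assumes the layout shown above (a "Label:" line followed by its value on the next non-empty line); the function name alleles_field is hypothetical and not part of StellarPGx.

```shell
# Hypothetical helper: print the first non-empty line that follows a
# "<label>:" header line in a StellarPGx .alleles file.
alleles_field() {
    awk -v label="$1" '
        $0 == (label ":") { want = 1; next }
        want && NF        { print; exit }
    ' "$2"
}

# Example, after running the test profile:
# alleles_field "Result" results/cyp2d6/alleles/SIM001_2d6.alleles
# alleles_field "Metaboliser status" results/cyp2d6/alleles/SIM001_2d6.alleles
```

For the test output shown above, the "Result" lookup should print the diplotype line, which you can compare against the expected *17/*29. The match on the header line is exact, so trailing whitespace in the file would need to be handled separately.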

Running StellarPGx on the provided test dataset(s) - using Docker

At the moment, only Docker Desktop on MacOS has been tested. The following steps assume that you have already installed Docker Desktop on your Mac as indicated above.

Step 1 - Pull the Docker container

Pull the stellarpgx-dev container from Docker Hub by running the command below:

docker pull twesigomwedavid/stellarpgx-dev:latest

Step 2 - Disable Singularity (default) and enable Docker instead in the nextflow.config file

Enabling Docker in the nextflow.config file:

docker {
    enabled = true
    runOptions = '-u \$(id -u):\$(id -g)'
  }

Disabling Singularity in the nextflow.config file:

singularity {
    enabled = false
    autoMounts = true
    cacheDir = "$PWD/containers"
  }

Additionally, comment out the Singularity container variable (default) and set the container variable to point to the Docker image instead, i.e.

// container = "$PWD/containers/stellarpgx-dev.sif"  // takes the Singularity container out of the equation

container = "twesigomwedavid/stellarpgx-dev:latest" // sets the container path to the Docker image containing all the dependencies that StellarPGx requires

Step 3 - Execution on a local machine

(Assumes that you're running Docker Desktop for MacOS)

nextflow run main.nf -profile standard,test

Step 4 - Expected output

The expected output is the same as for the Singularity run above.

Running StellarPGx on your project data

Tip:

If you connect to a computer cluster over a network, it is highly recommended to run StellarPGx inside a screen session when analysing multiple samples, so that the jobs are not terminated midway if your connection breaks.
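As a small safeguard, you can test whether the current shell is already running under screen before launching a long analysis. The sketch below relies on the STY environment variable, which GNU screen sets inside its sessions; the function name inside_screen and the session name in the comments are arbitrary choices.

```shell
# GNU screen sets the STY environment variable inside a session, so a
# non-empty STY is a reasonable hint that the shell is running under screen.
inside_screen() {
    [ -n "${STY:-}" ]
}

# Typical use (the session name "stellarpgx" is arbitrary):
#   screen -S stellarpgx      # start a named session, then launch the pipeline
#   (detach with Ctrl-a d; the pipeline keeps running)
#   screen -r stellarpgx      # reattach later to check on progress
inside_screen || echo "Warning: not inside a screen session" >&2
```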

Once again, the following steps assume that:

  i. StellarPGx is your current working directory
  ii. Nextflow and Singularity or Docker are already installed

Step 1 - Singularity vs Docker

Follow the guidelines above to choose between Singularity and Docker. To reiterate, we recommend Docker for MacOS desktop users. Singularity (default) is ideal for running StellarPGx on HPC cluster/server environments running Linux and also on local Linux machines.

Step 2 - Set the input paths in the nextflow.config file

Set the parameters for your input data (in_bam) and the reference genome (ref_file) in the nextflow.config file following the syntax described therein.

For single sample:

in_bam = "/path/to/Sample*{bam,bai}"

For all samples stored in the same directory (if the samples are stored in various directories, it is advisable to create symlinks to them in one data directory):

in_bam = "/path/to/*{bam,bai}"

Feel free to also specify samples with particular strings in their names:

in_bam = "/path/to/HG*{bam,bai}"

For CRAM input:

in_bam = "/path/to/samples/*{cram,crai}"

For reference genome:

ref_file = "/path/to/reference/genome.fasta"

Results directory:

Optionally, you may set the out_dir to a path of choice. The default output folder is ./results under the StellarPGx directory.
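Because the in_bam patterns above match each alignment file together with its index, it can be worth confirming that every BAM in the data directory has an index before starting. The sketch below is a hypothetical helper, not part of StellarPGx; it assumes the index sits next to the BAM and is named either sample.bam.bai or sample.bai.

```shell
# Hypothetical helper: list BAM files in a directory that have no index
# next to them. Both common naming styles are accepted:
#   sample.bam.bai  and  sample.bai
unindexed_bams() {
    dir=$1
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue      # the glob matched nothing
        if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
            printf '%s\n' "$bam"
        fi
    done
}

# Example (the path is a placeholder); no output means all BAMs are indexed:
# unindexed_bams /path/to
```

Any file it prints needs an index before it can be used as input; which of the two index naming styles your in_bam pattern actually pairs up is worth checking against your own data.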

Step 3 - Run the pipeline (Default is for GRCh38 aligned data)

For execution on a local machine

nextflow run main.nf -profile standard --build [hg38/b37/hg19] --gene [e.g. cyp2d6]

For execution via a scheduler e.g. SLURM

nextflow run main.nf -profile slurm --build [hg38/b37/hg19] --gene [e.g. cyp2d6]

Using CRAM input

If you are using CRAM files as input, be sure to supply the option --format compressed:

nextflow run main.nf -profile [standard/slurm etc] --format compressed --build [hg38/b37/hg19] --gene [e.g. cyp2d6]

GRCh37 aligned data

If your data is aligned to b37 or humanG1Kv37 (contigs without 'chr' at the start), run the pipeline with the option --build b37:

nextflow run main.nf -profile [standard/slurm etc] --build b37 --gene [e.g. cyp2d6]

If instead your data is aligned to hg19 or GRCh37 (most/all contigs starting with 'chr'), run the pipeline with the option --build hg19:

nextflow run main.nf -profile [standard/slurm etc] --build hg19 --gene [e.g. cyp2d6]
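The options above can be combined in a small wrapper that assembles the command line and rejects a mistyped build name before Nextflow starts. This is only a sketch: the function name stellarpgx_cmd and its argument order are assumptions, and it prints the command rather than running it.

```shell
# Hypothetical wrapper: print the nextflow command for one gene, refusing
# build names other than the three that this README lists.
# Usage: stellarpgx_cmd <profile> <build> <gene> [format]
stellarpgx_cmd() {
    profile=$1
    build=$2
    gene=$3
    format=${4:-}
    case $build in
        hg38|b37|hg19) ;;
        *) echo "unsupported build: $build" >&2; return 1 ;;
    esac
    cmd="nextflow run main.nf -profile $profile --build $build --gene $gene"
    if [ -n "$format" ]; then
        cmd="$cmd --format $format"
    fi
    printf '%s\n' "$cmd"
}

# Example: print the command for CRAM input aligned to b37 on SLURM.
# stellarpgx_cmd slurm b37 cyp2d6 compressed
```

Printing the command first makes it easy to review or log it; once it looks right, paste it into the shell from the StellarPGx directory.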

Step 4 - Results

See the result files matching each sample in the ./results/<gene> folder or your custom predefined path.

Notice that there is a separate result-file for each sample and a separate results directory for each gene. We have included a handy script called get_results_summary.sh under ./scripts/general in order to facilitate getting the summary of allele calls after running StellarPGx.

For example, to get a summary of CYP2D6 allele calls for each sample in the ./results/cyp2d6 folder after analysis, the following steps produce a simple summary table:

  1. Copy the get_results_summary.sh script to the directory with the results files

cp path/to/scripts/general/get_results_summary.sh path/to/results/cyp2d6/

  2. Run get_results_summary.sh as follows:

bash get_results_summary.sh -s <sample-names-list> -o <output-file-name>
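If you do not already have a list of sample names, one way to build it is from the result file names themselves. The sketch below assumes result files are named like the test output (SIM001_2d6.alleles, i.e. sample name, underscore, gene tag) and that the list expected by get_results_summary.sh is one sample name per line; both points are assumptions to verify against the script, and list_samples is a hypothetical name.

```shell
# Hypothetical helper: derive sample names from result files named
# <sample>_<gene-tag>.alleles (e.g. SIM001_2d6.alleles -> SIM001).
list_samples() {
    for f in "$1"/*.alleles; do
        [ -e "$f" ] || continue          # the glob matched nothing
        name=${f##*/}                    # strip the directory part
        printf '%s\n' "${name%_*.alleles}"   # strip the shortest _<tag>.alleles suffix
    done
}

# Example (paths are placeholders):
# list_samples path/to/results/cyp2d6/alleles > samples.txt
```

Because only the shortest matching suffix is removed, sample names that themselves contain underscores are preserved.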

NB:

Since StellarPGx is based on Nextflow, a directory called work is created each time you run the pipeline. The work directory is primarily useful for debugging purposes as it contains the input, output, script details and error report for each process in the pipeline. Remember to delete these work directories after your analysis to save space on your disk.
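A cautious way to do this clean-up is to remove the work directory only when a results directory exists alongside it. The sketch below is a generic shell helper under that assumption (the name clean_work is hypothetical); Nextflow also provides its own clean subcommand, which you may prefer.

```shell
# Hypothetical helper: delete <run_dir>/work only if <run_dir>/results
# also exists, so an unfinished run is not wiped by accident.
clean_work() {
    run_dir=${1:-.}
    if [ -d "$run_dir/results" ] && [ -d "$run_dir/work" ]; then
        rm -rf -- "$run_dir/work"
    fi
}

# Example, from the StellarPGx directory after checking your results:
# clean_work .
```

Note that deleting the work directory also removes the cached task outputs that Nextflow would otherwise reuse when resuming a run.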

Citation

If you use StellarPGx in your PGx analysis, please cite our article:

David Twesigomwe, Britt I. Drögemöller, Galen E.B. Wright, Azra Siddiqui, Jorge da Rocha, Zané Lombard and Scott Hazelhurst. StellarPGx: A Nextflow pipeline for calling star alleles in cytochrome P450 genes. Clinical Pharmacology and Therapeutics, 110(3), 741–749. doi:10.1002/cpt.2173.

License

MIT License

Thank you for choosing StellarPGx :nerd_face: