NBISweden / wgs-structvar

Whole Genome Sequencing Structural Variation Pipelines
GNU General Public License v3.0

Whole Genome Sequencing Structural Variation Pipeline

Quick start

# Install nextflow:
curl -fsSL get.nextflow.io | bash
mv ./nextflow ~/bin

# Set work dir to no-backup, put this in your .bashrc
export NXF_WORK=$SNIC_NOBACKUP/work

# Pull the workflow from this repo, then run manta, normalize, and variant effect predictor:
nextflow run -profile milou NBISweden/wgs-structvar --project <uppmax_project_id> --bam <bamfile.bam> --steps manta,normalize,vep

# Monitor log file
tail -f .nextflow.log

Your summary files will be in the results subdirectory.

General information

This is a pipeline for running the two structural variation callers fermikit and manta on UPPMAX. You can run either caller or both, and the pipeline generates summary files. The main focus of this pipeline is to enable better comparisons with the SweGen dataset, so the default parameters for the tools are the same as those used for that dataset. If you have access to the structural variants in the SweGen dataset, you can add that file to the pipeline and use it to filter population-specific variants.

Profiles for running on Uppmax HPC clusters

The pipeline can be run in a few different ways: either as a single-node job, or by letting Nextflow distribute the tasks using the SLURM queueing system. There are also some slight differences in module usage depending on which HPC system is used.

Specify the profile to use with the -profile option to Nextflow:

-profile milou
Run on the milou cluster using the queueing system (for example, directly from the login node).
-profile miloulocal
Run on milou but only on the local node. Use this inside a batch job on a single node; reserving the node for 48 hours should be sufficient.
-profile bianca
The same as `milou` but on the Bianca system
-profile biancalocal
The same as `miloulocal` but on the Bianca system
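
As a rough illustration of the single-node mode, the sketch below prints a SLURM batch script that reserves one node for 48 hours and runs the workflow with the miloulocal profile. The function name make_local_job, the partition name "node" and the job name are assumptions made for this example and are not taken from the pipeline; adjust them to your cluster.

```shell
#!/usr/bin/env bash
# Print a SLURM batch script that runs the whole workflow on a single node.
# Assumptions (not from the pipeline docs): partition "node", job name
# "wgs-structvar", and the helper name make_local_job.
make_local_job() {
    local project="$1" bam="$2"
    cat <<EOF
#!/bin/bash -l
#SBATCH -A ${project}
#SBATCH -p node
#SBATCH -N 1
#SBATCH -t 48:00:00
#SBATCH -J wgs-structvar

nextflow run -profile miloulocal NBISweden/wgs-structvar \\
    --project ${project} --bam ${bam} --steps manta,normalize,vep
EOF
}

# Show the generated script with placeholder values.
make_local_job "<uppmax_project_id>" "<bamfile.bam>"
```

To use it, redirect the output to a file (for example local_job.sh) with real values filled in and submit that file with sbatch.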

Masking

Artifact masking

The pipeline will use the following mask files to remove known artifacts:

You can configure the location of the artifact mask files with the --mask_artifacts_dir command line option.

Cohort masking

The pipeline can take bed files to filter variants. To run the pipeline with filters, put the bed files in the mask_cohort/ subdirectory and add mask_cohort to the comma-separated --steps command line argument, e.g.:

cp some_bed_file.bed <path-to-wgs-structvar>/mask_cohort/

nextflow run -profile biancalocal <path-to-wgs-structvar>/main.nf --project <uppmax_project_id> --bam <bamfile.bam> --steps manta,normalize,vep,mask_cohort

You can configure the location of the cohort mask files with the --mask_cohort_dir command line option.

Detailed usage

Command line options

Run a local copy of the wgs-structvar WF:
    nextflow run main.nf --bam <bamfile> [more options]
OR run from github:
    nextflow run nbisweden/wgs-structvar --bam <bamfile> [more options]

Options:
  Required
    --bam           Input bamfile
       OR
    --runfile       Input runfile for multiple bamfiles in the same run.
                    Whitespace separated, first column is bam file,
                    second column is output directory and an optional third column
                    with a run id to more easily keep track of the run (otherwise
                    it's autogenerated).
    --project       Uppmax project to log cluster time to
    -profile <profile>
                    Where profile can be any of milou, localmilou, bianca,
                    localbianca and devel. The local<x> are for running the
                    entire workflow on a single node on the cluster, without
                    the local prefix the slurm queueing system is used.
  Optional
    --help          Show this message and exit
    --fastq         Input fastqfile (default is bam but with fq as fileending)
                    Used by fermikit, will be created from the bam file if
                    missing.
    --steps         Specify what steps to run, comma separated: (default: manta, vep)
                Callers: manta, fermikit
                Annotation: vep, snpeff
                Extra: normalize (with vt),
                       mask_cohort (with bed files in mask_cohort/)
    --sg_mask_ovlp  Fractional overlap for use with the filter option
    --no_sg_reciprocal  Don't use a reciprocal overlap for the filter option
    --outdir        Directory where resultfiles are stored (default: results)
    --prefix        Prefix for result filenames (default: no prefix)
    --mask_artifacts_dir
                    Directory with bed files for artifact filtering (default: mask_artifacts)
    --mask_cohort_dir
                    Directory with bed files for cohort filtering (default: mask_cohort)

The log file .nextflow.log will be produced when running and can be monitored by e.g. tail -f .nextflow.log
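
The --runfile format described in the usage message can be produced with a small helper like the sketch below. It writes one line per BAM file with the three whitespace-separated columns (BAM file, output directory, run id). The function name make_runfile, the choice of one output directory per sample and the use of the file name as run id are assumptions made for illustration. Because the format is whitespace separated, paths containing spaces will not work.

```shell
#!/usr/bin/env bash
# Build a runfile with one line per BAM file:
#   <bam file> <output directory> <run id>
# Assumptions: the output directory is <out_root>/<sample> and the run id is
# the BAM file name without its .bam extension.
make_runfile() {
    local bam_dir="$1" out_root="$2"
    local bam sample
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue    # skip when the glob matched nothing
        sample="$(basename "$bam" .bam)"
        printf '%s %s %s\n' "$bam" "$out_root/$sample" "$sample"
    done
}

# Typical use (paths are placeholders):
#   make_runfile /path/to/bams results > runfile.txt
#   nextflow run -profile milou NBISweden/wgs-structvar \
#       --project <uppmax_project_id> --runfile runfile.txt
```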

Customization

Nextflow can pull from GitHub (master branch), so if you specify this repo it will run whatever is currently in it. However, if you want to customize the parameters further, you will want to clone the repo and edit the nextflow.config file in it. Most likely only the params scope of the config file is of interest for customization.

The first part has the default values for the command line parameters, see the usage message for information on them.

The next section has the reference assembly to use, both as fasta and assembly name.

You may want to use different versions of the modules used by this workflow. Currently you have to edit the profiles to do that. On UPPMAX we have the milou profile, which specifies all the modules and versions; see config/milou.config.

The runtimes of the different programs are set in the config/standard.config file. That file also specifies how to deal with errors and the interaction with the SLURM scheduler. You probably don't want to change those settings unless you know what you are doing.

The two folders mask_artifacts and mask_cohort contain bed files used to filter the VCF files from the callers. The artifact directory contains files that should remove problematic regions: everything that has an overlap of at least 25% with a region in the artifact mask is removed. The cohort directory is for more stringent filtering of already known variants, and here the default filter threshold is instead a reciprocal overlap of 95%. The cohort filter can be customized with the two options sg_mask_ovlp (default 0.95) and no_sg_reciprocal.
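
To make the difference between the two thresholds concrete, here is a minimal sketch of a one-sided versus a reciprocal overlap test on a single pair of intervals. It is not the pipeline's own implementation. The function name is_masked is hypothetical, coordinates are assumed to be BED-style half-open intervals of positive length, and the one-sided fraction is assumed to be measured relative to the variant.

```shell
#!/usr/bin/env bash
# is_masked V_START V_END M_START M_END MIN_FRACTION [RECIPROCAL]
# Exit status 0 if the variant would be filtered by the mask region, 1 otherwise.
# With RECIPROCAL=1 both the variant and the mask region must be covered by at
# least MIN_FRACTION; otherwise only the variant is checked (an assumption).
is_masked() {
    awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" \
        -v min="$5" -v recip="${6:-0}" 'BEGIN {
        lo = (s1 > s2) ? s1 : s2
        hi = (e1 < e2) ? e1 : e2
        ov = (hi > lo) ? hi - lo : 0          # overlap length in bases
        f1 = ov / (e1 - s1)                   # fraction of the variant covered
        f2 = ov / (e2 - s2)                   # fraction of the mask region covered
        ok = recip ? (f1 >= min && f2 >= min) : (f1 >= min)
        exit (ok ? 0 : 1)
    }'
}

# Artifact-style filter: half of the variant overlaps the mask, threshold 25%.
is_masked 1000 2000 1500 3000 0.25 && echo "artifact: filtered"

# Cohort-style filter: reciprocal 95% overlap with a known variant.
is_masked 1000 2000 1000 2040 0.95 1 && echo "cohort (similar size): filtered"
is_masked 1000 2000 1000 3000 0.95 1 || echo "cohort (known variant twice as long): kept"
```

The last call shows why the reciprocal test is the stricter match: the variant lies entirely inside the known region, but it covers only half of that region, so it is not treated as the same variant.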

Support

If you need help with this pipeline, please create a support issue on GitHub.

Other tools for generating structural variants

External links