A structural variation discovery pipeline for Illumina short-read whole-genome sequencing (WGS) data.
Because GATK-SV has been tested only on the Google Cloud Platform (GCP), we are unable to provide specific guidance or support for other execution platforms, including HPC clusters and AWS. Contributions from the community to improve portability between backends will be considered on a case-by-case basis. We ask contributors to please adhere to the following guidelines when submitting issues and pull requests:
We still encourage members of the community to adapt GATK-SV for non-GCP backends and share code on forked repositories. Here are some considerations:

* The handling of `glob` commands may differ between platforms.
* Shell commands that modify input files (e.g. `rm`, `mv`, `tabix`) can cause unexpected behavior on shared filesystems. Enabling copy localization may help to more closely replicate the behavior on GCP.

#### PED file format

The PED file format is described here. Note that GATK-SV imposes additional requirements:

* The file must be tab-delimited
* The sex column must only contain 0, 1, or 2: 1=Male, 2=Female, 0=Other/Unknown (sex aneuploidies, detected in EvidenceQC, should be entered as sex = 0)
* All family, individual, and parental IDs must conform to the sample ID requirements below; missing parental IDs should be entered as 0

The following script can be used to validate PED files:
```
python src/sv-pipeline/scripts/validate_ped.py -p pedigree.ped -s samples.list
```
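For reference, each PED record has six tab-delimited columns: family ID, sample ID, paternal ID, maternal ID, sex (1=Male, 2=Female, 0=Other/Unknown), and phenotype. A hypothetical record for an unrelated female sample:

```
FAM001	SAMPLE_001	0	0	2	-9
```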
#### Sample Exclusion

We recommend filtering out samples with a high percentage of improperly paired reads (>10% or an outlier for your data) as technical outliers prior to running GatherSampleEvidence. A high percentage of improperly paired reads may indicate issues with library prep, degradation, or contamination. Artifactual improperly paired reads could cause incorrect SV calls, and these samples have been observed to have longer runtimes and higher compute costs for GatherSampleEvidence.
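The properly-paired fraction can be checked quickly with `samtools flagstat` before running the pipeline (a sketch; the file name is a placeholder, and CRAM inputs may require a local reference):

```bash
# Print the "properly paired" percentage reported by samtools flagstat.
# Per the guidance above, consider excluding samples where more than ~10%
# of paired reads are improperly paired.
samtools flagstat sample.bam | grep 'properly paired'
```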
#### Sample ID requirements

Sample IDs must:

* Be unique within the cohort
* Contain only alphanumeric characters and underscores (no dashes, whitespace, or special characters)

Sample IDs should not:

* Contain only numeric characters
* Be a substring of another sample ID in the same cohort
* Contain any of the following substrings: `chr`, `name`, `DEL`, `DUP`, `CPX`, `CHROM`
The same requirements apply to family IDs in the PED file, as well as batch IDs and the cohort ID provided as workflow inputs.
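As a quick sanity check (a sketch, not part of the pipeline), IDs that violate most of these rules can be flagged with grep, assuming one ID per line in `samples.list`:

```bash
# Flag IDs containing forbidden substrings, characters other than
# alphanumerics/underscores, or consisting only of digits. (The
# substring-of-another-ID rule is not checked here.)
grep -nE 'chr|name|DEL|DUP|CPX|CHROM|[^A-Za-z0-9_]|^[0-9]+$' samples.list
```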
Sample IDs are provided to GatherSampleEvidence directly and need not match sample names from the BAM/CRAM headers. `GetSampleID.wdl` can be used to fetch BAM sample IDs and also generates a set of alternate IDs that are considered safe for this pipeline; alternatively, this script transforms a list of sample IDs to fit these requirements. Currently, sample IDs can be replaced again in GatherBatchEvidence. To do so, set the parameter `rename_samples = True` and provide updated sample IDs via the `samples` parameter.
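In an inputs json, the renaming configuration might look like the following sketch (the fully qualified key names and file names are assumptions; check them against the GatherBatchEvidence WDL):

```bash
# Enable sample renaming and supply the updated IDs (hypothetical key names).
jq '."GatherBatchEvidence.rename_samples" = true
  | ."GatherBatchEvidence.samples" = ["NEW_ID_1", "NEW_ID_2"]' \
  GatherBatchEvidence.inputs.json > GatherBatchEvidence.renamed.json
```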
Any workflow inputs that reference sample IDs (e.g. sample ID lists and the PED file) will need to be updated with the transformed sample IDs.
#### Citation

Please cite the following publication: Collins, Brand, et al. 2020. "A structural variation reference for medical and population genetics." *Nature* 581, 444-451.

Additional references: Werling et al. 2018. "An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder." *Nature Genetics* 50.5, 727-736.
The following resources were produced using data from the All of Us Research Program and have been approved by the Program for public dissemination:
The All of Us Research Program is supported by the National Institutes of Health, Office of the Director: Regional Medical Centers: 1 OT2 OD026549; 1 OT2 OD026554; 1 OT2 OD026557; 1 OT2 OD026556; 1 OT2 OD026550; 1 OT2 OD 026552; 1 OT2 OD026553; 1 OT2 OD026548; 1 OT2 OD026551; 1 OT2 OD026555; IAA #: AOD 16037; Federally Qualified Health Centers: HHSN 263201600085U; Data and Research Center: 5 U2C OD023196; Biobank: 1 U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: 1 U24 OD023163; Communications and Engagement: 3 OT2 OD023205; 3 OT2 OD023206; and Community Partners: 1 OT2 OD025277; 3 OT2 OD025315; 1 OT2 OD025337; 1 OT2 OD025276. In addition, the All of Us Research Program would not be possible without the partnership of its participants.
There are two scripts for running the full pipeline:

* `wdl/GATKSVPipelineBatch.wdl`: Runs GATK-SV on a batch of samples.
* `wdl/GATKSVPipelineSingleSample.wdl`: Runs GATK-SV on a single sample, given a reference panel.

Example workflow inputs can be found in `/inputs`. Build using `scripts/inputs/build_default_inputs.sh`, which generates input jsons in `/inputs/build`. Except for the MELT docker image, all required resources are available in public Google buckets.
Important: The example input files contain MELT inputs that are NOT public (see Requirements). These include:

* `GATKSVPipelineSingleSample.melt_docker` and `GATKSVPipelineBatch.melt_docker` - MELT docker URI (see Docker readme)
* `GATKSVPipelineSingleSample.ref_std_melt_vcfs` - Standardized MELT VCFs (GatherBatchEvidence)

The input values are provided only as an example and are not publicly accessible. In order to include MELT, these values must be provided by the user. MELT can be disabled by deleting these inputs and setting `GATKSVPipelineBatch.use_melt` to `false`.
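For example, a sketch of disabling MELT in a copied batch inputs json (the `jq` invocation and file names are illustrative; the key names come from the inputs listed above):

```bash
# Drop the private MELT docker input and turn MELT off.
jq 'del(."GATKSVPipelineBatch.melt_docker")
  | ."GATKSVPipelineBatch.use_melt" = false' \
  GATKSVPipelineBatch.my_run.json > GATKSVPipelineBatch.no_melt.json
```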
We recommend running the pipeline on a dedicated Cromwell server with a cromshell client. A batch run can be started with the following commands:
```
> mkdir gatksv_run && cd gatksv_run
> mkdir wdl && cd wdl
> cp $GATK_SV_ROOT/wdl/*.wdl .
> zip dep.zip *.wdl
> cd ..
> bash $GATK_SV_ROOT/scripts/inputs/build_default_inputs.sh -d $GATK_SV_ROOT
> cp $GATK_SV_ROOT/inputs/build/ref_panel_1kg/test/GATKSVPipelineBatch/GATKSVPipelineBatch.json GATKSVPipelineBatch.my_run.json
> cromshell submit wdl/GATKSVPipelineBatch.wdl GATKSVPipelineBatch.my_run.json cromwell_config.json wdl/dep.zip
```
where `cromwell_config.json` is a Cromwell workflow options file. Note that users will need to re-populate batch/sample-specific parameters (e.g. BAMs and sample IDs).
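The run can then be monitored with standard cromshell commands, which default to the most recently submitted workflow:

```bash
# Check workflow status, then fetch metadata for debugging or input generation.
cromshell status
cromshell -t60 metadata > metadata.json
```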
The pipeline consists of a series of modules that perform the following:

* GatherSampleEvidence: Raw evidence collection on each sample with SV callers (Manta, Wham, and/or MELT)
* EvidenceQC: Ploidy estimation, dosage scoring, and optional VCF QC
* TrainGCNV: gCNV model training
* GatherBatchEvidence: CNV calling (cn.MOPS, GATK-gCNV) and combining of single-sample raw evidence into a batch
* ClusterBatch: Clustering of SV calls across a batch
* GenerateBatchMetrics: Variant metric generation for filtering
* FilterBatch: Filtering of poor quality variants and outlier samples
* MergeBatchSites: Combining of filtered variants across batches
* GenotypeBatch: Genotyping of a batch of samples across variants combined across all batches
* RegenotypeCNVs: Re-genotyping of probable mosaic variants across batches
* MakeCohortVcf: Cross-batch combining, complex variant resolution, re-genotyping, and final VCF clean-up
* Optional downstream filtering, annotation, and visualization modules
Repository structure:
/dockerfiles
: Resources for building pipeline docker images/inputs
: files for generating workflow inputs
/templates
: Input json file templates/values
: Input values used to populate templates/wdl
: WDLs running the pipeline. There is a master WDL for running each module, e.g. ClusterBatch.wdl
./scripts
: scripts for running tests, building dockers, and analyzing cromwell metadata files/src
: main pipeline scripts
/RdTest
: scripts for depth testing/sv-pipeline
: various scripts and packages used throughout the pipeline/svqc
: Python module for checking that pipeline metrics fall within acceptable limits/svtest
: Python module for generating various summary metrics from module outputs/svtk
: Python module of tools for SV-related datafile parsing and analysis/WGD
: whole-genome dosage scoring scriptsA minimum cohort size of 100 is required, and a roughly equal number of males and females is recommended. For modest cohorts (~100-500 samples), the pipeline can be run as a single batch using GATKSVPipelineBatch.wdl
.
For larger cohorts, samples should be split up into batches of about 100-500 samples. Refer to the Batching section for further guidance on creating batches.
The pipeline should be executed as follows:

* GatherSampleEvidence and EvidenceQC can be run on arbitrary cohort partitions
* GatherBatchEvidence, ClusterBatch, GenerateBatchMetrics, and FilterBatch are run separately per batch
* GenotypeBatch is run separately per batch, using variants filtered with MergeBatchSites across all batches
* MakeCohortVcf and beyond are run on all batches together
Note: GatherBatchEvidence requires a trained gCNV model.
#### Batching

For larger cohorts, samples should be split up into batches of about 100-500 samples with similar characteristics. We recommend batching based on overall coverage and dosage score (WGD), which can be generated in EvidenceQC. An example batching process is outlined below:

1. Partition the cohort by median coverage from EvidenceQC, grouping samples with similar coverage together
2. Optionally, partition further by dosage score (WGD) from EvidenceQC, grouping samples with similar WGD scores together
3. Adjust the partitions as needed to obtain roughly equal-sized batches of about 100-500 samples, maintaining a roughly equal male:female ratio within each batch
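As a toy sketch of the coverage partitioning step, assuming a two-column, tab-delimited table of sample ID and median coverage compiled from EvidenceQC outputs (file names and batch size are placeholders):

```bash
# Sort samples by median coverage and cut into fixed-size batches of 300
# sample IDs (written to batch_aa, batch_ab, ...).
sort -k2,2n sample_coverage.tsv | cut -f1 | split -l 300 - batch_
```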
`GATKSVPipelineSingleSample.wdl` runs the pipeline on a single sample using a fixed reference panel. An example run with a reference panel containing 156 samples from the NYGC 1000G Terra workspace can be found in `inputs/build/NA12878/test` (after building inputs).
#### gCNV Training Overview

Both the cohort and single-sample modes use the GATK-gCNV depth calling pipeline, which requires a trained model as input. The samples used for training should be technically homogeneous and similar to the samples to be processed (i.e. same sample type, library prep protocol, sequencer, sequencing center, etc.). The samples to be processed may comprise all or a subset of the training set. For small, relatively homogeneous cohorts, a single gCNV model is usually sufficient. If a cohort contains multiple data sources, we recommend training a separate model for each batch or group of batches with similar dosage score (WGD). The model may be trained on all or a subset of the samples to which it will be applied; a reasonable default is 100 randomly-selected samples from the batch (the random selection can be done as part of the workflow by specifying a number of samples to the `n_samples_subsample` input parameter in `/wdl/TrainGCNV.wdl`).
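For example, a sketch of enabling subsampling in a TrainGCNV inputs json (the fully qualified key name and file names are assumptions; confirm against `/wdl/TrainGCNV.wdl`):

```bash
# Train the gCNV model on 100 randomly selected samples from the batch.
jq '."TrainGCNV.n_samples_subsample" = 100' \
  TrainGCNV.inputs.json > TrainGCNV.my_run.json
```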
New reference panels can be generated easily from a single run of the `GATKSVPipelineBatch` workflow. If using a Cromwell server, we recommend copying the outputs to a permanent location by adding the following option to the workflow configuration file:

```
"final_workflow_outputs_dir" : "gs://my-outputs-bucket",
"use_relative_output_paths": false,
```
Here is an example of how to generate workflow input jsons from `GATKSVPipelineBatch` workflow metadata:

```
> cromshell -t60 metadata 38c65ca4-2a07-4805-86b6-214696075fef > metadata.json
> python scripts/inputs/create_test_batch.py \
    --execution-bucket gs://my-exec-bucket \
    --final-workflow-outputs-dir gs://my-outputs-bucket \
    metadata.json \
    > inputs/values/my_ref_panel.json
> # Build test files for batched workflows
> python scripts/inputs/build_inputs.py \
    inputs/values \
    inputs/templates/test \
    inputs/build/my_ref_panel/test \
    -a '{ "test_batch" : "ref_panel_1kg" }'
> # Build test files for the single-sample workflow
> python scripts/inputs/build_inputs.py \
    inputs/values \
    inputs/templates/test/GATKSVPipelineSingleSample \
    inputs/build/NA19240/test_my_ref_panel \
    -a '{ "single_sample" : "test_single_sample_NA19240", "ref_panel" : "my_ref_panel" }'
> # Build files for a Terra workspace
> python scripts/inputs/build_inputs.py \
    inputs/values \
    inputs/templates/terra_workspaces/single_sample \
    inputs/build/NA12878/terra_my_ref_panel \
    -a '{ "single_sample" : "test_single_sample_NA12878", "ref_panel" : "my_ref_panel" }'
```
Note that the inputs to `GATKSVPipelineBatch` may be used as resources for the reference panel and therefore should also be in a permanent location.
The following sections briefly describe each module and highlight inter-dependent input/output files. Note that input/output mappings can also be gleaned from `GATKSVPipelineBatch.wdl`, and example input templates for each module can be found in `/inputs/templates/test`.
#### GatherSampleEvidence

*Formerly Module00a*

Runs raw evidence collection on each sample with the following SV callers: Manta, Wham, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence, refer to the Sample Exclusion section.
Note: a list of sample IDs must be provided. Refer to the sample ID requirements for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors. Index files (`.bai`) must be provided if using BAMs.

#### EvidenceQC

*Formerly Module00b*
Runs ploidy estimation, dosage scoring, and optionally VCF QC. The results from this module can be used for QC and batching.
For large cohorts, this workflow can be run on arbitrary cohort partitions of up to about 500 samples. Afterwards, we recommend using the results to divide samples into smaller batches (~100-500 samples) with ~1:1 male:female ratio. Refer to the Batching section for further guidance on creating batches.
We also recommend using sex assignments generated from the ploidy estimates and incorporating them into the PED file, with sex = 0 for sex aneuploidies.
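A sketch of that PED update, assuming a tab-delimited PED file and a list of aneuploid sample IDs, one per line (both file names are hypothetical):

```bash
# Set sex to 0 (unknown) for samples with sex aneuploidies.
awk 'BEGIN { OFS = "\t" }
  NR == FNR { aneuploid[$1]; next }   # first file: aneuploid sample IDs
  $2 in aneuploid { $5 = 0 }          # PED: column 2 = sample ID, column 5 = sex
  { print }' aneuploid_samples.list pedigree.ped > pedigree.updated.ped
```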
The purpose of sample filtering at this stage after EvidenceQC is to prevent very poor quality samples from interfering with the results for the rest of the callset. In general, samples that are borderline are okay to leave in, but you should choose filtering thresholds to suit the needs of your cohort and study. There will be future opportunities (as part of FilterBatch) for filtering before the joint genotyping stage if necessary. Here are a few of the basic QC checks that we recommend:

* Check that the sex assignments from the X and Y ploidy plots match expectations, and investigate any discrepancies (e.g. sample swaps) before proceeding
* Check the dosage score (WGD) distribution and optionally filter extreme outliers
* Check the per-caller SV count outliers; samples with far more or far fewer calls than typical may be poor quality
* Remove samples with large autosomal aneuploidies based on the ploidy estimates
#### TrainGCNV

Trains a gCNV model for use in GatherBatchEvidence. The WDL can be found at `/wdl/TrainGCNV.wdl`. See the gCNV Training Overview section above for more information.

#### GatherBatchEvidence

*Formerly Module00c*
Runs CNV callers (cn.MOPS, GATK-gCNV) and combines single-sample raw evidence into a batch. See above for more information on batching.
#### ClusterBatch

*Formerly Module01*
Clusters SV calls across a batch.
#### GenerateBatchMetrics

*Formerly Module02*
Generates variant metrics for filtering.
#### FilterBatch

*Formerly Module03*
Filters poor quality variants and outlier samples. This workflow can be run all at once with the WDL at `wdl/FilterBatch.wdl`, or it can be run in two steps to enable tuning of outlier filtration cutoffs. The two subworkflows are:
1. `FilterBatchSites`: Per-batch variant filtration. This step also produces SV count plots and previews of outlier samples at candidate cutoffs.
2. `FilterBatchSamples`: Per-batch outlier sample filtration. Provide an appropriate `outlier_cutoff_nIQR` based on the SV count plots and outlier previews from step 1. Note that not removing high outliers can result in increased compute cost and a higher false positive rate in later steps.

#### MergeBatchSites

*Formerly MergeCohortVcfs*
Combines filtered variants across batches. The WDL can be found at `/wdl/MergeBatchSites.wdl`.

#### GenotypeBatch

*Formerly Module04*
Genotypes a batch of samples across unfiltered variants combined across all batches.
#### RegenotypeCNVs

*Formerly Module04b*
Re-genotypes probable mosaic variants across multiple batches.
#### MakeCohortVcf

*Formerly Module0506*
Combines variants across multiple batches, resolves complex variants, re-genotypes, and performs final VCF clean-up.
#### Downstream Filtering

Apply downstream filtering steps to the cleaned VCF to further control the false discovery rate; all steps are optional, and users should decide based on the specific purposes of their projects.
Filtering methods include:

* minGQ - remove variants based on the genotype quality across populations. Note: trio families are required to build the minGQ filtering model in this step. We provide tables pre-trained with the 1000 Genomes samples at different FDR thresholds for projects that lack family structures; they can be found at the paths below. These tables assume that GQ has a scale of [0,999], so they will not work with newer VCFs where GQ has a scale of [0,99].
  * `gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.10perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt`
  * `gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.1perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt`
  * `gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.5perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt`
* BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches
* FilterOutlierSamplesPostMinGQ - remove outlier samples with an unusually high or low number of SVs
* FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation
#### AnnotateVcf

*Formerly Module08Annotation*

Adds annotations, such as the inferred function and allele frequencies of variants, to the final VCF.

Annotation methods include:

* Functional annotation - annotate SVs with inferred functional consequences on protein-coding regions, regulatory regions such as UTRs and promoters, and other non-coding elements
* Allele frequency annotation - annotate SVs with their allele frequencies across all samples, and across samples of specific sexes and sub-populations
* Allele frequency annotation with an external callset - annotate SVs with the allele frequencies of their overlapping SVs in another callset, e.g. the gnomAD SV callset
#### Visualization

Visualize SVs with IGV screenshots and read depth plots.

Visualization methods include:

* Read depth visualization - generate read depth plots across all samples, ideal for visualizing large CNVs
* IGV visualization - generate IGV screenshots of each SV for individual samples, ideal for visualizing small de novo SVs
#### CI/CD

This repository is maintained following the norms of continuous integration (CI) and continuous delivery (CD). GATK-SV CI/CD is developed as a set of GitHub Actions workflows that are available under the `.github/workflows` directory. Please refer to the workflows' README for their current coverage and setup.
#### Troubleshooting

Most runtime attributes can be modified with the `RuntimeAttr` inputs. These are formatted like this in the json:

```
"MyWorkflow.runtime_attr_override": {
  "disk_gb": 100,
  "mem_gb": 16
},
```
Note that a subset of the struct attributes can be specified. See `wdl/Structs.wdl` for available attributes.
Example error message from `GatherSampleEvidence.MELT.GetWgsMetrics`:

```
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: The requested index 701766 is out of counter bounds. Possible cause of exception can be wrong READ_LENGTH parameter (much smaller than actual read length)
```
This error message was observed for a sample with an average read length of 117, but for which half the reads were of length 90 and half were of length 151. As a workaround, override the calculated read length by providing a `read_length` input of 151 (or the expected read length for the sample in question) to GatherSampleEvidence.
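To choose an appropriate override, the read-length distribution can be inspected directly (a sketch; the file name is a placeholder, and CRAM inputs may need `-T reference.fasta`):

```bash
# Tally read lengths over the first million records to detect mixed-length data.
samtools view sample.bam | head -n 1000000 | awk '{ print length($10) }' | sort -n | uniq -c
```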