
Canvas Copy Number Variant Caller

Canvas is a tool for calling copy number variants (CNVs) from human DNA sequencing data. It can work either with germline data, or paired tumor/normal samples. Its primary input is aligned reads (in .bam format), and its primary output is a report (in a .vcf file) giving the copy number status of the genome.

Canvas is used as the copy number caller in the Isaac Whole Genome Sequencing workflow in BaseSpace (https://basespace.illumina.com), and in HiSeq Analysis Software (HAS) (http://support.illumina.com/sequencing/sequencing_software/hiseq-analysis-software.html).

Canvas is written in C# and runs under a recent version of Mono (e.g. 3.10.0), .NET 4.5.1, or .NET Core 1.1.

For more information about Canvas and the algorithms it uses see the software design description.

Note: Germline-WGS mode has been deprecated. Use SmallPedigree-WGS even for a single sample analysis.

Publications

Canvas was first described in the publication Canvas: versatile and scalable detection of copy number variants in the journal OUP Bioinformatics:

Publication: http://dx.doi.org/10.1093/bioinformatics/btw163 Preprint: https://doi.org/10.1101/036194

The Canvas Small Pedigree Workflow was published in the journal OUP Bioinformatics: Canvas SPW: calling de novo copy number variants in pedigrees

Publication: https://doi.org/10.1093/bioinformatics/btx618 Preprint: https://doi.org/10.1101/121939

License

Copyright (c) 2013-2017 Illumina, Inc. All rights reserved.

This software is provided under the terms and conditions of the GNU GENERAL PUBLIC LICENSE Version 3

You should have received a copy of the GNU GENERAL PUBLIC LICENSE Version 3 along with this program. If not, see https://github.com/illumina/licenses/.

Canvas includes several third-party packages provided under other open source licenses; please see COPYRIGHT.txt for additional details.

Build instructions

Binaries:

It is recommended to start from one of the binary distributions on the Canvas releases page if a suitable version is available. Executables can be run under either .NET Core or Mono. The .NET Core environment is recommended because it provides higher speed and lower RAM usage.

Source code organization:

Canvas consists of several projects, all built from one solution file (Src/Canvas/Canvas/Canvas.sln). The main Canvas project is a command line tool for launching the various workflows. Additionally, there are projects for each Canvas module (e.g. CanvasBin counts coverage for each bin, CanvasSomaticCaller makes CNV calls for tumor/normal data) as well as some shared libraries with utility functions (math functions, file I/O for various formats, etc.).

Linux

Canvas was tested under Linux using .NET Core 2.1.

See https://github.com/Illumina/canvas/issues/99

Windows

Canvas is known to run on Windows 7 or Windows 8 systems using .NET Core 2.0.

Run instructions

Canvas can be run on a variety of sequencing inputs. See the help information from the Canvas.exe command line executable for the supported workflows and required input files:

$ Canvas.exe --help    (or Canvas.dll --help under .NET Core)
Canvas Canvas 1.25.0.49+master Copyright © Illumina 2017-03-23

Usage: Canvas.exe [MODE] [OPTIONS]+

Available modes:

Options:
-h, --help show this message and exit
-v, --version print version and exit

Reference genome

The required input files for Human reference genome builds GRCh37, hg19, and GRCh38 can be downloaded from S3 http://canvas-cnv-public.s3.amazonaws.com/. You can use wget to download any of the files listed there. For example, to download the hg19 GenomeSize.xml file run:

wget http://canvas-cnv-public.s3.amazonaws.com/hg19/WholeGenomeFasta/GenomeSize.xml

When using a custom reference genome the equivalent files need to be created. Use the FlagUniqueKmers project to generate the annotated fasta file (kmer.fa) for a custom reference genome.

Installation

The easiest way to install Canvas is to use the latest pre-compiled binaries from releases: https://github.com/Illumina/canvas/releases (just download and uncompress).

.NET Core

See https://github.com/Illumina/canvas/issues/99

DEMO (SmallPedigree-WGS workflow)

This example shows how to run Canvas SPW (Small Pedigree Workflow) on a simulated trio (bam files of 60x coverage) and then use EvaluateCNV (under Tools) to estimate performance metrics. The demo works with Canvas release v1.25 and above. An Amazon AWS m4.4xlarge instance was used to create it. The recommended amount of RAM per core is 4 GB. More information on input options and output formats can be found on the Canvas wiki and in the software design document.
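The RAM-per-core recommendation can be checked before starting a run. The following is a minimal sketch, not part of Canvas: `ram_per_core_gb` is a hypothetical helper, and the worked figures (64 GB, 16 CPUs) are the ones quoted for the m4.4xlarge instance in the run notes of this demo. The `/proc/meminfo` lookup is Linux-specific.

```shell
#!/bin/sh
# Hypothetical helper (not part of Canvas): print RAM per core in GB.
# $1: total memory in kB, $2: number of cores
ram_per_core_gb() {
    awk -v m="$1" -v c="$2" 'BEGIN { printf "%.1f\n", m / 1024 / 1024 / c }'
}

# Worked example: 64 GB of RAM shared by 16 cores.
ram_per_core_gb $((64 * 1024 * 1024)) 16    # prints 4.0

# On a Linux host, read the real values from the system if available.
if [ -r /proc/meminfo ] && command -v nproc >/dev/null 2>&1; then
    mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    ram_per_core_gb "$mem_kb" "$(nproc)"
fi
```

A result below 4.0 suggests the host has less memory per core than the demo setup.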

Data and binaries

  1. Install .NET Core and download the Canvas binary (CanvasDIR)
  2. Add BaseSpace project https://basespace.illumina.com/s/f1ganFhSPsBo with simulation bams to your account (you might need to register first).
  3. Install BaseMount and load the canvas-spw project
    sudo bash -c "$(curl -L https://basemount.basespace.illumina.com/install/)"
    mkdir /tmp/BaseSpace
    basemount --scopes="Create Global, Browse Global, Create Projects, Read Global" /tmp/BaseSpace
    cd /tmp/BaseSpace
  4. This should show the following folders under canvas-spw/AppResults
    - bams = simulated trio bams of 60x coverage aligned with Isaac
    - canvasdata = hg19 genome reference files for running Canvas (can also be downloaded from S3 http://canvas-cnv-public.s3.amazonaws.com/) 
    - snvvcf = SNV vcf files to accompany bams (joint germline SNV calls using Strelka2 https://github.com/Illumina/strelka)
    - simdata = bed files with simulated inherited and de novo variants

Running Canvas

  1. In this example we access files through basemount (Canvas should be run as a regular user rather than as root via sudo). Files can also be copied to a local drive and run from there.
  2. Issue the following command (output directory - /tmp/gHapMixDemo)
    dotnet /CanvasDIR/Canvas.dll SmallPedigree-WGS \
        --bam=/basespace/Projects/canvas/AppResults/bams/Files/father.bam \
        --bam=/basespace/Projects/canvas/AppResults/bams/Files/mother.bam \
        --bam=/basespace/Projects/canvas/AppResults/bams/Files/child1.bam \
        --mother=mother --father=father --proband=child1 \
        -r /basespace/Projects/canvas/AppResults/canvasdata/Files/kmer.fa \
        -g /basespace/Projects/canvas/AppResults/canvasdata/Files/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta \
        --sample-b-allele-vcf /basespace/Projects/canvas/AppResults/snvvcf/Files/Pedigree.vcf.gz \
        -f /basespace/Projects/canvas/AppResults/canvasdata/Files/filter13.bed \
        -o /tmp/gHapMixDemo \
        --ploidy-vcf="/basespace/Projects/canvas/AppResults/snvvcf/Files/MultiSamplePloidy.vcf"
  3. The runtime depends on the number of available CPUs and on whether the bam files were copied to a local drive. The run on a bare Amazon m4.4xlarge instance (16 CPUs and 64 GB RAM) with network I/O took 3 h 34 min. Results are available as VCF files: either a multi-sample VCF under gHapMixDemo or single-sample equivalents under gHapMixDemo/TempCNV folders. Below we use the EvaluateCNV tool supplied with the Canvas distribution to calculate various performance metrics for inherited and de novo CNVs.
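The long command in step 2 is easier to review when it is assembled piece by piece. The sketch below only prints the command and does not run Canvas. `build_spw_command` and the variable names are illustrative and not part of Canvas; every flag and path is copied from the demo command above.

```shell
#!/bin/sh
# Sketch only: print the SmallPedigree-WGS demo command instead of running it.
# Variable and function names are hypothetical; flags and paths come from step 2.
CANVAS_DLL=/CanvasDIR/Canvas.dll
APP=/basespace/Projects/canvas/AppResults
OUT=/tmp/gHapMixDemo

build_spw_command() {
    cmd="dotnet $CANVAS_DLL SmallPedigree-WGS"
    # One --bam argument per family member.
    for sample in father mother child1; do
        cmd="$cmd --bam=$APP/bams/Files/$sample.bam"
    done
    # Pedigree roles, as given in the demo command.
    cmd="$cmd --mother=mother --father=father --proband=child1"
    # Reference, SNV, filter, output and ploidy arguments.
    cmd="$cmd -r $APP/canvasdata/Files/kmer.fa"
    cmd="$cmd -g $APP/canvasdata/Files/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta"
    cmd="$cmd --sample-b-allele-vcf $APP/snvvcf/Files/Pedigree.vcf.gz"
    cmd="$cmd -f $APP/canvasdata/Files/filter13.bed"
    cmd="$cmd -o $OUT"
    cmd="$cmd --ploidy-vcf=$APP/snvvcf/Files/MultiSamplePloidy.vcf"
    printf '%s\n' "$cmd"
}

build_spw_command
```

Changing `APP` is then enough to point the same command at files copied to a local drive.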

Running EvaluateCNV

  1. First, we can run EvaluateCNV to produce recall and precision metrics for inherited Canvas CNV calls using truth variant files.
    
    # Remove REF calls
    zcat /tmp/gHapMixDemo/TempCNV_child1/CNV.vcf.gz | grep -v ":REF:" > /tmp/gHapMixDemo/TempCNV_child1/CNV.vcf
    dotnet /CanvasDIR/Tools/EvaluateCNV/EvaluateCNV.dll /ihart/BaseSpace/Projects/CanvasSPW/AppResults/simdata/Files/child1_truth.bed /tmp/gHapMixDemo/TempCNV_child1/CNV.vcf /CanvasDIR/Tools/EvaluateCNV/generic.cnaqc.excluded_regions.bed inheritedCNVs.txt

This gives us for PASS variants: Recall 97.46 Precision 93.85


  2. Next, we run a similar command, this time using the de novo variant truth file and a -q 20 argument to extract variants with DQ20.

    dotnet /CanvasDIR/Tools/EvaluateCNV/EvaluateCNV.dll /ihart/BaseSpace/Projects/CanvasSPW/AppResults/simdata/Files/child1_truth.bed /tmp/gHapMixDemo/TempCNV_child1/CNV.vcf.gz /CanvasDIR/Tools/EvaluateCNV/generic.cnaqc.excluded_regions.bed -q 20 denovoCNVs.txt

This gives us for PASS variants: Recall 97.98 Precision 96.51
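The REF-call filter from step 1 can be tried on a toy file to see what it keeps. This is a minimal sketch: `filter_ref_calls` is a hypothetical wrapper around the same `grep -v ":REF:"` pipeline, and the three records are made up for illustration rather than taken from real Canvas output. Only the `:REF:` substring matters to the filter. `gzip -dc` is used in place of `zcat` for portability.

```shell
#!/bin/sh
# Hypothetical wrapper around the filter used in step 1.
# $1: gzip-compressed VCF; non-REF lines are written to stdout.
filter_ref_calls() {
    gzip -dc "$1" | grep -v ":REF:"
}

# Build a tiny gzipped VCF with made-up records (illustration only).
toy_dir=$(mktemp -d)
toy="$toy_dir/toy.vcf.gz"
{
    printf '##fileformat=VCFv4.1\n'
    printf 'chr1\t1000\tCanvas:REF:chr1:1001-2000\tN\t.\t30\tPASS\t.\n'
    printf 'chr1\t2000\tCanvas:GAIN:chr1:2001-3000\tN\t<CNV>\t30\tPASS\t.\n'
    printf 'chr1\t3000\tCanvas:LOSS:chr1:3001-4000\tN\t<CNV>\t30\tPASS\t.\n'
} | gzip -c > "$toy"

filter_ref_calls "$toy"                       # header, GAIN and LOSS lines remain
filter_ref_calls "$toy" | grep -c "Canvas:"   # prints 2
```

Header lines pass through unchanged because they do not contain `:REF:`.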


## DEMO (Tumor-normal-enrichment workflow)
This demo runs Canvas on exome data for the HCC2218 breast carcinoma cell line and compares the results with a previously curated ground truth set. The demo assumes the Mono runtime and that the binary files were installed to WORKDIR/canvas/canvas-1.3.4_x64/.

#### Data 
To download demo data, add BaseSpace project https://basespace.illumina.com/s/DcPnOqHmtPNB to your account (you might need to register first). The actual files can then be downloaded from the following subdirectories:
https://basespace.illumina.com/analyses/30697313/files/28317292?projectId=26760736
https://basespace.illumina.com/analyses/30697313/files/28296383?projectId=26760736
As an alternative to manual download, the command line tool basemount (https://basemount.basespace.illumina.com) can be used for file transfer. To install and start basemount, run:

    sudo bash -c "$(curl -L https://basemount.basespace.illumina.com/install/)"
    mkdir /tmp/BaseSpace
    basemount /tmp/BaseSpace
    cd /tmp/BaseSpace

BaseSpace files are now available under your current directory. To run the demo, transfer the following files into WORKDIR/testing/files/:

- "Projects/HiSeq 2500 RR: NRC Exome (HCC1187 & HCC2218)/AppResults/HCC1187BL/Files/HCC1187BL_S1.vcf" (germline vcf)
- "Projects/HiSeq 2500 RR: NRC Exome (HCC1187 & HCC2218)/AppResults/HCC2218C/Files/HCC2218C_S1.bam" (somatic bam)
- "Projects/HiSeq 2500 RR: NRC Exome (HCC1187 & HCC2218)/AppResults/HCC2218C/Files/HCC2218C_S1.bam.bai"
- "Projects/HiSeq 2500 RR: NRC Exome (HCC1187 & HCC2218)/AppResults/HCC2218BL/Files/HCC2218BL_S1.bam" (normal bam)
- "Projects/HiSeq 2500 RR: NRC Exome (HCC1187 & HCC2218)/AppResults/HCC2218BL/Files/HCC2218BL_S1.bam.bai"
- "Projects/HiSeq 2500 RR: NRC\ Exome\ (HCC1187 & HCC2218)/AppSessions/Isaac Enrichment 11|24|2015 9:23:23/AppResults.28295376.HCC1187BL/Files/Additional Files/NexteraRapidCapture_Exome_TargetedRegions_v1.2Used.txt" (targeted regions)

#### Genome reference files  
Download hg19 genome reference files from S3 (http://canvas-cnv-public.s3.amazonaws.com/) into WORKDIR/testing/hg19/.

#### Running demo
With all files copied and installed, we are now ready to run Canvas. This demo uses the Tumor-normal-enrichment workflow, which runs on Nextera exome data. Execute the command below.

    dotnet Canvas.exe Tumor-normal-enrichment \
        -b $WORKDIR/testing/files/HCC2218C_S1.bam \
        --normal-bam=$WORKDIR/testing/files/HCC2218BL_S1.bam \
        --reference=$WORKDIR/testing/hg19/kmer.fa \
        --manifest=$WORKDIR/testing/files/NexteraRapidCapture_Exome_TargetedRegions_v1.2Used.txt \
        -g $WORKDIR/testing/hg19/ \
        -n HCC2218C \
        -f $WORKDIR/testing/hg19/filter13.bed \
        -o $WORKDIR/testing/HCC2218_v2 \
        --b-allele-vcf=$WORKDIR/testing/files/HCC2218BL_S1.vcf \
        --custom-parameters=CanvasBin,-m=TruncatedDynamicRange

CNV.vcf.gz files will be saved to the HCC2218_v2 output directory. Depending on the number of available CPUs, the demo takes from a few minutes to under an hour to complete.

#### Inspecting results 
Now we can test Canvas performance using a set of previously curated HCC2218 copy number calls from whole-genome data (HCC2218Truth.vcf) and a set of repetitive or ambiguous regions (HCC2218.cnaqc.excluded_regions.bed), both available in the TruthSets directory in S3 http://canvas-cnv-public.s3.amazonaws.com/. The evaluation is done with EvaluateCNV; the latest binary distribution of the tool can be found in [releases](https://github.com/Illumina/canvas/releases).

EvaluateCNV usage info:

EvaluateCNV $TruthSetPath $CNV.vcf $ExcludedRegionsBed $OutputPath [$RegionOfInterestBed]

In our case, with EvaluateCNV installed in WORKDIR/tools/EvaluateCNV and the truth files located in WORKDIR/TruthSets, the command is:

mono $WORKDIR/tools/EvaluateCNV/EvaluateCNV.exe $WORKDIR/TruthSets/HCC2218Truth.vcf $WORKDIR/testing/HCC2218/CNV.vcf.gz $WORKDIR/TruthSets/HCC2218.cnaqc.excluded_regions.bed $WORKDIR/testing/HCC2218/EvaluateCNV.txt

This will save evaluation data into $WORKDIR/testing/HCC2218/EvaluateCNV.txt.
Inspecting it suggests that Canvas performed quite well in calling somatic CNVs in HCC2218. Below is an extract from the file (results obtained using Canvas 1.3.4 with the command line shown above; other versions and main/custom parameters might alter performance metrics):

    Accuracy 92.0255
    DirectionAccuracy 93.1368
    Recall 88.0894
    DirectionRecall 92.0237
    Precision 81.3032
    DirectionPrecision 84.9345
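When comparing several runs, it can be handy to pull a single metric out of the report. The sketch below assumes, without confirmation from the EvaluateCNV documentation, that the report holds one whitespace-separated name and value pair per line, as in the extract above. `get_metric` is a hypothetical helper, and the sample report is rebuilt from the quoted figures.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one metric from a report file.
# Assumes one "Name value" pair per line (an assumption, see lead-in).
# $1: report file, $2: exact metric name
get_metric() {
    awk -v name="$2" '$1 == name { print $2; exit }' "$1"
}

# Sample report rebuilt from the extract quoted above.
report=$(mktemp)
cat > "$report" <<'EOF'
Accuracy 92.0255
DirectionAccuracy 93.1368
Recall 88.0894
DirectionRecall 92.0237
Precision 81.3032
DirectionPrecision 84.9345
EOF

get_metric "$report" Recall       # prints 88.0894
get_metric "$report" Precision    # prints 81.3032
```

Matching on the whole first field keeps `Recall` from also matching `DirectionRecall`.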