chapmanb / bcbio.variation

Toolkit to analyze genomic variation data, built on the GATK with Clojure
66 stars 15 forks source link

bcbio.variation

A toolkit to analyze genome variation data, built on top of the Genome Analysis Toolkit (GATK) with Clojure. It supports scoring for the Archon Genomics X PRIZE competition and is also a general framework for variant file comparison. It enables validation of variants and exploration of algorithm differences between calling methods by automating the process involved with comparing two sets of variants. For users, this integrates with the bcbio-nextgen framework to automate variant calling and validation. For developers, bcbio.variation provides command line tools and an API to clean and normalize variant data in VCF format avoiding comparison artifacts associated with different variant representations.

Build Status

Obtaining

Download

The latest release is 0.2.6 (17 July 2015): bcbio.variation-0.2.6-standalone.jar. Run from the command line:

$ java -jar bcbio.variation-VERSION-standalone.jar [arguments]

The jar contains a full GATK commandline with additional walkers, as well as custom command line programs. The usage section below contains examples of using the library for variant comparison, normalization and ensemble calling. Note that bcbio.variation requires Java 1.7 since the underlying GATK libraries are not compatible with earlier versions.

As a library

To use as a library from Leiningen or Maven, follow the instructions on the clojars page.

Latest source

The latest version is available directly from GitHub and requires Java 1.6 or better and Leiningen (version 2). Lein will automatically pull in all required dependencies.

Usage

Generate summary of concordance between variant calls

To compare two GATK compatible VCF files in a specific region:

$ java bcbio.variation.jar variant-utils comparetwo eval.vcf refcall.vcf ref.fa regions.bed

You can also tune many parameters with a YAML configuration file specifies the variant files for comparison. The project contains example configuration and associated variant files that demonstrate the features of the library and the configuration below has a description of available options.

An example of scoring a phased diploid genome against a haploid reference genome:

$ java bcbio.variation.jar variant-compare config/reference-grading.yaml

An example of assessing variant calls produced by different calling algorithms:

$ java bcbio.variation.jar variant-compare config/method-comparison.yaml

Normalize a variant file

A tricky part of variant comparisons is that VCF format is flexible enough to allow multiple representations. As a result two files may contain the same variants, but one might have it present in a multi-nucleotide polymorphism while another represents it as an individual variant.

To produce a stable, decomposed variant file for comparison run:

$ java bcbio.variation.jar variant-prep your_variants.vcf your_reference.fasta

This will also handle re-ordering variants to match the reference file ordering, essential for feeding into tools like GATK, and remapping hg19 to GRCh37 chromosome names. To additionally filter outputs by indel size, pass an argument specifying the maximum indel size to include: --max-indel 30. To retain reference (0/0) and no calls in the prepped file, use --keep-ref.

Ensemble variant calls

bcbio.variation contains approaches to merge variant calls from multiple approaches. It normalizes all input VCFs, prepares a combined variant file with variants from all approaches, and then filter the combined file to produce a final set of calls.

To combine multiple variant calls into a single combined ensemble callset:

$ java bcbio.variation.jar variant-ensemble params.yaml reference.fa
                           out.vcf in1.vcf in2.vcf in3.vcf

Web interface

The o8 visualization framework provides a web interface to this toolkit, providing interaction with Galaxy and GenomeSpace, visualization of biological metrics associated with variants, reactive filtering and automated scoring.

Run GATK walker for variant statistics

$ lein uberjar
$ java -jar target/bcbio.variation-RELEASE-standalone.jar -T VcfSimpleStatsWalker
  -R test/data/GRCh37.fa --variant test/data/gatk-calls.vcf --out test.png

Run custom GATK annotator

$ lein uberjar
$ java -jar target/bcbio.variation-RELEASE-standalone.jar -T VariantAnnotator
   -A MeanNeighboringBaseQuality -R test/data/GRCh37.fa -I test/data/aligned-reads.bam
   --variant test/data/gatk-calls.vcf -o annotated-file.vcf

Configuration file

A YAML configuration file defines targets for comparison processing. Two example files for reference grading and comparison of calling methods provide example starting points and details on available options are below:

dir:
  base: Base directory to allow use of relative paths (optional).
  out: Working directory to write output.
  prep: Prep directory where files will be pre-processed.
experiments: # one or more experiments
 - sample: Name of current sample.
   ref: Reference genome in FASTA format.
   intervals: Intervals to process in BED format (optional).
   align: Alignments for all calls in BAM format (optional).
   summary-level: Amount of summary information to provide,
                  [full,quick] (default:full)
   params: # Processing parameters associated with this experiment
     max-indel: Maximum indel size to include in non-structural variant
                comparisons (default: 5000)
     multiple-thresh: Threshold for percentage of comparisons to match
                      to consider two multiple sample variants the same.
                      (default: 1.0)
     compare-approach: Provide alternative approaches to compare variants:
                       approximate -- allow flexible matching of het/hom variants
                                      and indels
     recall-approach: Method to use for recalling variants:
                      consensus -- Most common variant from multiple inputs
   approach: Type of comparison to do [compare,grade]. Default compare.
   calls: # two or more calls to compare
     - name: Name of call type
       file: One or more input files in VCF format
       align: Alignment for specific call in BAM format (optional).
       ref: Reference genome if different than experiment ref (optional)
       intervals: Genome intervals to process in BED format (optional).
       metadata: Dictionary of annotations associated with the call set.
                 Finalizers use these to provide annotation specific
                 filtering of calls.
       filters: Provide hard filtering of variants prior to comparison with 
                specified JEXL GATK expressions.
       format-filters: Provide hard filtering of variants based on
                       attributes in the genotype FORMAT field.
       recall: Recall variant positions after merging multiple input calls. 
               Can tune with `recall-approach` in `params` (boolean; default false)
       annotate: Annotate calls with GATK annotations (boolean; default false).
       normalize: Normalize MNPs and indels (boolean: default true).
       prep: Prep with in-order chromosomes and sample names (boolean; default false).
       prep-sort-pos: Sort by position during prep. Required if variants are
                      not coordinate sorted within chromosomes. (boolean; default false).
       fix-sample-header: Adjust VCF sample header names to match sample
                          specified in `sample` (boolean; default false)
       prep-sv-genotype: Normalize structural variant genotypes to a single
                         ref call (boolean; default false).
       prep-allele-count: Number of alleles to convert calls to during
                          prep work (default 2)
       preclean: Remove problematic characters from input VCFs
                 (boolean; default false). 
       remove-refcalls: Remove reference, non-variant calls.
                        (boolean; default false). 
       make-haploid: Convert a set of diploid calls to haploid variants
                    (boolean; default false)

Finalizer configuration

In addition to the pairwise comparisons, the configuration allows specification of additional filtration and all-by-all comparisons based on the pairwise results. Like calls, specify these under an experiment with the finalize tag. Available methods are:

The options for filtering are:

Utilities

This library also contains useful command line utilities to help with variant preparation and analysis:

Contributors

License

The code is freely available under the MIT license.