bioinformatics-centre / BayesTyper

A method for variant graph genotyping based on exact alignment of k-mers
87 stars 7 forks source link
bayesian-inference dna-sequencing genotyping gibbs-sampling indels k-mers ngs snvs structural-variation variant-calling variant-graphs

BayesTyper

BayesTyper performs genotyping of all types of variation (including SNPs, indels and complex structural variants) based on an input set of variants and read k-mer counts. Internally, BayesTyper uses exact alignment of k-mers to a graph representation of the input variants and reference sequence in combination with a probabilistic model of k-mer counts to do genotyping.

Latest News

Why should I use BayesTyper?

Short explanation

Because it allows you to obtain accurate genotypes spanning from SNVs/short indels to complex structural variants and hence provides a more complete picture of the genome as compared with standard methods - without sacrificing accuracy.

Detailed explanation

Standard methods for genotyping (e.g. GATK-HaplotypeCaller, Platypus and Freebayes) start from an alignment of reads (e.g. by BWA-MEM) and then

  1. Call candidate variants.
  2. Perform local realignment of reads anchored to a particular variable region to candidate variant haplotypes.
  3. Estimate genotypes (and thus final variant calls).

This approach can result in a bias towards the reference sequence since reads informative for a particular variant may have been left either unaligned (because of too large an edit distance to the reference) or have aligned better elsewhere in the reference.

The variant graph approach used by BayesTyper ensures that the resulting calls are not biased towards the reference sequence by effectively realigning all reads (or more specifically their k-mers) when genotyping candidate variants. In our recent paper, we show how this approach significantly improves both sensitivity and genotyping accuracy for most variant types - especially non-SNVs (please see citation below).

Installation

  1. Download the latest static Linux x86_64 build (k=55) found under releases.

    • To build from source, please refer to the build wiki for detailed build instructions. Note that the k-mer size is determined compile time. Hence, if you want k ≠ 55 you need to compile it yourself (or post a feature request and we will see what we can do).
  2. Download the BayesTyper human data bundle (GRCh37 and GRCh38) containing reference sequences preprocessed for BayesTyper (i.e. canonical and decoy chromosomes) together with a reference matched variation prior database.

Usage

The BayesTyper genotyping process occurs in two stages:

  1. Generation of variant candidates (using other tools)
  2. Genotyping based on variant candidates (using BayesTyper)

As indicated, BayesTyper does not find candidate variants on its own. Instead, users can combine the variant discovery strategies suitable for their study as it will depend on the study design (e.g. coverage, number of samples etc.) as well as the available resources.

Below we outline an example strategy, where candiates are obtained using

The complete workflow (i.e. BAM(s) to genotypes) outlined below is further provided as a snakemake workflow for easy deployment of BayesTyper. Please refer to the snakemake wiki for detailed instructions on how to set up and execute the workflow on your data.

Important: Please note that it is currently only possible to genotype 30 samples at the time using BayesTyper. To run more samples, please execute BayesTyper in batches as described in the batching wiki. Batching is currently not supported by the snakemake workflow - please let us know if you require this feature by filing a feature request.

Important: Bayestyper supports uncompressed and gzip compressed vcf files. Please note that bgzip compression is currently not supported.

Important: If you intend to genotype other organisms than human, please refer to the other organism wiki for more information.

1. Generation of variant candidates

Starting from a set of indexed, aligned reads (obtained e.g. using BWA-MEM):

  1. For each sample, run HaplotypeCaller to get standard mapping-based candidates

    • Running Base Quality Recalibration or doing joint genotyping is not required
    • Faster alternative: Freebayes
  2. For each sample, run Platypus to identify small and medium sized variants

  3. For each sample, run Manta to identify candidates by de novo local assembly (important for detecting larger deletions and insertions). Convert allele IDs (e.g. \<DEL>) in the Manta output to sequences using bayesTyperTools convertAllele

  4. For each caller, left-align and normalize variants using bcftools norm (bcftools)

  5. Combine variants across all samples, callers and the variation prior using bayesTyperTools combine -v GATK:<gatk_sample1>.vcf,GATK:<gatk_sample2>.vcf,PLATYPUS:<platypus_sample1>.vcf,PLATYPUS:<platypus_sample2>.vcf,MANTA:<manta_sample1>.vcf,...,prior:<prior>.vcf -o <candiate_variants_prefix> -z

    • Note: The source tag before the colon (e.g. GATK) only serves to identify the origin of the calls in the final callset - it can take any value.
    • Important: bayesTyperTools combine requires the vcf header to contain contig entries (e.g.##contig=<ID=8,length=146364022>) for all reference sequences containing variants in the vcf; the contigs further need to appear in the same order in the header and for the variant entries.

2. Genotyping based on variant candidates

  1. Count k-mers

    1. Run KMC3 on each sample (set k=55 using -k55 and include singleton k-mers using -ci1)
      • Note: Default is fq(.gz) input - bam input is enabled using -fbam.
      • Important: If the reads are in bam format, make sure the file also contains the unmapped reads.
    2. Create a read k-mer bloom filter for each sample from the KMC3 output using bayesTyperTools makeBloom -k <kmc_output_prefix> -p <num_threads>
      • Important: The resulting bloom filter (.bloomMeta and .bloomData) and the KMC3 output (.kmc_pre and .kmc_suf) must reside in the same directory and have the same prefix
  2. Identify variant clusters: bayesTyper cluster -v <candiate_variants_prefix>.vcf.gz -s <samples>.tsv -g <ref_build>_canon.fa -d <ref_build>_decoy.fa -p <num_threads>

    • Note: This partitions the candidate variants into units written to separate directories (bayestyper_unit_1, bayestyper_unit_2, ...) - each containing between 5M and 10M variants. These can be genotyped independently e.g. on a cluster (supported by the snakemake workflow) followed by a simple concatenation operation using bcftools concat (see below).
    • Important: The <samples>.tsv file should contain one sample per row with columns , \ and and no header (example).
    • Important: Data that is common to all units and necessary for genotyping is written to the bayestyper_cluster_data directory.
    • Note: Human reference files (canon/decoy) are provided in the BayesTyper data bundle (see Installation).
  3. Genotype variant clusters: bayesTyper genotype -v bayestyper_unit_<unit_id>/variant_clusters.bin -c bayestyper_cluster_data -s <samples>.tsv -g <ref_build>_canon.fa -d <ref_build>_decoy.fa -o bayestyper_unit_<unit_id>/bayestyper -z -p <num_threads>

    • Note: The genotype command also applies the default BayesTyper hard-filters by setting the variant FILTER status and the sample specific allele filter (SAF) format attribute. Please refer to the filter wiki for information about the filters used, how to changes the defaults up front and how to refilter the data after running the genotyping step.
    • Important: The filtering procedure only filters the genotypes ("./.") and hence, downstream tools should prefilter on the variant quality and FILTER==\"PASS\" (e.g. using bcftools) to obtain the filtered calls.
  4. Concatenate units using bcftools: bcftools concat -O z -o <output_prefix>.vcf.gz bayestyper_unit_1/bayestyper.vcf.gz bayestyper_unit_2/bayestyper.vcf.gz ...

    • Important: Unit arguments to bcftools should be in ascending order (unit_1 unit_2 ...) for the output to be properly sorted

Computational requirements

Number of samples Coverage Number of variant alleles Max allele length (nts) Number of threads Wall time (h, single node) Wall time (h, multiple nodes)* Max memory (GB)
3 13x 21.4M 500,000 28 5-6 2-3 41
3 13x 64.4M 500,000 28 17-18 4-5 42
13 50x 11.7M 10,000 28 31-32 16-17 66
13 50x 61.1M 10,000 28 92-93 15-16 62

* bayesTyper genotype can be distributed across nodes on a cluster - between 2 and 11 nodes were used in this benchmark.

The time estimates are for running bayesTyper cluster and bayesTyper genotype only. Expect <1h combined run-time per sample for counting k-mers by KMC3 and bloom filter creation by bayesTyperTools. All runs were done on a 64-bit Intel Xeon 2.00 GHz machine with 128 GB of memory using v1.3.

Citing BayesTyper

Sibbesen JA, Maretty L, The Danish Pan-Genome Consortium & Krogh A: Accurate genotyping accross variant classes and lengths using variant graphs. Nature Genetics, 2018. link. *Equal contributors.

Studies that have used BayesTyper

Please let us know if you use BayesTyper in your publication - then we will put it on the list.

Contact

Please post an issue if you have questions regarding how to run BayesTyper, if you want to report bugs or request new features. You can also reach us at jasi at binf dot ku dot dk or lasse dot maretty at clin dot au dot dk.

Third-party software acknowledgements

We thank the developers of the third-party libraries used by BayesTyper: