BayesTyper performs genotyping of all types of variation (including SNPs, indels and complex structural variants) based on an input set of variants and read k-mer counts. Internally, BayesTyper uses exact alignment of k-mers to a graph representation of the input variants and reference sequence in combination with a probabilistic model of k-mer counts to do genotyping.
- 1 April 2019: New release (v1.5) featuring:
  - A new mode (`--noise-genotyping`) where noise parameters and genotypes are estimated jointly instead of sequentially. This allows uncertainty in the noise estimates to be propagated directly into the genotype posteriors. For larger genomes the noise estimates are generally fairly stable; for smaller genomes with few variants, however, this is often not the case. In this mode all variants, including nested ones, are used for noise estimation. Note that this mode will in most cases be slower and require more memory than the default.
  - Genotype quality in the `bayesTyper genotype` output. The quality is calculated from the maximum genotype posterior probability (GPP) and is Phred-scaled.
  - Removed the `--min-homozygote-genotypes` filter from `bayesTyper genotype`. Due to several improvements to BayesTyper over the last couple of releases, this filter is not as important as it used to be. Note that it is still possible to apply the filter using `bayesTyperTools filter`.
  - `--max-number-of-sample-haplotypes`: its default value has been increased to 32. A higher value has been shown to give better results when genotyping a small number of samples. Note that this increase might result in longer computation time, especially for more complex variant clusters.
  - Updated the noise rate prior (`--noise-rate-prior`) to better reflect the expected Illumina error rate.
  - `bayesTyperTools convertAllele`: the sequences stored in the variant attributes SEQ or SVINSSEQ are now used as the inserted sequence for \<INS> alleles. In addition, a fasta file containing the inserted sequences can be given with >"name" matching \<"name">. Furthermore, support has been added for partial insertions (Manta output) where the center and length are unknown.
  - Removed `addMaxGenotypePosterior` since it is no longer relevant now that genotype qualities are calculated during genotyping.
  - Added a `filterAlleleCallsetOrigin` script that can filter alleles based on their origin (ACO).
- 28 January 2019: Patch (v1.4.1)
  - Fixed `bayesTyper genotype` occasionally crashing on larger datasets (see release notes).
- 18 October 2018: New release (v1.4) featuring:
  - Support for specifying ploidy using `--chromosome-ploidy-file` in `bayesTyper genotype`. Ploidy levels 0, 1 (haploid) and 2 (diploid) are supported. Human ploidy levels are assumed if no file is given (see wiki for more details).

BayesTyper allows you to obtain accurate genotypes spanning from SNVs and short indels to complex structural variants. It therefore provides a more complete picture of the genome than standard methods, without sacrificing accuracy.
Standard methods for genotyping (e.g. GATK-HaplotypeCaller, Platypus and Freebayes) start from an alignment of reads (e.g. by BWA-MEM) and then discover and genotype variants from the aligned reads.
This approach can result in a bias towards the reference sequence since reads informative for a particular variant may have been left either unaligned (because of too large an edit distance to the reference) or have aligned better elsewhere in the reference.
The variant graph approach used by BayesTyper ensures that the resulting calls are not biased towards the reference sequence by effectively realigning all reads (or more specifically their k-mers) when genotyping candidate variants. In our recent paper, we show how this approach significantly improves both sensitivity and genotyping accuracy for most variant types - especially non-SNVs (please see citation below).
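To make the k-mer idea concrete, here is a toy Python sketch. It is not BayesTyper's algorithm: BayesTyper matches k-mers against a variant graph and uses a probabilistic model of k-mer counts, whereas this sketch only counts read k-mers that are unique to each of two candidate haplotype sequences. The function names, the sequences and `k=4` are illustrative assumptions, and reverse complements are ignored.

```python
from collections import Counter


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def allele_kmer_support(haplotypes, reads, k):
    """Count read k-mers that occur in exactly one candidate haplotype.

    haplotypes: dict mapping an allele name to its sequence (with flanks).
    reads: iterable of read sequences; no alignment is performed.
    """
    read_counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    kmer_sets = {name: set(kmers(seq, k)) for name, seq in haplotypes.items()}
    support = {}
    for name, own in kmer_sets.items():
        # k-mers shared with any other haplotype carry no information
        shared = set().union(
            *(s for other, s in kmer_sets.items() if other != name)
        )
        support[name] = sum(read_counts[kmer] for kmer in own - shared)
    return support


# Hypothetical reference and alternative haplotypes differing by one SNV
haplotypes = {"ref": "AACCGGTTAC", "alt": "AACCGATTAC"}
reads = ["CCGGTT", "CCGATT", "CGATTA"]
print(allele_kmer_support(haplotypes, reads, k=4))  # {'ref': 3, 'alt': 6}
```

Because the reads are never placed on the reference, a read carrying the alternative allele contributes its k-mers regardless of how well it would have aligned.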
Download the latest static Linux x86_64 build (k=55) found under releases.
Download the BayesTyper human data bundle (GRCh37 and GRCh38) containing reference sequences preprocessed for BayesTyper (i.e. canonical and decoy chromosomes) together with a reference matched variation prior database.
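The BayesTyper genotyping process occurs in two stages: candidate variants are first identified using one or more variant discovery methods, and the candidates are then genotyped using BayesTyper.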
As indicated, BayesTyper does not find candidate variants on its own. Instead, users can combine the variant discovery strategies best suited to their study; the choice depends on the study design (e.g. coverage and number of samples) as well as the available resources.
Below we outline an example strategy, where candidates are obtained using GATK HaplotypeCaller, Platypus and Manta.
The complete workflow (i.e. BAM(s) to genotypes) outlined below is further provided as a snakemake workflow for easy deployment of BayesTyper. Please refer to the snakemake wiki for detailed instructions on how to set up and execute the workflow on your data.
Important: Please note that it is currently only possible to genotype 30 samples at a time using BayesTyper. To run more samples, please execute BayesTyper in batches as described in the batching wiki. Batching is currently not supported by the snakemake workflow - please let us know if you require this feature by filing a feature request.
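As a minimal sketch of the first step of batching, the Python snippet below splits a sample list into groups of at most 30. It only illustrates the grouping; how each batch is then run and merged is described in the batching wiki and is not reproduced here. The function name and the sample IDs are hypothetical.

```python
def batch_samples(samples, max_batch_size=30):
    """Split a list of sample IDs into consecutive batches.

    max_batch_size defaults to 30, the per-run sample limit stated above.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        samples[start:start + max_batch_size]
        for start in range(0, len(samples), max_batch_size)
    ]


# Hypothetical cohort of 65 samples
samples = [f"sample_{i}" for i in range(1, 66)]
for index, batch in enumerate(batch_samples(samples), start=1):
    print(f"batch {index}: {len(batch)} samples")
```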
Important: BayesTyper supports uncompressed and gzip-compressed VCF files. Please note that bgzip compression is currently not supported.
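Gzip and bgzip files usually share the `.vcf.gz` extension, so the difference is easy to miss. A bgzip (BGZF) file is a gzip file whose header carries an extra `BC` subfield, which means it can be detected from the first bytes. The Python sketch below does this check; it is an independent helper, not part of BayesTyper, and the function names are assumptions.

```python
import struct


def classify_compression(data):
    """Classify leading file bytes as 'uncompressed', 'gzip' or 'bgzip'."""
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        return "uncompressed"  # no gzip magic number
    if not data[3] & 0x04 or len(data) < 12:
        return "gzip"  # FEXTRA flag not set, so no BGZF subfield
    (xlen,) = struct.unpack("<H", data[10:12])
    extra = data[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (66, 67, 2):  # 'B', 'C', 2-byte payload
            return "bgzip"
        pos += 4 + slen
    return "gzip"


def vcf_compression(path):
    """Classify a file on disk by reading only its header bytes."""
    with open(path, "rb") as handle:
        # 12 fixed header bytes plus the largest possible extra field
        return classify_compression(handle.read(12 + 65535))
```

A file reported as `bgzip` can be decompressed and recompressed with plain gzip before being passed on.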
Important: If you intend to genotype organisms other than human, please refer to the other organism wiki for more information.
Starting from a set of indexed, aligned reads (obtained e.g. using BWA-MEM):

1. For each sample, run HaplotypeCaller to get standard mapping-based candidates.
2. For each sample, run Platypus to identify small and medium-sized variants.
3. For each sample, run Manta to identify candidates by de novo local assembly (important for detecting larger deletions and insertions). Convert allele IDs (e.g. \<DEL>) in the Manta output to sequences using `bayesTyperTools convertAllele`.
4. For each caller, left-align and normalize variants using `bcftools norm` (bcftools).
5. Combine variants across all samples, callers and the variation prior using `bayesTyperTools combine -v GATK:<gatk_sample1>.vcf,GATK:<gatk_sample2>.vcf,PLATYPUS:<platypus_sample1>.vcf,PLATYPUS:<platypus_sample2>.vcf,MANTA:<manta_sample1>.vcf,...,prior:<prior>.vcf -o <candiate_variants_prefix> -z`
   - `bayesTyperTools combine` requires the VCF header to contain contig entries (e.g. `##contig=<ID=8,length=146364022>`) for all reference sequences containing variants in the VCF; the contigs further need to appear in the same order in the header and in the variant entries.
6. Count k-mers:
   - Count k-mers for each sample using KMC3 with k=55 (`-k55`) and include singleton k-mers using `-ci1`.
   - BAM files can be given as input using `-fbam`.
   - Create a bloom filter from the k-mer counts: `bayesTyperTools makeBloom -k <kmc_output_prefix> -p <num_threads>`
7. Identify variant clusters: `bayesTyper cluster -v <candiate_variants_prefix>.vcf.gz -s <samples>.tsv -g <ref_build>_canon.fa -d <ref_build>_decoy.fa -p <num_threads>`
   - The clusters are written to units (`bayestyper_unit_<unit_id>`) that are genotyped separately; the resulting files are then concatenated using `bcftools concat` (see below).
   - The `<samples>.tsv` file should contain one sample per row.
8. Genotype variant clusters: `bayesTyper genotype -v bayestyper_unit_<unit_id>/variant_clusters.bin -c bayestyper_cluster_data -s <samples>.tsv -g <ref_build>_canon.fa -d <ref_build>_decoy.fa -o bayestyper_unit_<unit_id>/bayestyper -z -p <num_threads>`
9. Concatenate units using bcftools: `bcftools concat -O z -o <output_prefix>.vcf.gz bayestyper_unit_1/bayestyper.vcf.gz bayestyper_unit_2/bayestyper.vcf.gz ...`
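The input requirements of `bayesTyperTools combine` noted above (contig entries in the header, and the same contig order in header and records) can be checked before running the workflow. The Python sketch below is one hypothetical way to do so. It is not part of BayesTyper; the function names and file names are assumptions, and it works on VCF lines that have already been read, so decompression is left to the caller.

```python
import re

CONTIG_ID = re.compile(r"^##contig=<(?:.*?,)?ID=([^,>]+)")


def combine_variant_argument(sources):
    """Build the comma-separated <name>:<file> value passed to -v."""
    return ",".join(f"{name}:{path}" for name, path in sources)


def contig_problems(vcf_lines):
    """Return a list of problems with contig declarations and ordering."""
    header_contigs, record_contigs = [], []
    for line in vcf_lines:
        if line.startswith("##contig="):
            match = CONTIG_ID.match(line)
            if match:
                header_contigs.append(match.group(1))
        elif line.strip() and not line.startswith("#"):
            chrom = line.split("\t", 1)[0]
            # record each run of consecutive entries on one contig once
            if not record_contigs or record_contigs[-1] != chrom:
                record_contigs.append(chrom)
    problems = []
    missing = sorted(set(record_contigs) - set(header_contigs))
    if missing:
        problems.append("no header entry for: " + ", ".join(missing))
    if len(set(record_contigs)) != len(record_contigs):
        problems.append("entries for a contig are not contiguous")
    used = set(record_contigs)
    expected = [c for c in header_contigs if c in used]
    observed = [c for c in dict.fromkeys(record_contigs) if c in header_contigs]
    if expected != observed:
        problems.append("contig order differs between header and entries")
    return problems


# Hypothetical call set names and files
print(combine_variant_argument([("GATK", "s1.vcf"), ("prior", "prior.vcf")]))
```

An empty list from `contig_problems` means the file satisfies both conditions as interpreted here.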
Number of samples | Coverage | Number of variant alleles | Max allele length (nts) | Number of threads | Wall time (h, single node) | Wall time (h, multiple nodes)* | Max memory (GB) |
---|---|---|---|---|---|---|---|
3 | 13x | 21.4M | 500,000 | 28 | 5-6 | 2-3 | 41 |
3 | 13x | 64.4M | 500,000 | 28 | 17-18 | 4-5 | 42 |
13 | 50x | 11.7M | 10,000 | 28 | 31-32 | 16-17 | 66 |
13 | 50x | 61.1M | 10,000 | 28 | 92-93 | 15-16 | 62 |
\* `bayesTyper genotype` can be distributed across nodes on a cluster - between 2 and 11 nodes were used in this benchmark.
The time estimates are for running `bayesTyper cluster` and `bayesTyper genotype` only. Expect <1h combined run-time per sample for counting k-mers by KMC3 and bloom filter creation by bayesTyperTools. All runs were done on a 64-bit Intel Xeon 2.00 GHz machine with 128 GB of memory using v1.3.
Sibbesen JA, Maretty L, The Danish Pan-Genome Consortium & Krogh A: Accurate genotyping across variant classes and lengths using variant graphs. Nature Genetics, 2018. link. *Equal contributors.
Please let us know if you use BayesTyper in your publication - then we will put it on the list.
Please post an issue if you have questions regarding how to run BayesTyper, if you want to report bugs or request new features. You can also reach us at jasi at binf dot ku dot dk or lasse dot maretty at clin dot au dot dk.
We thank the developers of the third-party libraries used by BayesTyper: