vcfdo
: do stuff with a VCF fileTo use vcfdo
requires a working Python 3 installation as well as the following packages:
Assuming the user either can write to the global site-packages
directory (ie. has root access or a conda
-type environment), vcfdo
can be installed straight from Github with pip
:
pip install git+https://github.com/IDEELResearch/vcfdo
Dependencies will be automatically installed or updated as required. The main executable, also called vcfdo
, will be made available on the user's $PATH
.
Utilities in vcfdo
are compartmentalized into sub-commands in the style of samtools
or bcftools
. To see a list, run vcfdo --help
to see a message like:
usage:
vcfdo <command> [options]
Commands:
--- Basic statistics
missing rates of missing calls by site or by sample
depthsummary quantiles of read depth (at called sites) per sample
filtersummary tallies of filter status by site or by sample
--- Downsampling
thin just emit every nth site
prune greedy LD-pruning in sliding windows
--- Ancestral alleles
polarize define ancestral alleles based on outgroup sequence
titv annotate sites as transitions or transversions; flag gBGC candidates
spectrum tally mutational spectrum: count SNVs by type (eg. A>C, C>T, ...)
derived count derived alleles per sample, using either read counts or hard calls
--- Allele frequencies
wsaf calculate within-sample allele frequency (WSAF) and related quantities from read counts
fws estimate pseudo-inbreeding coefficient F_ws ("within-sample relatedness")
wsafhist calculate histogram of WSAFs within samples for QC
private annotate sites where non-ref allele is only found in certain subset of samples
sfs approximate the n-dimensional unfolded SFS by sampling, possibly in windows
--- Genotyping
haplocall convert diploid to pseudo-haploid call by taking consensus allele at each site
--- Relatedness/ordination
dist calculate LD-weighted pairwise distances from WSAFs
ibs calculate simple identity-by-state (IBS) matrix from hard calls at polymorphic sites only
ibd calculate identity-by-descent (IBD) matrix from WSAFs and Malecot's (1970) definition of IBD
pca perform PCA using within-sample allele frequencies instead of hard calls
--- Admixture
f3stat calculate Patterson's normalized f_3 statistic
f4stat calculate Patterson's normalized f_4 statistic, also known as the D-statistic
treemix make input file for `TreeMix` software
Some options are common to all commands; these are
-h, --help show this help message and exit
-i INFILE, --infile INFILE
VCF file, possibly gzipped [default: stdin]
-n NSITES, --nsites NSITES
stop after processing this many sites [default: no
limit]
--ref-is-ancestral assume reference allele is the ancestral one, ignoring
INFO/AA (USE WITH CAUTION)
--chunk-size CHUNK_SIZE
when loading genotypes into memory, process this many
sites per chunk [default: 2000]
--seed SEED random number seed for reproducibility, values < 0 use
auto seed; only affects stuff involving sampling
[default: 1]
-q, --quiet see fewer logging messages
-v, --verbose see more logging messages
All of the utilities in vcfdo
work on streams, and are intended to be used that way. The goal is to store on disk only those attributes of a VCF that are invariant under subsetting by rows (sites) or columns (samples). Everything else can be calculated on the fly as needed. Simple filtering operations such as those available in bcftools view
are not duplicated in vcfdo
-- wherever possible, it will be more efficient to do what you can with bcftools view
and stream the result to vcfdo
.
On that note, the VCF parser used in vcfdo
is cyvcf2
, which itself is a thin Python wrapper around htslib
, so any VCF that is compatible with bcftools
should work with vcfdo
.
For example, suppose we have a (gzipped) VCF file.vcf.gz
and a list of samples of interest in samples.txt
. To perform PCA on the within-sample allele frequencies (WSAF), keeping only sites with population-level minor allele frequency (PLMAF) > 0.01, the command would be:
bcftools view --samples-file samples.txt file.vcf.gz | \
vcfdo wsaf | \
bcftools view -i 'INFO/PLMAF > 0.01' | \
vcfdo pca -o my_pca
This produces a PCA result in a pair of files my_pca.pca
(sample projections) and my_pca.pca.ev
(eigenvalues).
-n {number of sites to visit}
. It is recommended to test-drive commands on a small number of sites before unleashing them in a multi-gigabyte file.--quiet
. To see lower-level debugging messages, pass --verbose
.stdout
.stdin
. An unfortunate side-effect that I haven't been able to fix is that calling vcfdo <command>
with no arguments and no piped input will cause the command-line to hang until the user interrupts it with Ctrl-D
(not Ctrl-C
; why??)vcfdo
will check sample names against the VCF header and remove duplicates. Samples requested but not present in the VCF are ignored with a warning.As the target audience for this tool is people working on Plasmodium spp, heavy use is made of some quantities that are bespoke to malariologists. These include:
FORMAT
field)
INFO
field)
These were introduced (I think) in Manske et al (2012) Nature Genetics 487: 375-379 (PMC3738909).
To avoid spawning not-a-number warnings in downstream tools, vcfdo
encodes missing values for these quantities as -1.
The normalized f3 and f4 (aka D) statistics calculated by vcfdo f3stat
and vcfdo f4stat
, respectively, use the equations in Patterson et al (2012) Genetics 192: 1065-1093 (PMC3522152). Missing or invalid values for the site-wise allele frequencies that enter the estimators are silently ignored with the hope that the number of sites analyzed is sufficient that the rate of missingness will be negligible. I have not confirmed that my code gives equivalent results to ADMIXTOOLS
.
The design of vcfdo
is modular: multiple tools may need to be strung together to accomplish a given task. In most cases this comes down to creating and then using site- or genotype-level annotations in the VCF. The table below summarises the annotations required and/or produced by each tool in vcfdo
. Annotations in the recalculates column are re-computed on the fly before doing any work, but are not updated in the source VCF.
tool | requires | adds | recalculates |
---|---|---|---|
thin |
none | none | none |
prune |
none | none | none |
wsaf |
FORMAT/AD |
FORMAT/WSAF , FORMAT/WSMAF , INFO/PLAF , INFO/PLMAF |
none |
fws |
FORMAT/WSAF |
none | INFO/PLMAF |
pca |
FORMAT/WSAF |
none | INFO/PLMAF |
dist |
FORMAT/WSAF |
none | none |
ibd |
FORMAT/WSAF |
none | INFO/PLMAF |
ibs |
none | none | none |
polarize |
none | INFO/AA |
none |
titv |
INFO/AA |
INFO/Transversion , INFO/StrongWeak , INFO/BGC |
none |
derived |
INFO/AA |
none | none |
private |
FORMAT/WSAF (with --reads ) |
INFO/{flag} , where {flag} is specified by user |
none |
sfs |
INFO/AA (unless --ref-is-ancestral ), FORMAT/WSAF (with --reads ) |
none | none |
Given the constraints implied above, a reasonable compromise between time wasted on re-calculation and space wasted on a bloated VCF would be to run vcfdo wsaf
and vcfdo polarize
on a "master" call set and save the result. Then, since INFO/AA
(ancestral allele) and FORMAT/WSAF
(within-sample allele frequency) are invariant to subsetting on rows or columns, all other analyses can be done on a stream later.
Most operations in vcfdo
are deterministic up to floating-point weirdness.
The allele-count sampler in vcfdo sfs
uses the Mersenne twister RNG in Numpy. The seed can be set on the command line with --seed
, and is fixed to 1
by default -- which means successive calls with the same input will return the same output. That's good for reproducibility but definitely not the desired behavior if one actually wants to sample from the SFS. To allow auto-seeding of the RNG (ie. allow different output each run) pass a seed strictly less than 0.
If you uncover an error or unexpected behavior, of which there will be plenty, please file a Github issue and contact the author via Slack.