IDEELResearch / vcfdo

Some tools for processing and annotating VCF files, with focus on Plasmodium spp
MIT License
6 stars 4 forks source link

vcfdo: do stuff with a VCF file

Installation

To use vcfdo requires a working Python 3 installation as well as the following packages:

Assuming the user either can write to the global site-packages directory (ie. has root access or a conda-type environment), vcfdo can be installed straight from Github with pip:

pip install git+https://github.com/IDEELResearch/vcfdo

Dependencies will be automatically installed or updated as required. The main executable, also called vcfdo, will be made available on the user's $PATH.

Use patterns

Utilities in vcfdo are compartmentalized into sub-commands in the style of samtools or bcftools. To see a list, run vcfdo --help to see a message like:

usage:
vcfdo <command> [options]

Commands:

  --- Basic statistics
      missing           rates of missing calls by site or by sample
      depthsummary      quantiles of read depth (at called sites) per sample
      filtersummary     tallies of filter status by site or by sample

  --- Downsampling
      thin              just emit every nth site
      prune             greedy LD-pruning in sliding windows

  --- Ancestral alleles
      polarize          define ancestral alleles based on outgroup sequence
      titv              annotate sites as transitions or transversions; flag gBGC candidates
      spectrum          tally mutational spectrum: count SNVs by type (eg. A>C, C>T, ...)
      derived           count derived alleles per sample, using either read counts or hard calls

  --- Allele frequencies
      wsaf              calculate within-sample allele frequency (WSAF) and related quantities from read counts
      fws               estimate pseudo-inbreeding coefficient F_ws ("within-sample relatedness")
      wsafhist          calculate histogram of WSAFs within samples for QC
      private           annotate sites where non-ref allele is only found in certain subset of samples
      sfs               approximate the n-dimensional unfolded SFS by sampling, possibly in windows

  --- Genotyping
      haplocall         convert diploid to pseudo-haploid call by taking consensus allele at each site

  --- Relatedness/ordination
      dist              calculate LD-weighted pairwise distances from WSAFs
      ibs               calculate simple identity-by-state (IBS) matrix from hard calls at polymorphic sites only
      ibd               calculate identity-by-descent (IBD) matrix from WSAFs and Malecot's (1970) definition of IBD
      pca               perform PCA using within-sample allele frequencies instead of hard calls

  --- Admixture
      f3stat            calculate Patterson's normalized f_3 statistic
      f4stat            calculate Patterson's normalized f_4 statistic, also known as the D-statistic
      treemix           make input file for `TreeMix` software

Some options are common to all commands; these are

-h, --help            show this help message and exit
-i INFILE, --infile INFILE
                      VCF file, possibly gzipped [default: stdin]
-n NSITES, --nsites NSITES
                      stop after processing this many sites [default: no
                      limit]
--ref-is-ancestral    assume reference allele is the ancestral one, ignoring
                      INFO/AA (USE WITH CAUTION)
--chunk-size CHUNK_SIZE
                      when loading genotypes into memory, process this many
                      sites per chunk [default: 2000]
--seed SEED           random number seed for reproducibility, values < 0 use
                      auto seed; only affects stuff involving sampling
                      [default: 1]
-q, --quiet           see fewer logging messages
-v, --verbose         see more logging messages

All of the utilities in vcfdo work on streams, and are intended to be used that way. The goal is to store on disk only those attributes of a VCF that are invariant under subsetting by rows (sites) or columns (samples). Everything else can be calculated on the fly as needed. Simple filtering operations such as those available in bcftools view are not duplicated in vcfdo -- wherever possible, it will be more efficient to do what you can with bcftools view and stream the result to vcfdo.

On that note, the VCF parser used in vcfdo is cyvcf2, which itself is a thin Python wrapper around htslib, so any VCF that is compatible with bcftools should work with vcfdo.

For example, suppose we have a (gzipped) VCF file.vcf.gz and a list of samples of interest in samples.txt. To perform PCA on the within-sample allele frequencies (WSAF), keeping only sites with population-level minor allele frequency (PLMAF) > 0.01, the command would be:

bcftools view --samples-file samples.txt file.vcf.gz | \
vcfdo wsaf | \
bcftools view -i 'INFO/PLMAF > 0.01' | \
vcfdo pca -o my_pca

This produces a PCA result in a pair of files my_pca.pca (sample projections) and my_pca.pca.ev (eigenvalues).

General remarks

Definitions and conventions

As the target audience for this tool is people working on Plasmodium spp, heavy use is made of some quantities that are bespoke to malariologists. These include:

These were introduced (I think) in Manske et al (2012) Nature Genetics 487: 375-379 (PMC3738909).

To avoid spawning not-a-number warnings in downstream tools, vcfdo encodes missing values for these quantities as -1.

The normalized f3 and f4 (aka D) statistics calculated by vcfdo f3stat and vcfdo f4stat, respectively, use the equations in Patterson et al (2012) Genetics 192: 1065-1093 (PMC3522152). Missing or invalid values for the site-wise allele frequencies that enter the estimators are silently ignored with the hope that the number of sites analyzed is sufficient that the rate of missingness will be negligible. I have not confirmed that my code gives equivalent results to ADMIXTOOLS.

Order of operations

The design of vcfdo is modular: multiple tools may need to be strung together to accomplish a given task. In most cases this comes down to creating and then using site- or genotype-level annotations in the VCF. The table below summarises the annotations required and/or produced by each tool in vcfdo. Annotations in the recalculates column are re-computed on the fly before doing any work, but are not updated in the source VCF.

tool requires adds recalculates
thin none none none
prune none none none
wsaf FORMAT/AD FORMAT/WSAF, FORMAT/WSMAF, INFO/PLAF, INFO/PLMAF none
fws FORMAT/WSAF none INFO/PLMAF
pca FORMAT/WSAF none INFO/PLMAF
dist FORMAT/WSAF none none
ibd FORMAT/WSAF none INFO/PLMAF
ibs none none none
polarize none INFO/AA none
titv INFO/AA INFO/Transversion, INFO/StrongWeak, INFO/BGC none
derived INFO/AA none none
private FORMAT/WSAF (with --reads) INFO/{flag}, where {flag} is specified by user none
sfs INFO/AA (unless --ref-is-ancestral), FORMAT/WSAF (with --reads) none none

Given the constraints implied above, a reasonable compromise between time wasted on re-calculation and space wasted on a bloated VCF would be to run vcfdo wsaf and vcfdo polarize on a "master" call set and save the result. Then, since INFO/AA (ancestral allele) and FORMAT/WSAF (within-sample allele frequency) are invariant to subsetting on rows or columns, all other analyses can be done on a stream later.

Reproducibility

Most operations in vcfdo are deterministic up to floating-point weirdness.

The allele-count sampler in vcfdo sfs uses the Mersenne twister RNG in Numpy. The seed can be set on the command line with --seed, and is fixed to 1 by default -- which means successive calls with the same input will return the same output. That's good for reproducibility but definitely not the desired behavior if one actually wants to sample from the SFS. To allow auto-seeding of the RNG (ie. allow different output each run) pass a seed strictly less than 0.

Bug reports

If you uncover an error or unexpected behavior, of which there will be plenty, please file a Github issue and contact the author via Slack.