23andMe / yhaplo

Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men

Yhaplo | Identifying Y-Chromosome Haplogroups

David Poznik, 23andMe

Overview

yhaplo identifies the Y-chromosome haplogroup of each male in a sample of one to millions. It does not rely on any particular genotyping modality or platform, and it is robust to missing data, genotype errors, mutation recurrence, and other complications. Although full sequences yield the most granular haplogroup classifications, genotyping arrays can yield reliable calls, provided a reasonable number of phylogenetically informative variants has been assayed.

Briefly, haplogroup calling involves two steps. The program first builds an internal representation of the Y-chromosome phylogeny by reading its primary structure from (Newick-formatted) text and importing phylogenetically informative SNPs from the ISOGG database, affiliating each SNP with the appropriate node and growing the tree as necessary. It then traverses the tree for each individual, identifying the path of derived alleles leading to a haplogroup designation.
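
The two steps can be illustrated with a toy sketch. This is not yhaplo's implementation: the `ToyNode` class, the `call_haplogroup` helper, the tree, and the SNP names are all hypothetical, and the traversal is a simple greedy descent that follows whichever child branch has more derived than ancestral observations and stops when no branch has net derived support.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical tree node: a haplogroup label plus the SNPs on its branch."""

    label: str
    snps: list[str] = field(default_factory=list)
    children: list["ToyNode"] = field(default_factory=list)


def call_haplogroup(root: ToyNode, genotypes: dict[str, str]) -> str:
    """Greedy descent along branches supported by derived alleles.

    genotypes maps SNP name -> "D" (derived) or "A" (ancestral).
    SNPs with missing data are simply absent from the mapping.
    """
    node = root
    while True:
        best_child, best_score = None, 0
        for child in node.children:
            derived = sum(genotypes.get(snp) == "D" for snp in child.snps)
            ancestral = sum(genotypes.get(snp) == "A" for snp in child.snps)
            score = derived - ancestral
            if score > best_score:  # require net derived support to descend
                best_child, best_score = child, score
        if best_child is None:
            return node.label  # no supported child branch: stop here
        node = best_child


# Toy phylogeny (illustrative only): A -> BT -> (B, CT)
toy_tree = ToyNode("A", children=[
    ToyNode("BT", snps=["snp1", "snp2"], children=[
        ToyNode("B", snps=["snp3"]),
        ToyNode("CT", snps=["snp4", "snp5"]),
    ]),
])

# snp2 and snp5 are missing for this made-up sample
sample = {"snp1": "D", "snp3": "A", "snp4": "D"}
print(call_haplogroup(toy_tree, sample))
```

In this toy, a sample with no observed derived alleles never leaves the root and is labeled "A", which mirrors the first caveat described under Caveats.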

yhaplo is available for non-commercial use pursuant to the terms of the non-exclusive license agreement, LICENSE.txt. To learn more about the algorithm, please see our bioRxiv preprint:

Poznik GD. 2016. Identifying Y-chromosome haplogroups in arbitrarily large samples
of sequenced or genotyped men. bioRxiv doi: 10.1101/088716

To learn more about the software, please see the manual, yhaplo_manual.pdf.

For an overview of command-line options, install the package and run yhaplo --help.

Installation

Basic installation

To install:

git clone git@github.com:23andMe/yhaplo.git
cd yhaplo
pip install --editable .

To update:

cd /path/to/yhaplo
git pull  # Update code
pip install --editable .  # Update version number

Optional dependencies

To include optional dependencies for various features:

To install multiple optional features, use a comma-separated list. For example:

pip install --editable ".[vcf,plot]"  # Quotes keep shells such as zsh from globbing the brackets

Testing

Running on example data

To run on example text data:

yhaplo --example_text

The --example_text option tells yhaplo to run on a subset of 1000 Genomes data in sample-major text format. It also sets the --all_aux_output flag to produce all auxiliary output.

Similarly, to run on example VCF data:

yhaplo --example_vcf

Unit tests

To run unit tests:

make test

Caveats

Please note the following caveats before running yhaplo:

If, for a given individual, yhaplo observes no derived alleles at ISOGG SNPs on the upper branches of the Y-chromosome phylogeny, it will call the individual haplogroup "A," since all human Y-chromosome lineages are technically sublineages of A. Before concluding that the individual sample belongs to paragroup A (which includes haplogroups A00, A0, A1a, and A1b1), run with the --anc_snps option, and check the auxiliary output for ancestral alleles at haplogroup-BT SNPs. If you do not see any, your data set probably violates one or more of the assumptions listed above.

In particular, "variants-only" VCF files restrict to SNPs at which alternative alleles were observed, but ref/alt status is unimportant to yhaplo. What is important is ancestral/derived status. The reference sequence contains many derived alleles, and yhaplo will not be happy if you discard these valuable data. So please emit all confident sites when calling variants. To limit file size, you could safely restrict to positions in output/isogg.snps.unique.DATE.txt, as these are the only SNPs yhaplo considers. To generate this file, just run yhaplo with no arguments.
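
As a rough sketch of restricting an all-sites VCF to a set of positions: the helper below, `filter_vcf_lines`, is hypothetical and not part of yhaplo. Because the column layout of output/isogg.snps.unique.DATE.txt is not described here, the sketch takes the positions as a plain set of integers rather than parsing that file, and the positions and records shown are made up. It relies only on the VCF convention that records are tab-separated with POS in the second column.

```python
from typing import Iterable, Iterator


def filter_vcf_lines(lines: Iterable[str], keep_positions: set[int]) -> Iterator[str]:
    """Yield VCF header lines plus records whose POS is in keep_positions."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column-header lines
            continue
        fields = line.split("\t", 2)  # CHROM, POS, remainder
        if len(fields) >= 2 and int(fields[1]) in keep_positions:
            yield line


# Made-up records; positions 100 and 300 are reference-matching sites (ALT is ".")
vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "Y\t100\t.\tA\t.\t.\tPASS\t.\tGT\t0\n",
    "Y\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1\n",
    "Y\t300\t.\tG\t.\t.\tPASS\t.\tGT\t0\n",
]
kept = list(filter_vcf_lines(vcf_lines, keep_positions={100, 200}))
```

Note that the record at position 100 is retained even though no alternative allele was observed there. Filtering by position, rather than by ref/alt status, is what preserves sites where the reference allele may be the derived one.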

Input

The following input file types are supported:

In addition, the API supports running on a mapping of individual identifiers to 23andMe ablocks.

Output

All output file formats are described in detail in yhaplo_manual.pdf.

The two primary output files are:

  1. log.project_name.txt: Log file containing details of the run
  2. haplogroups.project_name.txt: Haplogroup calls. The 4 columns are:
    1. ID
    2. Haplogroup short form, with the name of a SNP observed in the derived state
    3. Haplogroup short form, with the name of a representative SNP
    4. Haplogroup long form, using Y-Chromosome Consortium nomenclature
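
A minimal sketch of reading one line of the haplogroup calls into named fields follows. It assumes the four columns are whitespace-delimited, which should be confirmed against yhaplo_manual.pdf. The `HaplogroupCall` type, its field names, and the example line are hypothetical and are not taken from real yhaplo output.

```python
from typing import NamedTuple


class HaplogroupCall(NamedTuple):
    """Hypothetical record mirroring the four documented columns."""

    sample_id: str       # column 1: ID
    hg_observed_snp: str  # column 2: short form, SNP observed in the derived state
    hg_repr_snp: str     # column 3: short form, representative SNP
    hg_long_form: str    # column 4: long form, Y-Chromosome Consortium nomenclature


def parse_haplogroup_line(line: str) -> HaplogroupCall:
    """Split one line on whitespace (an assumption) and name the fields."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, found {len(fields)}: {line!r}")
    return HaplogroupCall(*fields)


# Made-up line for illustration only
example_line = "sample1 R-M269 R-M269 R1b1a2\n"
call = parse_haplogroup_line(example_line)
print(call.sample_id, call.hg_long_form)
```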

yhaplo also produces a number of SNP tables, tree files, and auxiliary output files.
Please see yhaplo_manual.pdf and yhaplo --help for details.

API

See yhaplo/api/call_haplogroups.py.

CLI

The main command-line entry-point is yhaplo. Additional commands include:

Implementation details

Package data

Tree

The primary structure of the Y-chromosome tree is stored in yhaplo/data/tree/y.tree.primary.DATE.nwk.

Variants

Variant metadata are stored in yhaplo/data/variants/:

Classes

Trees

The Tree class is defined in tree.py. It:

Nodes

The Node class is defined in node.py. It:

SNPs

The SNP class and related classes are defined in snp.py:

Samples

The Sample class and its subclasses are defined in sample.py:

Paths

The Path class is defined in path.py. It represents a path through a tree and stores:

Configuration

The Config class is defined in config.py. It is a container for parameters, command-line options, and filenames.