This repository includes a python implementation of G-Nomix, a fast, scalable, and accurate local ancestry method. See demo.
G-Nomix can be used in two ways:
In both cases the models are used to infer local ancestry on provided query data that has already been phased (using a program like beagle, shapeit, or eagle) and pre-processed to have the same sites as the reference training samples on the same strand, or if a pre-trained model is used instead see Pre-Trained Models below for requirements.
To install the software, navigate to the desired folder and enter in the command line interface:
git clone https://github.com/AI-sandbox/gnomix
cd gnomix
The dependencies are listed in requirements.txt. Assuming pip is already installed, they can be installed via
$ pip install -r requirements.txt
The combined runtime for the cloning and the dependency installation should be around 2 minutes on a normal laptop.
The software has been tested in Python 3.7.4 on the following operating systems:
gnomix.py loads and uses a pre-trained G-Nomix model to predict the ancestry for a given
To execute the program with a pre-trained model run:
$ python3 gnomix.py <query_file> <output_folder> <chr_nr> <phase> <path_to_model>
where
In order to incorporate our pre-trained models into your pipeline, please use the following command to download pre-trained models for the whole human genome. The SNPs used for our pre-trained models are also included in the form of a plink .bim file for every chromosome.
sh download_pretrained_models.sh
This creates a folder called pretrained_gnomix_models. For each chromosome, we publish a default_model.pkl which can be used as a pre-trained model in the <path_to_model> field and a .bim file as explained above.
When making predictions, the input to the model is an intersection of the pre-trained model SNP positions and the SNP positions from the
The models named default_model.pkl are trained on hg build 37 references from the following biogeographic regions: Subsaharan African (AFR), East Asian (EAS), European (EUR), Native American (NAT), Oceanian (OCE), South Asian (SAS), and West Asian (WAS) and the model labels and predicts them as 0, 1, .., 6 respectively. The populations used to train these ancestries are given in the supplementary section of the reference provided at the bottom of this readme.
To execute the program when training a model run:
$ python3 gnomix.py <query_file> <output_folder> <chr_nr> <phase> <genetic_map_file> <reference_file> <sample_map_file>
where the first 4 arguments are described above in the pre-trained setting and
The program uses these two files as input to our simulation algorithm (see pyadmix/) to create training data for the model. Also, note that when running inference on the trained models, the <query_file> needs to have the same build as the genetic map used to train the model. (For instance, in the case of humans, it is build37 or build38)
After downloading our pre-trained models, one can demo the software in inference mode by running:
python3 gnomix.py demo/data/small_query_chr22.vcf.gz demo_output 22 True pretrained_gnomix_models/chr22/model_chm_22.pkl
This small query file contains only 9 samples of European, East Asian and African ancestry. The execution should take around a minute on a standard laptop. The inference can be analyzed, for example in the file demo_output/quer_results.msp, where we expect to see those three ancestries being inferred. For more details on those analysis, see the section on output below.
For more demos with training and larger datasets, see the demo notebook demo.ipynb.
More advanced configuration settings can be found in config.yaml. They include general settings, simulation settings and model settings. More details are given in the file itself. If training a model from scratch you can also pass an alternative config file as the last argument:
$ python3 gnomix.py <query_file> <output_folder> <chr_nr> <phase> <genetic_map_file> <reference_file> <sample_map_file> <config_file>
If no config is given, the program uses the default (config.yaml). The config file has advanced training options. Some of the parameters are
For more base + smoother combinations one can edit the gnomix.py file in the following way:
import the base model of choice from src/base/model e.g.,
from src.Base.models import LogisticRegressionBase
import the smoother of choice from src/smooth/model e.g.,
from src.Smooth.models import XGB_Smoother
and then, in the train_model() function in initilize the Gnomix object with the imported models:
model = Gnomix(
...,
base = LogisticRegressionBase,
smooth = XGB_Smoother,
...
)
The results (including predictions, trained models and analysis) are stored in the
The inference is written to two files, one for a single ancestry estimates for each marker (qery_results.msp) and one for probability estimates for each ancestry at each marker (query_results.fb). Below, we describe the both files in more detail.
In the query_results.msp file, the first line is a comment line, that specifies the order and encoding of populations, eg:
The second line specifies the column names, and every following line marks an interval on the genome.
The first 6 columns specify
The remaining columns give the predicted reference panel population for the given interval. A genotype has two haplotypes, so the number of predictions for a genotype is 2(number of genotypes) and therefore the total number of columns in the file is 6 + 2(number of genotypes)
In the query_results.fb file, the first line is a comment line, that specifies the order of the populations, eg:
The second line specifies the column names, and every following line marks an interval on the genome.
The first 4 columns specify
The remaining columns represent the query hapotypes and reference panel population and each line markes the estimated probability of the given genome position coming from the population. A genotype has two haplotypes, so the number of predictions for a genotype is 2(number of genotypes)(number of reference populations) and therefore the total number of columns in the file is 6 + 2(number of genotypes)(number of reference populations).
The query_results.lai is an optional output that includes the inferred ancestry label for each marker in the query file. Please note that this feature is in beta stage and therefore the program does not export this file unless snp_level_inference is set to True in the config.yaml file.
The first line of the output file is a comment line, that specifies the order and encoding of populations, eg:
just like in the msp file.
The second line specifies the column names, and every following line marks a genome position.
The first column indicates the physical position of the SNP and the remaining columns give the predicted reference panel population for the given interval. A genotype has two haplotypes, so the number of predictions for a genotype is 2(number of genotypes) and therefore the total number of columns in the file is 1 + 2(number of genotypes).
When using Gnofix for phasing error correcting (See Phasing below), the inference above will be performed on the query haplotype phased by Gnofix. These phased haplotypes will then also be exported to query_file_phased.vcf in the
To visualize the local ancestry output along the chromosome using tagore for plotting, use the visualize_inference True option in the config file.
When training a model, the resulting model will be stored in
The program simulates training data and stores it in
Accurate phasing of genomic data is crucial for human demographic modeling and identity-by-descent analyses. It has been shown that leveraging information about an individual’s genomic ancestry improves performance of current phasing algorithms. Gnofix is a method that uses local ancestry inference to do exactly that. If you suspect your data might have phasing errors (generally the case unless trio phasing was possible), we recommend using this option <phase> as True. See the gnofix/ folder if interested in more details on the algorithm.
Sequenced haplotypes phased with a phasing software (left). LAI is used to label haplotypes with ancestry predictions and phasing errors become evident (center). Phasing error correction using LAI is applied to correct phasing errors (right). Small numbers of phasing errors do not, however, impact the correct association of a variant with an ancestry, and so are typically only a visual nuisance.
To ensure that G-Nomix outputs probability estimates that reflect it's true confidence and accuracy, we recommend using calibration. We use Isotonic Regression to map the predicted probabilities to calibrated probabilities where the latter are more likely to have predictions with a confidence of X% correct matching their actual X% frequency of being correct in practice.
NOTICE: This software is available free of charge for academic research use only. Commercial users and for profit companies or consultants can use the features present in this software by contacting Galatea Bio, to which Stanford Office of Technology Licensing has exclusively licensed technology used in this package. Academic users may fork this repository and modify and improve to suit their research needs, but also inherit these terms and must include a licensing notice to this effect.
https://www.biorxiv.org/content/10.1101/2021.09.19.460980v1
@article {Hilmarsson2021.09.19.460980,
author = {Hilmarsson, Helgi and Kumar, Arvind S and Rastogi, Richa and Bustamante, Carlos D and Mas Montserrat, Daniel and Ioannidis, Alexander G},
title = {High Resolution Ancestry Deconvolution for Next Generation Genomic Data},
elocation-id = {2021.09.19.460980},
year = {2021},
doi = {10.1101/2021.09.19.460980},
publisher = {Cold Spring Harbor Laboratory},
abstract = {As genome-wide association studies and genetic risk prediction models are extended to globally diverse and admixed cohorts, ancestry deconvolution has become an increasingly important tool. Also known as local ancestry inference (LAI), this technique identifies the ancestry of each region of an individual{\textquoteright}s genome, thus permitting downstream analyses to account for genetic effects that vary between ancestries. Since existing LAI methods were developed before the rise of massive, whole genome biobanks, they are computationally burdened by these large next generation datasets. Current LAI algorithms also fail to harness the potential of whole genome sequences, falling well short of the accuracy that such high variant densities can enable. Here we introduce G-Nomix, a set of algorithms that address each of these points, achieving higher accuracy and swifter computational performance than any existing LAI method, while also enabling portable models that are particularly useful when training data are not shareable due to privacy or other restrictions. We demonstrate G-Nomix (and its swift phase correction counterpart Gnofix) on worldwide whole-genome data from both humans and canids and utilize its high resolution accuracy to identify the location of ancient New World haplotypes in the Xoloitzcuintle, dating back over 100 generations. Code is available at https://github.com/AI-sandbox/gnomixCompeting Interest StatementCDB is the founder and CEO of Galatea Bio Inc and on the boards of Genomics PLC and Etalon.},
URL = {https://www.biorxiv.org/content/early/2021/09/21/2021.09.19.460980},
eprint = {https://www.biorxiv.org/content/early/2021/09/21/2021.09.19.460980.full.pdf},
journal = {bioRxiv}
}