fgvieira / ngsF-HMM

Estimation of per-individual inbreeding tracts under a probabilistic framework
GNU General Public License v3.0

ngsF-HMM

ngsF-HMM is a program to estimate per-individual inbreeding tracts using a two-state Hidden Markov Model (HMM). Instead of relying on called genotypes, it uses a probabilistic framework that takes the uncertainty of genotype assignment into account, making it especially suited for low-quality or low-coverage datasets.

Citation

ngsF-HMM was published in Bioinformatics in 2016, so please cite it if you use it in your work:

Vieira FG, Albrechtsen A and Nielsen R
Estimating IBD tracts from low coverage NGS data
Bioinformatics (2016) 32: 2096-2102

Installation

ngsF-HMM can be easily installed but has some external dependencies:

To install the entire package just download the source code:

% git clone https://github.com/fgvieira/ngsF-HMM.git

and run:

% cd ngsF-HMM
% make

To run the tests (only if installed through ngsTools):

% make test

Executables are built into the main directory. If you wish to clean all binaries and intermediate files:

% make clean

Usage

% ./ngsF-HMM [options] --n_ind INT --n_sites INT --glf glf/in/file --out output/file

Parameters

Input data

As input, ngsF-HMM accepts genotypes, genotype likelihoods (GL) or genotype posterior probabilities (GP). Genotypes must be input as a gzipped TSV with one row per site and one column per individual (an n_sites x n_ind matrix), with genotypes coded as [-1, 0, 1, 2]. The file can have a header and an arbitrary number of columns preceding the actual data (all of which will be ignored), much like the Beagle file format. As for GL and GP, ngsF-HMM accepts both gzipped TSV and binary formats, but with 3 columns per individual (3 x n_ind x n_sites values in total) and, in the case of binary, the GL/GP coded as doubles.
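The two layouts described above can be illustrated with a minimal Python sketch. It is not part of ngsF-HMM: the function names and toy values are hypothetical, the meaning of -1 as "missing" is an assumption, and the binary ordering (site by site, then individual, then the three genotypes, in native byte order and without a header) is also an assumption that should be checked against the program before use. The scale of the GL/GP values (log or not) is not specified here, so the numbers are placeholders.

```python
import gzip
import struct

# Toy data: 3 sites x 2 individuals (-1 is assumed to mean "missing").
genotypes = [[0, 1], [2, -1], [1, 1]]

# Toy GL/GP values: 3 numbers per individual per site (placeholder scale).
gl = [
    [[0.90, 0.09, 0.01], [0.10, 0.80, 0.10]],
    [[0.01, 0.09, 0.90], [0.30, 0.40, 0.30]],
    [[0.20, 0.70, 0.10], [0.25, 0.50, 0.25]],
]


def write_geno_tsv(path, geno):
    """Write called genotypes as gzipped TSV: one row per site, one column per individual."""
    with gzip.open(path, "wt") as fh:
        for site in geno:
            fh.write("\t".join(str(g) for g in site) + "\n")


def write_gl_binary(path, values):
    """Write GL/GP values as raw doubles, three per individual per site.

    Assumed layout: site-major, then individual, then genotype.
    """
    with open(path, "wb") as fh:
        for site in values:
            for ind in site:
                fh.write(struct.pack("3d", *ind))


def read_gl_binary(path, n_ind, n_sites):
    """Read back the layout produced by write_gl_binary into nested lists."""
    with open(path, "rb") as fh:
        flat = struct.unpack(f"{3 * n_ind * n_sites}d", fh.read())
    it = iter(flat)
    return [[[next(it), next(it), next(it)] for _ in range(n_ind)]
            for _ in range(n_sites)]
```

Calling `write_geno_tsv("geno.tsv.gz", genotypes)` produces a three-row, two-column gzipped file, and `write_gl_binary("gl.bin", gl)` produces a file of 3 x 2 x 3 doubles (144 bytes).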

Stopping Criteria

An issue with iterative algorithms is choosing the stopping criterion. ngsF-HMM implements a dual-condition threshold: the relative difference in log-likelihood and the RMSD of the estimates (F and freq). As for which threshold to use, simulations show that 1e-5 seems to be a reasonable value. However, if you're dealing with low-coverage data (2x-3x), it might be worth using lower thresholds (between 1e-6 and 1e-9).
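As a rough illustration of such a dual-condition check, the Python sketch below tests both quantities against one threshold. It is not the code used by ngsF-HMM: the function names and the `threshold` parameter are hypothetical, and it assumes that both conditions must hold, that the same threshold applies to both, and that the F and freq estimates are compared as one concatenated vector.

```python
import math


def rel_loglik_diff(prev_lkl, curr_lkl):
    """Relative change in log-likelihood between two iterations (prev_lkl must be non-zero)."""
    return abs(curr_lkl - prev_lkl) / abs(prev_lkl)


def rmsd(prev, curr):
    """Root-mean-square deviation between two equally long vectors of estimates."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)) / len(curr))


def has_converged(prev_lkl, curr_lkl, prev_est, curr_est, threshold=1e-5):
    """Assumed rule: stop only when both measures fall below the threshold."""
    return (rel_loglik_diff(prev_lkl, curr_lkl) < threshold
            and rmsd(prev_est, curr_est) < threshold)
```

With a stricter `threshold` such as 1e-8, the same check simply requires more iterations before both measures fall below it, which is the trade-off behind using lower thresholds for low-coverage data.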

To avoid convergence to local maxima, ngsF-HMM should be run several times from different starting points. To make this task easier, a script (ngsF-HMM.sh) is provided that can be called with the exact same parameters as ngsF-HMM.

Output files

ngsF-HMM will output several files, some depending on input options:

Thread pool

The thread pool implementation was adapted from Mathias Brossard's threadpool, which is freely available from: https://github.com/mbrossard/threadpool