Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0
46 stars 11 forks source link

PCAngsd

Version 1.35

Framework for analyzing low-depth next-generation sequencing (NGS) data in heterogeneous/structured populations using principal component analysis (PCA). Population structure is inferred by estimating individual allele frequencies in an iterative approach using a truncated SVD model. The covariance matrix is estimated using the estimated individual allele frequencies as prior information for the unobserved genotypes in low-depth NGS data.

The estimated individual allele frequencies can further be used to account for population structure in other probabilistic methods. PCAngsd can perform the following analyses:

Get PCAngsd and build

Dependencies

The PCAngsd software relies on the following Python packages that you can install through conda (recommended) or pip:

You can create an environment through conda easily or install dependencies through pip as follows:

# Conda environment
conda env create -f environment.yml

# pip
pip3 install --user -r requirements.txt

Install and build

git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd
pip3 install .

You can now run PCAngsd with the pcangsd command.

Usage

Running PCAngsd

PCAngsd works directly on genotype likelihood files or PLINK files.

# See all options
pcangsd -h

# Genotype likelihood file in Beagle format with 2 eigenvectors using 64 threads
pcangsd -b input.beagle.gz -e 2 -t 64 -o pcangsd
# Outputs by default log-file (pcangsd.log) and covariance matrix (pcangsd.cov)

# PLINK files (using file-prefix, *.bed, *.bim, *.fam)
pcangsd -p input.plink -e 2 -t 64 -o pcangsd

# Perform PC-based selection scan and estimate admixture proportions
pcangsd -b input.beagle.gz -e 2 -t 64 -o pcangsd --selection --admix
# Outputs the following files:
# log-file (pcangsd.log)
# covariance matrix (pcangsd.cov)
# selection statistics (pcangsd.selection)
# admixture proportions (pcangsd.admix.3.Q)
# ancestral allele frequencies (pcangsd.admix.3.F)

PCAngsd will output most files in text-format.

Quick example of reading output and creating PCA plot in R:

C <- as.matrix(read.table("pcangsd.cov")) # Reads estimated covariance matrix
D <- as.matrix(read.table("pcangsd.selection")) # Reads PC based selection statistics

# Plot PCA plot
e <- eigen(C)
plot(e$vectors[,1:2], xlab="PC1", ylab="PC2", main="PCAngsd")

# Obtain p-values from PC-based selection scan
p <- pchisq(D, 1, lower.tail=FALSE)

Read files in python and create PCA plot using matplotlib:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import chi2
C = np.loadtxt("pcangsd.cov") # Reads estimated covariance matrix
D = np.loadtxt("pcangsd.selection") # Reads PC based selection statistics

# Plot PCA plot
evals, evecs = np.linalg.eigh(C)
evecs = evecs[:,::-1]
plt.scatter(evecs[:,0], evecs[:,1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCAngsd")
plt.show()

# Obtain p-values from PC-based selection scan
p = chi2.sf(D, 1)

Beagle genotype likelihood files can be generated from BAM files using ANGSD. For inference of population structure in genotype data with non-random missigness, we recommend our EMU software that performs accelerated EM-PCA, however with fewer functionalities than PCAngsd.

Citation

Please cite our papers:

Population structure: Inferring Population Structure and Admixture Proportions in Low-Depth NGS Data.\ HWE test: Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data.\ Selection: Detecting Selection in Low-Coverage High-Throughput Sequencing Data using Principal Component Analysis.