dipetkov / eems

Estimating Effective Migration Surfaces
GNU General Public License v2.0
101 stars 28 forks source link

The diffs matrix is not full rand. #12

Open Jia21 opened 7 years ago

Jia21 commented 7 years ago

Dear author,

I met a problem that shows the diffs matrix is not full rank. Because my snps matrix only included the mtDNA bi-allelic SNPs, and they were just converted into the 0|0 or 1|1 format, I'm not sure whether it was related to this. if not, how to solve the non-full rank diff matrix issue? Could you please help? Thank you.

dipetkov commented 7 years ago

Hello

Can you say how many individuals and (bi-alleleic) SNPs you have in your dataset, after the transformation?

Desi

Jia21 commented 7 years ago

There are 1174 individuals and 1849 bi-allelic SNPs in my data.

dipetkov commented 7 years ago

In your dataset, there are more SNPs than individuals, so the dissimilarity matrix diffs should be a valid full-rank distance matrix if there are no missing genotypes. Usually however there is some missingness.

There are (at least) two ways to deal with missing genotypes:

  1. Ignore missing genotypes. That is, for a pair of individuals i and j, compute their dissimilarity only based on the subset of SNPs for which we know the genotype of both i and j.
  2. Impute missing genotypes with the observe mean genotype at the corresponding SNP. That is, impute NAs with 2*p_hat where p_hat is the observed allele frequency.

Method 1 doesn't necessarily produce a valid distance matrix. Method 2 produces a valid distance matrix as long as there are more SNPs than individuals. (There are some techinical condition about linear independence between the SNPs.)

If you have your dataset in plink format, then you can use the small bed2diffs program in this repo, to compute the dissimilarity matrix using either method 1 (bed2diffs_v1) or method 2 (bed2diffs_v2).

The rest of this post is some R code that you can use to understand the difference between the two versions better.

library("Matrix")

# Generate an nIndiv-by-nSites genotype matrix (with entries equal to 0, 1 or 2)
# Optionally, set (100 * propNAs)% of the genotypes to missing (NA)
generate_genotypes <- function(nIndiv, nSites, propNAs) {
  X <- rbinom(nIndiv * nSites, 2, 0.5)
  if (propNAs > 0) {
    nb_NAs <- round(nIndiv * nSites * propNAs)
    idx_NAs <- sample.int(nIndiv * nSites, size = nb_NAs, replace = FALSE)
    X[idx_NAs] <- NA
  }
  matrix(X, nrow = nIndiv, ncol = nSites)
}

# Implement `bed2diffs-v1`, which uses the "pairwise.complete.obs" method to compute differences. 
# For each pair of individuals i and j, compute their mean difference using only those SNPs
# for which we know the genotype of both i and j.
# This can result in a difference matrix that is not conditionally negative definite.
# This implementation uses a double loop so will be very slow for large genotype matrices.
bed2diffs_v1 <- function(genotypes) {
  nIndiv <- nrow(genotypes)
  nSites <- ncol(genotypes)
  diffs <- matrix(0, nIndiv, nIndiv)
  for (i in seq(nIndiv - 1)) {
    for (j in seq(i + 1, nIndiv)) {
      x <- genotypes[i, ]
      y <- genotypes[j, ]
      diffs[i, j] <- mean((x - y)^2, na.rm = TRUE)
      diffs[j, i] <- diffs[i, j]
    }
  }
  diffs
}

# Implement `bed2diffs-v2`, which imputes NAs with the observed genotype means.
# The dissimilarities matrix will be a valid distance matrix as long as nSites >= nIndiv and
# the SNPs are linearly independent.
# Technically, a matrix D is an Euclidean distance matrix if there exists a matrix X such that 
# Dij = (Xi - Xj) * (Xi - Xj) for all i, j. (Here the Xi's are genotype vectors.) 
# Since we impute NAs with the observed genotype means, we construct an X to satisfy 
# the definition and so Diffs is a valid distance matrix.
bed2diffs_v2 <- function(genotypes) {
  nIndiv <- nrow(genotypes)
  nSites <- ncol(genotypes)
  # a row of means
  means <- colMeans(genotypes, na.rm = TRUE)
  # a matrix with nIndiv identical rows of means
  means <- matrix(means, nrow = nIndiv, ncol = nSites, byrow = TRUE)
  # Set the means that correspond to observed genotypes to 0
  means[is.na(genotypes) == 0] <- 0 
  # Set the missing genotypes to 0 (used to be NA) 
  genotypes[is.na(genotypes) == 1] <- 0 
  genotypes <- genotypes + means
  # Compute similarities
  similarities <- genotypes %*% t(genotypes) / nSites
  self_sim <- diag(similarities) # self-similarities
  vector1s <- rep(1, nIndiv) # vector of 1s
  # The difference between x and y is (x - y)^2 = x^2 + y^2 - 2 * x * y
  self_sim %*% t(vector1s) + vector1s %*% t(self_sim) - 2 * similarities
}

# Function to check if a matrix D is a valid Euclidean distance matrix
# by checking whether the matrix is conditionally negative definite:
# D must be symmetric, nonnegative, with diagonal of 0s,
# with 1 positive eigenvalue and n-1 negative eigenvalues
is_distmat <- function(D) {
  a <- min(D >= 0)                       # Check that D is nonnegative
  b <- min(diag(D) >= 0 & diag(D) <= 0 ) # Check that D has a diagonal of 0s
  c <- isSymmetric(D)                    # Check that D is symmetric
  n <- nrow(D)
  val <- eigen(D)$values                 # Compute the eigenvalues
  tol <- 1e-12
  negative <- sum(val < -tol)            # How many eigenvalues are positive?
  positive <- sum(val > +tol)            # How many eigenvalues are positive?
  (a & b & c) & (positive == 1) & (negative == n-1)
}

# Both bed2diffs_v1 and bed2diffs_v2 produce the same distance matrix 
# if there are no missing genotypes at all.
#
# If nSites >= nIndiv, bed2diffs_v2 produces a valid distance matrix 
# even if some genotypes are missing.
#
# Even if nSites >= nIndiv, bed2diffs_v1 might not produce a valid distance matrix.
# This depends on the number of SNPs and the pattern of missingness. 
# If genotypes are missing at random and there are many more SNPs that individuals, 
# bed2diffs_v1 is likely to produce a valid distance matrix as well.

# 100 individuals, 120 SNPs, 10% missingness
# Version 2 produces a valid distance matrix, version 1 does not.
set.seed(1234)
nIndiv <- 100
nSites <- 120
propNAs <- 0.1

X <- generate_genotypes(nIndiv, nSites, propNAs) 
Diffs_v1 <- bed2diffs_v1(X)
Diffs_v2 <- bed2diffs_v2(X)

is_distmat(Diffs_v1)
is_distmat(Diffs_v2)

# 100 individuals, 1000 SNPs, 10% missingness
# There are many more SNPs than individuals, 
# so both version produce valid (though different) distance matrices
set.seed(1234)
nIndiv <- 100
nSites <- 1000
propNAs <- 0.1

X <- generate_genotypes(nIndiv, nSites, propNAs) 
Diffs_v1 <- bed2diffs_v1(X)
Diffs_v2 <- bed2diffs_v2(X)

is_distmat(Diffs_v1)
is_distmat(Diffs_v2)
JimStarrett commented 2 years ago

I generated the attached diffs file with the bed2diffs-v2 script in R to account for missing data, but get the error below when I run eems. There are no negative values, and the zeros are diagonal. I added the .txt so the file could be attached. Thank you.

Error: The dissimilarity matrix is not a full-rank distance matrix.

pro_diffs.diffs.txt