dipetkov / eems

Estimating Effective Migration Surfaces
GNU General Public License v2.0
102 stars 28 forks source link

error starting EEMS #23

Closed Angel-Popa closed 6 years ago

Angel-Popa commented 6 years ago

Dear author,

I was trying to use runeems_sats and I have run into some issues. I am working with DArTseq data, using the silicoDArT which in essence is a dataset of microsatellites. In this specific case, I have 57326 loci for 158 samples. I exported all the files I need it to work with EEMS, and after loading them I get the following error (see below).

Thanks in advance,

[Habitat::initialize] Done.

[Graph::initialize] Generate population grid and sample assignment Loaded sample coordinates from ../data/qfly-nIndiv158-nSites57326.coord There are 20 observed demes (out of 170 demes) The population grid has 170 demes and 446 edges There are 158 samples assigned to 20 observed demes [Graph::initialize] Done.

[Diffs::initialize] Read genotype data from ../data/qfly-nIndiv158-nSites57326.diffs [Diffs::initialize] Done.

Initialize EEMS random state

[EEMS::initialize_state] EEMS starts with 30 qtiles and 25 mtiles [EEMS::initialize_state] Done.

Input parameters: datapath = ../data/qfly-nIndiv158-nSites57326 mcmcpath = ../data/qfly-nIndiv158-nSites57326-EEMS-nDemes200-chain1 prevpath = gridpath = distance = euclidean diploid = 0 nIndiv = 158 nSites = 57326 nDemes = 200 seed = 1518919729 numMCMCIter = 2000000 numBurnIter = 1000000 numThinIter = 9999 negBiSize = 10 negBiProb = 0.670000 qVoronoiPr = 0.250000 mrateShape = 0.000500 qrateShape = 0.002000 sigmaShape = 0.001000 qrateScale = 0.500000 mrateScale = 2.000000 sigmaScale = 1.000000 mSeedsProposalS2 = 0.010000 qSeedsProposalS2 = 0.100000 mEffctProposalS2 = 0.100000 qEffctProposalS2 = 0.001000 mrateMuProposalS2 = 0.010000

Initial log prior: -32923.86 Initial log llike: inf

dipetkov commented 6 years ago

The microsatellite datasets I've tried eems on typically consists of a few dozens of microsatellites. The African elephant dataset for example has 16 microsatellites. Your dataset has 57326 loci. Can you describe in more detail how these loci are derived? (In short I am wondering whether runeems_sats is appropriate for your data and whether there might be some numerical under/overflow issue.)

Angel-Popa commented 6 years ago

Sorry for the mistake, I should have erased this issue as it is not a software issue. I made the mistake of using runeems_sats instead of runeems_snps. My dataset is generated using a genotype by sequencing method and although the dataset can be transformed into a microarray data with presence/absence format it does not reflects microsatellites (here is more information of the method for sequencing.

I did use runeems_snps and it worked it. :) However, I have a question, I calculated a dissimilarity matrix, using the R function bitwise.dist() from the package poppr but I am not very confident that this matrix is conditionally negative definite.

qfly-nH-nIndiv143-nSites3774.zip

Given that the results I have obtained are consistent with Discriminant Analysis of Principal Components. How would it affect the results if the matrix is not negative definite?

dipetkov commented 6 years ago

The runeems_snps implementation checks whether the input matrix is a valid distance matrix and it would print an error message if that's not the case. Here is a simple function in R does makes exactly the same check:

input <- read.table("qfly-nH-nIndiv143-nSites3774.diffs")

# Function to check if a matrix D is a valid Euclidean distance matrix
# by checking whether the matrix is conditionally negative definite:
# D must be symmetric, nonnegative, with diagonal of 0s,
# with 1 positive eigenvalue and n-1 negative eigenvalues
is_distmat <- function(D) {
  D <- as.matrix(D)
  a <- min(D >= 0)                       # Check that D is nonnegative
  b <- min(diag(D) >= 0 & diag(D) <= 0 ) # Check that D has a diagonal of 0s
  c <- isSymmetric(unname(D))            # Check that D is symmetric
  n <- nrow(D)
  val <- eigen(D)$values                 # Compute the eigenvalues
  tol <- 1e-12
  negative <- sum(val < -tol)            # How many eigenvalues are positive?
  positive <- sum(val > +tol)            # How many eigenvalues are positive?
  (a & b & c) & (positive == 1) & (negative == n-1)
}

is_distmat(input)
#> TRUE

The documentation specifies that the function poppr::bitwise.dist computes the Hamming distance. This is the number (or the average) of element-wise differences, with care taken as to how missing data is treated.

Here is a very simple example to illustrate how the Hamming distance and the Euclidean distance differ intuitively. [Euclidean distance is what the small program bed2diffs, also available in this repo, computes.]

hamming.distance <- function(x, y) {
  stopifnot(is.numeric(x),
            is.numeric(y),
            length(x) == length(y))
  mean(x != y)
}
euclidean.distance <- function(x, y) {
  stopifnot(is.numeric(x),
            is.numeric(y),
            length(x) == length(y))
  mean((x - y)^2)
}

x <- c(0, 0, 0)
y <- c(0, 0, 1)
z <- c(0, 0, 2)
hamming.distance(x, y)
#> 0.3333
hamming.distance(x, z)
#> 0.3333

euclidean.distance(x, y)
#> 0.3333
euclidean.distance(x, z)
#> 1.333

This shows that Euclidean distance puts more emphasis on large differences in some coordinates than the Hamming distance.