dipetkov / eems

Estimating Effective Migration Surfaces
GNU General Public License v2.0
102 stars 28 forks source link

eems with ngsDist matrix (dealing with missing data, invariant sites, uneven nSites) #65

Open TeresaPegan opened 2 years ago

TeresaPegan commented 2 years ago

Hi! I would like to try to use eems with a distance matrix I created using ngsDist (https://github.com/fgvieira/ngsDist). The genetic data I used with ngsDist to create the matrix are low-coverage whole genome data. I am excited about the possibility of applying eems to my data, but there are several ways in which my data and the resulting distance matrix differ from the expected input for runeems_snps. I was wondering if you have any thoughts about my ability to use eems/interpret its results despite these differences.

Here are the main differences: 1) My distance matrix was created using all sites in a chunk of a chromosome, including invariant sites--not just SNPs. 2) My data contain a lot of missing genotypes (due to the fact that they are low-coverage). ngsDist is built to handle this kind of thing and calculates genetic distance using genotype likelihoods. For each pair of individuals, it only considers sites where both individuals had reads. 3) Because of 2), there is no single "nSites" that applies to all pairs. The number of sites used for each pair depends on how much missing data the 2 individuals had.

If I just use an arbitrary nSites (ignoring missing data), I can successfully run eems and get a result. I am not sure if the result is interpretable though -- for example, I am attaching a few of the resulting plots, and they show very little information. I am not sure if this might be related to the fact that I am not supplying data in the intended way, and/or my arbitrary choice of nSites. Please note that this was a relatively short run to see what happened (200k iterations) and I would run it longer for a final result.

Let me know if anyone has any advice! Thanks! -Teresa out-mrates01

out-mrates02

Params file (without the paths). The nSites I supplied is really arbitrary, I don't have information from ngsDist about how many sites were retained/discarded in each case, though I could calculate it myself if need be.

nIndiv = 41
nSites = 100000
nDemes = 1000
diploid = true
numMCMCIter = 200000
numBurnIter = 100000
numThinIter = 999