alexlancaster / pypop

PyPop: Python for Population Genomics
http://pypop.org
GNU General Public License v2.0
22 stars 15 forks source link

PyPOP LD: unclear whether LD is phased or unphased #153

Closed nabil161289 closed 1 year ago

nabil161289 commented 1 year ago

The calculated linkage measures are more consistent with unphased LD The output of PyPOP is compared to pould (Phased or Unphased Linkage Disequilibrium), -an R package that calculates phased and unphased LD and compares D' and Wn in phased and unphased data via a sign test- and the LD metrics are closer to the values of the unphased data.

To Reproduce: Steps to reproduce the PyPOP output:

  1. Install PyPOP via cmd: pip install pypop-genomics
  2. Download the minimal.ini and the USAFEL-UchiTelle-small.pop (available from https://github.com/alexlancaster/pypop#examples)
  3. From cmd, browse to files location, then execute: pypop -c tests/data/minimal.ini tests/data/USAFEL-UchiTelle-small.pop
  4. Examine the USAFEL-UchiTelle-small-out.txt Steps to reproduce the results using "pould":
library(pould)
## Input data
USAFEL <- matrix(c(NA,  NA, "01:02",    "02:025",   NA, NA,
"01:01",    NA, "03:07",    "06:05", "13:01",   "13:01",
"02:10",    "03:012",   "07:12",    "01:02",    "13:01",    "13:01",
"01:01",    "02:18",    "08:04", "12:02",   "13:01",    "13:01",
"25:01",    "02:01",    "15:07",    "03:07",    "13:01",    "13:01",
"02:10",    "32:04", "18:01",   "01:02", "13:01", "13:01",
"03:012",   "32:04",    "15:07",    "06:05",    "13:01",    "13:01",
"25:01",    "32:04",    "03:07",    "03:07",    "13:01",    "13:01",
"68:14",    "02:01",    "01:02",    "07:12",    "13:01",    "13:01",
"02:01",    "02:01",    "12:02",    "02:025",   "13:01",    "13:01"), 10, 6, TRUE)
colnames(USAFEL) <- c("A", "A", "C", "C", "B", "B")
USAFEL<- as.data.frame(USAFEL)
LDWrap(USAFEL, phased = FALSE, writeTo = getwd(), frameName = "USAFEL_data", threshold = 5)
LDWrap(USAFEL, phased = TRUE, writeTo = getwd(), frameName = "USAFEL_data", threshold = 5)

The outputs should be found as .csv files in the tempdir() or the working directory.

Screenshots Expected PyPOP output image

Expected "pould" output, unphased: image

Expected "pould" output, phased: image

Device: HP EliteBook 840 G2 Notebook PC OS: Windows 10 pro (Version 22H2) Python version: 3.11 PIP version: 23.2.1 PyPOP version: 1.0.0b7 R version: 4.2.0 Rstudio version: 2023.09.1+494 pould version: 1.0.1

nabil161289 commented 1 year ago

@alexlancaster

alexlancaster commented 1 year ago

The calculated linkage measures are more consistent with unphased LD The output of PyPOP is compared to pould (Phased or Unphased Linkage Disequilibrium), -an R package that calculates phased and unphased LD and compares D' and Wn in phased and unphased data via a sign test- and the LD metrics are closer to the values of the unphased data.

Hi @nabil161289 - thanks for the report. I am pretty sure that PyPop (and the [Emhaplofreq] module, specifically), does not treat the alleles in the columns _1 and _2 as if they were phased for LD and haplotype estimation purposes. Nor is there currently a flag to treat them as if they were phased data, as far as I know. The primary author of that module, @rsingle should be able to confirm that. It is in the documentation, here in bold, below

http://pypop.org/docs/guide-chapter-usage.html#sample-files

Note that for genotype data, each locus corresponds to two columns in the population file. The locus name must repeated, with a suffix such as _1, _2(the default) or _a, _b and must match the format defined in the config.ini (see validSampleFields). Although PyPop needs this distinction to be made, phase is NOT assumed, and if known it is ignored.

although it could probably be made more clear. So this issue is probably mostly a documentation one. Adding support for phased genotypes would be considered an enhancement.

To Reproduce: Steps to reproduce the PyPOP output:

1. Install PyPOP via cmd: pip install pypop-genomics

2. Download the minimal.ini and the USAFEL-UchiTelle-small.pop (available from https://github.com/alexlancaster/pypop#examples)

3. From cmd, browse to files location, then execute: pypop -c tests/data/minimal.ini tests/data/USAFEL-UchiTelle-small.pop

4. Examine the USAFEL-UchiTelle-small-out.txt
   Steps to reproduce the results using "pould":

...

Screenshots Expected PyPOP output image

Expected "pould" output, unphased: image

Expected "pould" output, phased: image

Thanks for the detailed description for how to the LD estimation via pould - that is a useful tool to compare our results to. And because PyPop assumes everything is unphased, would explain why the output from PyPop shows LD values similar to the one's from pould in R. It's reassuring that they are very close - although obviously they will be slightly different due to slightly different random number seeds, etc.

alexlancaster commented 1 year ago

I added a proposed extra note in a new documentation branch - will convert to a PR if this looks good, @rsingle

alexlancaster commented 1 year ago

confirmed by @rsingle in https://github.com/alexlancaster/pypop/issues/151#issuecomment-1786332474 - that data is assumed to be unphased, but we will update the docs to make this even more clear.

sjmack commented 12 months ago

As far as I recall, PyPop has never generated phased LD values.

alexlancaster commented 10 months ago

As far as I recall, PyPop has never generated phased LD values.

That is correct. I believe the idea is that once we port to haplo.stats we can include phased LD as part of it (the implementation would be in Python, rather than the C module).