Closed nabil161289 closed 1 year ago
@alexlancaster
The calculated linkage measures are more consistent with unphased LD The output of PyPOP is compared to pould (Phased or Unphased Linkage Disequilibrium), -an R package that calculates phased and unphased LD and compares D' and Wn in phased and unphased data via a sign test- and the LD metrics are closer to the values of the unphased data.
Hi @nabil161289 - thanks for the report. I am pretty sure that PyPop (and the [Emhaplofreq]
module, specifically), does not treat the alleles in the columns _1
and _2
as if they were phased for LD and haplotype estimation purposes. Nor is there currently a flag to treat them as if they were phased data, as far as I know. The primary author of that module, @rsingle should be able to confirm that. It is in the documentation, here in bold, below
http://pypop.org/docs/guide-chapter-usage.html#sample-files
Note that for genotype data, each locus corresponds to two columns in the population file. The locus name must repeated, with a suffix such as
_1
,_2
(the default) or_a
,_b
and must match the format defined in the config.ini (see validSampleFields). Although PyPop needs this distinction to be made, phase is NOT assumed, and if known it is ignored.
although it could probably be made more clear. So this issue is probably mostly a documentation one. Adding support for phased genotypes would be considered an enhancement.
To Reproduce: Steps to reproduce the PyPOP output:
1. Install PyPOP via cmd: pip install pypop-genomics 2. Download the minimal.ini and the USAFEL-UchiTelle-small.pop (available from https://github.com/alexlancaster/pypop#examples) 3. From cmd, browse to files location, then execute: pypop -c tests/data/minimal.ini tests/data/USAFEL-UchiTelle-small.pop 4. Examine the USAFEL-UchiTelle-small-out.txt Steps to reproduce the results using "pould":
...
Screenshots Expected PyPOP output
Expected "pould" output, unphased:
Expected "pould" output, phased:
Thanks for the detailed description for how to the LD estimation via pould
- that is a useful tool to compare our results to. And because PyPop assumes everything is unphased, would explain why the output from PyPop shows LD values similar to the one's from pould
in R. It's reassuring that they are very close - although obviously they will be slightly different due to slightly different random number seeds, etc.
I added a proposed extra note in a new documentation branch - will convert to a PR if this looks good, @rsingle
confirmed by @rsingle in https://github.com/alexlancaster/pypop/issues/151#issuecomment-1786332474 - that data is assumed to be unphased, but we will update the docs to make this even more clear.
As far as I recall, PyPop has never generated phased LD values.
As far as I recall, PyPop has never generated phased LD values.
That is correct. I believe the idea is that once we port to haplo.stats
we can include phased LD as part of it (the implementation would be in Python, rather than the C module).
The calculated linkage measures are more consistent with unphased LD The output of PyPOP is compared to pould (Phased or Unphased Linkage Disequilibrium), -an R package that calculates phased and unphased LD and compares D' and Wn in phased and unphased data via a sign test- and the LD metrics are closer to the values of the unphased data.
To Reproduce: Steps to reproduce the PyPOP output:
The outputs should be found as .csv files in the tempdir() or the working directory.
Screenshots Expected PyPOP output
Expected "pould" output, unphased:
Expected "pould" output, phased:
Device: HP EliteBook 840 G2 Notebook PC OS: Windows 10 pro (Version 22H2) Python version: 3.11 PIP version: 23.2.1 PyPOP version: 1.0.0b7 R version: 4.2.0 Rstudio version: 2023.09.1+494 pould version: 1.0.1