alexlancaster / pypop

PyPop: Python for Population Genomics
http://pypop.org
GNU General Public License v2.0
22 stars 15 forks source link

HWP output issues: missing common genotypes and incorrect Guo & Thompson stats #17

Closed alexlancaster closed 7 years ago

alexlancaster commented 7 years ago

Transferring from issue #4 comment https://github.com/alexlancaster/pypop/issues/4#issuecomment-313516210 originally by @sjmack:

However, I have constructed a test data file (the controls from the BIGDAWG synthetic datafile) that reveals several issues with the current HW implementations (vs version 0.7.0).

I'm attaching two versions of this test file: BIGDAWG_SynthControl_Data.pop.txt and BIGDAWG_SynthControl_Data_dash.pop.txt And the associated .ini file: WS_BDCtrl_Test_HW.ini.txt Be sure to remove the .txt suffices.

The difference between the two datasets is that the _dash.pop file has the colons converted to dashes. I did this so that I could compare the current developmental version of PyPop on my Mac to v0.7.0 running on my PC. I could only run the _dash.pop file on my PC.

I have three set of results. The git. versions were generated using this development version of PyPop, and the 070. versions with the current release version.

First of all, you will notice that there are no Common Genotypes being generated with the development version. The results below show the dash datasets, but the same happens when colons are included for the developmental version.

Compare (Git):

------------------------------------------------------------------------------
Common heterozygotes by allele
       01-01-01-01         152      152.25        0.00        0.9839      
       02-01-01-01          56       56.32        0.00        0.9658      
          02-05-01         132      131.94        0.00        0.9957      
       03-01-01-01         135      134.51        0.00        0.9662      
          03-01-03         102      102.18        0.00        0.9858      
       11-01-01-01          57       57.26        0.00        0.9723      
       11-01-01-02          57       57.26        0.00        0.9723      
          23-01-01          43       43.99        0.02        0.8814      
       24-02-01-01         145      144.70        0.00        0.9801      
          24-02-04          56       56.32        0.00        0.9658      
          25-01-01          38       37.28        0.01        0.9061      
          26-01-01          92       91.40        0.00        0.9501      
             26-08         105      104.85        0.00        0.9885      
       29-01-01-01          56       56.32        0.00        0.9658      
       29-02-01-02         102      102.18        0.00        0.9858      
       31-01-02-01          92       91.40        0.00        0.9501      
       31-01-02-02           9        8.96        0.00        0.9892      
          32-01-01          30       29.55        0.01        0.9342      
             32-02         183      182.44        0.00        0.9667      
          33-01-01           9        8.96        0.00        0.9892      
          68-01-01          76       76.81        0.01        0.9267      
             68-06         155      154.75        0.00        0.9838      

------------------------------------------------------------------------------
Common genotypes
             Total           0        0.00
------------------------------------------------------------------------------

With (v0.7.0):

------------------------------------------------------------------------------
Common heterozygotes by allele
       01-01-01-01         152      152.25        0.00        0.9839      
       02-01-01-01          56       56.32        0.00        0.9658      
          02-05-01         132      131.94        0.00        0.9957      
       03-01-01-01         135      134.51        0.00        0.9662      
          03-01-03         102      102.18        0.00        0.9858      
       11-01-01-01          57       57.26        0.00        0.9723      
       11-01-01-02          57       57.26        0.00        0.9723      
          23-01-01          43       43.99        0.02        0.8814      
       24-02-01-01         145      144.70        0.00        0.9801      
          24-02-04          56       56.32        0.00        0.9658      
          25-01-01          38       37.28        0.01        0.9061      
          26-01-01          92       91.40        0.00        0.9501      
             26-08         105      104.85        0.00        0.9885      
       29-01-01-01          56       56.32        0.00        0.9658      
       29-02-01-02         102      102.18        0.00        0.9858      
       31-01-02-01          92       91.40        0.00        0.9501      
       31-01-02-02           9        8.96        0.00        0.9892      
          32-01-01          30       29.55        0.01        0.9342      
             32-02         183      182.44        0.00        0.9667      
          33-01-01           9        8.96        0.00        0.9892      
          68-01-01          76       76.81        0.01        0.9267      
             68-06         155      154.75        0.00        0.9838      

------------------------------------------------------------------------------
Common genotypes
-01-01:01-01-01-01           7        6.88        0.00        0.9621      
-01-01-01:02-05-01          11       11.76        0.05        0.8241      
-01-01:03-01-01-01          12       12.01        0.00        0.9975      
-01-01-01:03-01-03           9        8.95        0.00        0.9856      
-01-01:24-02-01-01          13       13.00        0.00        0.9989      
-01-01-01:26-01-01           8        7.95        0.00        0.9864      
 01-01-01-01:26-08           9        9.19        0.00        0.9488      
-01-01:29-02-01-02           9        8.95        0.00        0.9856      
-01-01:31-01-02-01           8        7.95        0.00        0.9864      
 01-01-01-01:32-02          16       16.82        0.04        0.8424      
-01-01-01:68-01-01           7        6.63        0.02        0.8847      
 01-01-01-01:68-06          14       14.00        0.00        0.9998      
 02-01-01-01:32-02           6        5.88        0.00        0.9590      
 02-05-01:02-05-01           5        5.03        0.00        0.9890      
-05-01:03-01-01-01          10       10.27        0.01        0.9318      
 02-05-01:03-01-03           8        7.65        0.02        0.9001      
-05-01:24-02-01-01          11       11.12        0.00        0.9702      
 02-05-01:26-01-01           7        6.80        0.01        0.9396      
    02-05-01:26-08           8        7.87        0.00        0.9617      
-05-01:29-02-01-02           8        7.65        0.02        0.9001      
-05-01:31-01-02-01           7        6.80        0.01        0.9396      
    02-05-01:32-02          14       14.38        0.01        0.9193      
 02-05-01:68-01-01           6        5.67        0.02        0.8893      
    02-05-01:68-06          12       11.98        0.00        0.9942      
-01-01:03-01-01-01           5        5.25        0.01        0.9146      
-01-01-01:03-01-03           8        7.81        0.00        0.9471      
-01-01:24-02-01-01          12       11.36        0.04        0.8493      
-01-01-01:26-01-01           7        6.95        0.00        0.9837      
 03-01-01-01:26-08           8        8.03        0.00        0.9911      
-01-01:29-02-01-02           8        7.81        0.00        0.9471      
-01-01:31-01-02-01           7        6.95        0.00        0.9837      
 03-01-01-01:32-02          15       14.69        0.01        0.9351      
-01-01-01:68-01-01           6        5.79        0.01        0.9299      
 03-01-01-01:68-06          12       12.23        0.00        0.9480      
-01-03:24-02-01-01           8        8.46        0.03        0.8741      
 03-01-03:26-01-01           5        5.17        0.01        0.9391      
    03-01-03:26-08           6        5.98        0.00        0.9941      
-01-03:29-02-01-02           6        5.82        0.01        0.9406      
-01-03:31-01-02-01           5        5.17        0.01        0.9391      
    03-01-03:32-02          11       10.94        0.00        0.9856      
    03-01-03:68-06           9        9.11        0.00        0.9715      
 11-01-01-01:32-02           6        5.98        0.00        0.9923      
 11-01-01-02:32-02           6        5.98        0.00        0.9923      
-01-01:24-02-01-01           6        6.15        0.00        0.9518      
-02-01-01:26-01-01           8        7.52        0.03        0.8613      
 24-02-01-01:26-08           9        8.70        0.01        0.9179      
-01-01:29-02-01-02           8        8.46        0.03        0.8741      
-01-01:31-01-02-01           8        7.52        0.03        0.8613      
 24-02-01-01:32-02          16       15.90        0.00        0.9807      
-02-01-01:68-01-01           6        6.27        0.01        0.9149      
 24-02-01-01:68-06          13       13.24        0.00        0.9474      
    24-02-04:32-02           6        5.88        0.00        0.9590      
    26-01-01:26-08           5        5.32        0.02        0.8905      
-01-01:29-02-01-02           5        5.17        0.01        0.9391      
    26-01-01:32-02          10        9.72        0.01        0.9296      
    26-01-01:68-06           8        8.10        0.00        0.9731      
 26-08:29-02-01-02           6        5.98        0.00        0.9941      
 26-08:31-01-02-01           5        5.32        0.02        0.8905      
       26-08:32-02          11       11.24        0.01        0.9420      
       26-08:68-06           9        9.36        0.01        0.9061      
 29-01-01-01:32-02           6        5.88        0.00        0.9590      
-01-02:31-01-02-01           5        5.17        0.01        0.9391      
 29-02-01-02:32-02          11       10.94        0.00        0.9856      
 29-02-01-02:68-06           9        9.11        0.00        0.9715      
 31-01-02-01:32-02          10        9.72        0.01        0.9296      
 31-01-02-01:68-06           8        8.10        0.00        0.9731      
       32-02:32-02          10       10.28        0.01        0.9300      
    32-02:68-01-01           8        8.10        0.00        0.9709      
       32-02:68-06          17       17.12        0.00        0.9770      
    68-01-01:68-06           7        6.75        0.01        0.9223      
       68-06:68-06           7        7.13        0.00        0.9624      
             Total         612      612.81
------------------------------------------------------------------------------

In addition, the stats being reported for the developmental version include errors; especially for the Chen and Diff tests, where obs and exp values are 0. The differences in the p-values for the mcmc results probably stem from the Markov-Chain, but I only did each one once, so I'm not certain.

Compare (Git):

3.3. Guo and Thompson HardyWeinberg output (mcmc) [DQB1]
--------------------------------------------------------
Total steps in MCMC: 1000000
Dememorization steps: 2000
Number of Markov chain samples: 1000
Markov chain sample size: 1000
Std. error:       0
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12+03-02-12 (0/0.000000) 0.0007*** 0.0007***
05-03-01-01+03-02-12 (0/0.000000) 0.0000***** 0.0000*****
05-03-01-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+06-02-01 (0/0.000000) 0.0001*** 0.0001***
06-03-01+04-02-01 (0/0.000000) 0.0000***** 0.0000*****
06-05-01+05-03-01-01 (0/0.000000) 0.0000**** 0.0000*****
06-05-01+06-05-01 (0/0.000000) 0.0049** 0.0049**

3.4. Guo and Thompson HardyWeinberg output (monte-carlo) [DQB1]
---------------------------------------------------------------
Steps in Monte-Carlo randomization: 100000
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12+03-02-12 (0/0.000000) 0.0017** 0.0017**
05-03-01-01+03-02-12 (0/0.000000) 0.0000***** 0.0000*****
05-03-01-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+06-02-01 (0/0.000000) 0.0007*** 0.0007***
06-03-01+04-02-01 (0/0.000000) 0.0000***** 0.0000*****
06-05-01+05-03-01-01 (0/0.000000) 0.0000**** 0.0000****
06-05-01+06-05-01 (0/0.000000) 0.0016** 0.0016**

to (version 0.7.0):

3.3. Guo and Thompson HardyWeinberg output (mcmc) [DQB1]
--------------------------------------------------------
Total steps in MCMC: 1000000
Dememorization steps: 2000
Number of Markov chain samples: 1000
Markov chain sample size: 1000
Std. error:       0
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12:03-02-12:  (18/8.818363) 0.0017** 0.0017**
05-03-01-01:03-02-12:  (0/18.105788) 0.0000***** 0.0000*****
05-03-01-01:05-03-01-01:  (39/9.293663) 0.0000***** 0.0000*****
06-02-01:05-03-01-01:  (0/23.499002) 0.0000***** 0.0000*****
06-02-01:06-02-01:  (27/14.854291) 0.0000***** 0.0000*****
06-03-01:04-02-01:  (8/0.287425) 0.0000***** 0.0000*****
06-05-01:05-03-01-01:  (0/18.105788) 0.0000***** 0.0000*****
06-05-01:06-05-01:  (18/8.818363) 0.0011** 0.0011**

3.4. Guo and Thompson HardyWeinberg output (monte-carlo) [DQB1]
---------------------------------------------------------------
Steps in Monte-Carlo randomization: 100000
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12:03-02-12:  (18/8.818363) 0.0017** 0.0017**
05-03-01-01:03-02-12:  (0/18.105788) 0.0000***** 0.0000*****
05-03-01-01:05-03-01-01:  (39/9.293663) 0.0000***** 0.0000*****
06-02-01:05-03-01-01:  (0/23.499002) 0.0000***** 0.0000*****
06-02-01:06-02-01:  (27/14.854291) 0.0007*** 0.0007***
06-03-01:04-02-01:  (8/0.287425) 0.0000***** 0.0000*****
06-05-01:05-03-01-01:  (0/18.105788) 0.0000**** 0.0000****
06-05-01:06-05-01:  (18/8.818363) 0.0016** 0.0016**

Here are the results:

BIGDAWG_SynthControl_Data-out.git.txt BIGDAWG_SynthControl_Data-out.git.xml.txt BIGDAWG_SynthControl_Data_dash-out.git.txt BIGDAWG_SynthControl_Data_dash-out.git.xml.txt BIGDAWG_SynthControl_Data_dash-out.070.txt BIGDAWG_SynthControl_Data_dash-out.070.xml.txt

Again, remove the .txt from the .XML filenames.

alexlancaster commented 7 years ago

@sjmack: the problem had a common origin: generating the genotype output table, the fix I just pushed to master (i.e. just git pull and ./setup.py build again) should fix both the common genotype as well as the output in the Guo & Thompson. I added one of your files (the non-dash one) to the test suite (can you re-run py.test -s -v?). Is it worth also adding the dash version as well?

In any case, good catch! more test data files like this will improve our overall test coverage, and catch more issues like this. I would encourage @kosoegawa and others to please open up issues and add files like this.

sjmack commented 7 years ago

The dash version is only useful for testing against older versions of PyPop, but that is necessary for validation, so ....

I'll run some tests!

alexlancaster commented 7 years ago

I'll add to the test suite as well then, since it should handle dashes as well as colons. Right now, we would have problems only if you used the genotype separator (which I've hardcoded as a tilde ~) within the allele identifier. Eventually even this should be removed (see #14) so there would be no "special" character that you would have to avoid, but this would be a more major internal architectural change.

alexlancaster commented 7 years ago

@sjmack is this issue fixed from your POV? if so, I'll close.

sjmack commented 7 years ago

The missing common genotypes and incorrect stats issues are resolved; however Issue #19 discusses an additional issue with the common genotypes (which I should have noted using the *_dash.pop version of the data, but didn't).

py.test -s -v also shows some fails as py.test did earlier.

:pypop sjmack$ py.test -s -v
================================================================= test session starts ==================================================================
platform darwin -- Python 2.7.13, pytest-3.1.2, py-1.4.34, pluggy-0.4.0 -- /opt/local/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
cachedir: .cache
rootdir: /Applications/PyPop/pypop, inifile:
collected 5 items 

tests/test_AlleleColon.py::test_AlleleColon_HardyWeinberg PASSED
tests/test_AlleleColon.py::test_AlleleColon_Emhaplofreq PASSED
tests/test_GenotypeCommon.py::test_GenotypeCommon_HardyWeinberg FAILED
tests/test_GenotypeCommon.py::test_GenotypeCommonDash_HardyWeinberg FAILED
tests/test_gthwe.py::test_gthwe SKIPPED

======================================================================= FAILURES =======================================================================
__________________________________________________________ test_GenotypeCommon_HardyWeinberg ___________________________________________________________

    def test_GenotypeCommon_HardyWeinberg():
        exit_code = base.run_pypop_process('./tests/data/WS_BDCtrl_Test_HW.ini', './tests/data/BIGDAWG_SynthControl_Data.pop')
        # check exit code
        assert exit_code == 0
        # compare with md5sum of output file
>       assert hashlib.md5(open("BIGDAWG_SynthControl_Data-out.txt", 'rb').read()).hexdigest() == '276263b0d0d9fc03b77826388d70510d'
E       AssertionError: assert 'c0d952cbc16e...3c849ea4e2095' == '276263b0d0d9f...826388d70510d'
E         - c0d952cbc16e90f5bd03c849ea4e2095
E         + 276263b0d0d9fc03b77826388d70510d

tests/test_GenotypeCommon.py:11: AssertionError
________________________________________________________ test_GenotypeCommonDash_HardyWeinberg _________________________________________________________

    def test_GenotypeCommonDash_HardyWeinberg():
        exit_code = base.run_pypop_process('./tests/data/WS_BDCtrl_Test_HW.ini', './tests/data/BIGDAWG_SynthControl_Data_dash.pop')
        # check exit code
        assert exit_code == 0
        # compare with md5sum of output file
>       assert hashlib.md5(open("BIGDAWG_SynthControl_Data_dash-out.txt", 'rb').read()).hexdigest() == 'b0f4247a2a67a65d0109b6448427cc28'
E       AssertionError: assert '93ff300c62a3...bd001ce8c592b' == 'b0f4247a2a67a...9b6448427cc28'
E         - 93ff300c62a3056c4cdbd001ce8c592b
E         + b0f4247a2a67a65d0109b6448427cc28

tests/test_GenotypeCommon.py:18: AssertionError
=================================================== 2 failed, 2 passed, 1 skipped in 157.40 seconds ====================================================
alexlancaster commented 7 years ago

I'll respin the tests as per my comment in #4 and if that works, I'll close this particular issue.

alexlancaster commented 7 years ago

Hi @sjmack, let me know if the tests look OK and I'll close.

sjmack commented 7 years ago

py.test: 11 passed, 1 skipped in 75.75 seconds.

Looks good.