alexlancaster / pypop

PyPop: Python for Population Genomics

http://pypop.org

GNU General Public License v2.0

22 stars 15 forks source link

Use haplo.stats to estimate haplotypes #28

Closed alexlancaster closed 7 years ago

alexlancaster commented 7 years ago

As an alternative to emhaplofreq which has limitations of number of loci and number of individuals @rsingle and I are working on the haplo-stats branch: https://github.com/alexlancaster/pypop/tree/haplo-stats to wrap the C module haplo_em_pin for estimating haplotypes (originally part of the R package haplo.stats and licensed under the GPL). @sjmack has previously tested haplo.stats outside the PyPop context.

alexlancaster commented 7 years ago

@sjmack the haplo-stats branch is ready for some initial testing:

  git pull
  git checkout haplo-stats
  git pull
  ./setup.py build

You can run one of the tests:

 py.test -s -v tests/test_Haplostats.py

This will print some diagnostic information with output etc.

Caveats: this is not yet integrated with the .ini file or the txt/XML output, this is very far from being finished. There is quite a bit of work to go make sure all the measures line up and computing the correct haplotype names and frequency information. The way that haplo-stats calculates/represents haplotypes is quite different to emhaplofreq. But this prototype shows that, in principle, we can get all the information we need at least for haplotype frequency (LD is a different matter, @rsingle can maybe comment on that). Testing very large data files is yet another step beyond that again.

sjmack commented 7 years ago

Heres what I got. That looks like haplo.stats output!

================================================================== test session starts ===================================================================
platform darwin -- Python 2.7.13, pytest-3.1.2, py-1.4.34, pluggy-0.4.0 -- /opt/local/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
cachedir: .cache
rootdir: /Applications/PyPop/pypop, inifile:
collected 3 items 

tests/test_Haplostats.py::test_Haplostats_Simple START inside C program
n_loci: 2
i hap_prob[i]   u_hap_code[i]   k   u_hap[k]
 0   0.05000         0          0    1    2 
 1   0.05000         1          2    1    7 
 2   0.05000         2          4    2    1 
 3   0.05000         3          6    2    3 
 4   0.05000         4          8    3    6 
 5   0.05000         5         10    3    7 
 6   0.15000         6         12    4    1 
 7   0.15000         7         14    4    3 
 8   0.05000         8         16    5    4 
 9   0.05000         9         18    5    5 
10   0.05000        10         20    6    4 
11   0.05000        11         22    6    5 
12   0.05000        12         24    6    6 
13   0.05000        13         26    6    7 
14   0.05000        14         28    7    2 
15   0.05000        15         30    7    7 
hap1   hap2
  5   12
  4   13
  7    2
  3    6
  0   15
  1   14
  7    6
  8   11
 10    9
END inside C program
PASSED
tests/test_Haplostats.py::test_Haplostats_Simple3 START inside C program
n_loci: 3
i hap_prob[i]   u_hap_code[i]   k   u_hap[k]
 0   0.11000         7          0    1    3    2 
 1   0.01000        11          3    1    3    9 
 2   0.01000        14          6    1    3   20 
 3   0.00500        17          9    1    5    3 
 4   0.04000        19         12    1    5    9 
 5   0.00500        20         15    1    5   14 
 6   0.01000        27         18    2    1    5 
 7   0.06000        37         21    2    4    9 
 8   0.02000        40         24    2    4   17 
 9   0.02000        42         27    2    5    1 
10   0.01000        46         30    2    5   17 
11   0.00250        47         33    2    6   12 
12   0.00250        48         36    2    6   13 
13   0.01000        50         39    2    6   18 
14   0.01000        53         42    2    8    5 
15   0.01000        54         45    2    8    6 
16   0.02000        55         48    2    8    7 
17   0.01000        57         51    2    8   11 
18   0.00250        58         54    2    8   12 
19   0.00250        59         57    2    8   13 
20   0.01000        64         60    2    9    9 
21   0.01000        73         63    2   10   19 
22   0.01000        85         66    3    4   10 
23   0.01000        86         69    3    4   12 
24   0.01000        87         72    3    4   13 
25   0.05000        88         75    3    4   16 
26   0.01000        94         78    3    8   18 
27   0.00500        99         81    4    5    3 
28   0.00500       100         84    4    5   14 
29   0.01000       103         87    4   10   15 
30   0.00500       106         90    5    4    6 
31   0.00500       107         93    5    4   16 
32   0.00000       108         96    5    6    1 
33   0.01000       109         99    5    6    4 
34   0.00500       110        102    5    6    6 
35   0.00250       111        105    5    6   12 
36   0.00250       112        108    5    6   13 
37   0.00500       113        111    5    6   16 
38   0.00250       114        114    5    8   12 
39   0.00250       115        117    5    8   13 
40   0.01000       119        120    6    1    5 
41   0.04000       120        123    6    1    6 
42   0.01000       121        126    6    1   12 
43   0.01000       122        129    6    1   15 
44   0.01000       123        132    6    1   16 
45   0.01000       132        135    6    7    7 
46   0.01000       136        138    7    2    5 
47   0.01000       143        141    8   10   13 
48   0.01000       144        144    9    8    9 
49   0.09000       149        147   10    2    1 
50   0.03000       151        150   10    2    4 
51   0.01000       152        153   10    2    5 
52   0.01000       153        156   10    2    6 
53   0.02000       155        159   10    2    9 
54   0.01000       177        162   10    6    6 
55   0.01000       184        165   10    9   12 
56   0.01000       194        168   11    2    6 
57   0.01000       205        171   11    9    1 
58   0.01000       208        174   11    9    8 
59   0.03000       209        177   11    9    9 
60   0.02000       211        180   11    9   13 
61   0.03000       214        183   11    9   18 
62   0.00000       215        186   11    9   20 
63   0.01000       218        189   12    2    1 
64   0.01000       227        192   12    9    1 
65   0.01000       229        195   12    9   14 
66   0.01000       230        198   12    9   16 
67   0.01000       232        201   12    9   19 
hap1   hap2
 40   94
149   19
 27  214
 42   19
 47  115
 58  112
 48  114
 59  111
  7   19
232  149
144   64
149  194
119  120
 37   88
152   87
214  209
 88  149
205  121
120  184
 46  211
 73  149
149  109
108  151
 40  230
136   37
  7   11
  7  123
106  113
110  107
218   86
227  120
214   55
229  153
 53    7
 57  149
 42  103
208   88
120  151
177    7
 54   55
 88  211
149  132
209   37
 37   37
209   14
 11  215
155  149
 17  100
 20   99
  7    7
 85  155
 50   88
 19  151
  7    7
  7  122
151    7
 37  143
END inside C program
PASSED
tests/test_Haplostats.py::test_Haplostats_PyPopStringMatrix START inside C program
n_loci: 2
i hap_prob[i]   u_hap_code[i]   k   u_hap[k]
 0   0.05000         0          0    1    2 
 1   0.05000         1          2    1    7 
 2   0.05000         2          4    2    1 
 3   0.05000         3          6    2    3 
 4   0.05000         4          8    3    6 
 5   0.05000         5         10    3    7 
 6   0.15000         6         12    4    1 
 7   0.15000         7         14    4    3 
 8   0.05000         8         16    5    4 
 9   0.05000         9         18    5    5 
10   0.05000        10         20    6    4 
11   0.05000        11         22    6    5 
12   0.05000        12         24    6    6 
13   0.05000        13         26    6    7 
14   0.05000        14         28    7    2 
15   0.05000        15         30    7    7 
hap1   hap2
  5   12
  4   13
  7    2
  3    6
  0   15
  1   14
  7    6
  8   11
 10    9
END inside C program
PASSED

================================================================ 3 passed in 1.78 seconds ================================================================

alexlancaster commented 7 years ago

Excellent, so we know now it works on MacOS as well.

alexlancaster commented 7 years ago

@sjmack try a new git pull in the haplo-stats branch and try the same test. You can also try ./haplo-stats/haplo-test.py which more or less does the same thing. We can now reconstruct the haplotypes using the original identifiers.

If you have the R haplo.stats package installed, you should be able to reproduce the same values with the following commands:

library(haplo.stats)
control = haplo.em.control(n.try=1)
data(hla.demo)
attach(hla.demo)
geno = hla.demo[1:5,c(21:24)]
label <-c("DRB","B")
save.em <- haplo.em(geno=geno, locus.label=label, control=control)

The haplotypes are in the save.em$haplotype variable (and other similar variables).

sjmack commented 7 years ago

Okay. here are the results of py.test -s -v tests/test_Haplostats.py

================================================================== test session starts ===================================================================
platform darwin -- Python 2.7.13, pytest-3.1.2, py-1.4.34, pluggy-0.4.0 -- /opt/local/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
cachedir: .cache
rootdir: /Applications/PyPop/pypop, inifile:
collected 3 items 

tests/test_Haplostats.py::test_Haplostats_Simple START inside C program
n_loci: 2
i hap_prob[i]   u_hap_code[i]   k   u_hap[k]
 0   0.05000         0          0    1    2 
 1   0.05000         1          2    1    7 
 2   0.05000         2          4    2    1 
 3   0.05000         3          6    2    3 
 4   0.05000         4          8    3    6 
 5   0.05000         5         10    3    7 
 6   0.15000         6         12    4    1 
 7   0.15000         7         14    4    3 
 8   0.05000         8         16    5    4 
 9   0.05000         9         18    5    5 
10   0.05000        10         20    6    4 
11   0.05000        11         22    6    5 
12   0.05000        12         24    6    6 
13   0.05000        13         26    6    7 
14   0.05000        14         28    7    2 
15   0.05000        15         30    7    7 
hap1   hap2
  5   12
  4   13
  7    2
  3    6
  0   15
  1   14
  7    6
  8   11
 10    9
END inside C program
PASSED
tests/test_Haplostats.py::test_Haplostats_Simple3 START inside C program
n_loci: 3
i hap_prob[i]   u_hap_code[i]   k   u_hap[k]
 0   0.11000         7          0    1    3    2 
 1   0.01000        11          3    1    3    9 
 2   0.01000        14          6    1    3   20 
 3   0.00500        17          9    1    5    3 
 4   0.04000        19         12    1    5    9 
 5   0.00500        20         15    1    5   14 
 6   0.01000        27         18    2    1    5 
 7   0.06000        37         21    2    4    9 
 8   0.02000        40         24    2    4   17 
 9   0.02000        42         27    2    5    1 
10   0.01000        46         30    2    5   17 
11   0.00250        47         33    2    6   12 
12   0.00250        48         36    2    6   13 
13   0.01000        50         39    2    6   18 
14   0.01000        53         42    2    8    5 
15   0.01000        54         45    2    8    6 
16   0.02000        55         48    2    8    7 
17   0.01000        57         51    2    8   11 
18   0.00250        58         54    2    8   12 
19   0.00250        59         57    2    8   13 
20   0.01000        64         60    2    9    9 
21   0.01000        73         63    2   10   19 
22   0.01000        85         66    3    4   10 
23   0.01000        86         69    3    4   12 
24   0.01000        87         72    3    4   13 
25   0.05000        88         75    3    4   16 
26   0.01000        94         78    3    8   18 
27   0.00500        99         81    4    5    3 
28   0.00500       100         84    4    5   14 
29   0.01000       103         87    4   10   15 
30   0.00500       106         90    5    4    6 
31   0.00500       107         93    5    4   16 
32   0.00000       108         96    5    6    1 
33   0.01000       109         99    5    6    4 
34   0.00500       110        102    5    6    6 
35   0.00250       111        105    5    6   12 
36   0.00250       112        108    5    6   13 
37   0.00500       113        111    5    6   16 
38   0.00250       114        114    5    8   12 
39   0.00250       115        117    5    8   13 
40   0.01000       119        120    6    1    5 
41   0.04000       120        123    6    1    6 
42   0.01000       121        126    6    1   12 
43   0.01000       122        129    6    1   15 
44   0.01000       123        132    6    1   16 
45   0.01000       132        135    6    7    7 
46   0.01000       136        138    7    2    5 
47   0.01000       143        141    8   10   13 
48   0.01000       144        144    9    8    9 
49   0.09000       149        147   10    2    1 
50   0.03000       151        150   10    2    4 
51   0.01000       152        153   10    2    5 
52   0.01000       153        156   10    2    6 
53   0.02000       155        159   10    2    9 
54   0.01000       177        162   10    6    6 
55   0.01000       184        165   10    9   12 
56   0.01000       194        168   11    2    6 
57   0.01000       205        171   11    9    1 
58   0.01000       208        174   11    9    8 
59   0.03000       209        177   11    9    9 
60   0.02000       211        180   11    9   13 
61   0.03000       214        183   11    9   18 
62   0.00000       215        186   11    9   20 
63   0.01000       218        189   12    2    1 
64   0.01000       227        192   12    9    1 
65   0.01000       229        195   12    9   14 
66   0.01000       230        198   12    9   16 
67   0.01000       232        201   12    9   19 
hap1   hap2
 40   94
149   19
 27  214
 42   19
 47  115
 58  112
 48  114
 59  111
  7   19
232  149
144   64
149  194
119  120
 37   88
152   87
214  209
 88  149
205  121
120  184
 46  211
 73  149
149  109
108  151
 40  230
136   37
  7   11
  7  123
106  113
110  107
218   86
227  120
214   55
229  153
 53    7
 57  149
 42  103
208   88
120  151
177    7
 54   55
 88  211
149  132
209   37
 37   37
209   14
 11  215
155  149
 17  100
 20   99
  7    7
 85  155
 50   88
 19  151
  7    7
  7  122
151    7
 37  143
END inside C program
PASSED
tests/test_Haplostats.py::test_Haplostats_PyPopStringMatrix unique_alleles: ['1', '2', '4', '7', '8', '11', '13']
unique_alleles: ['7', '27', '44', '51', '55', '61', '62']
START inside C program
n_loci: 2
i hap_prob[i]   u_hap_code[i]   k   u_hap[k]
 0   0.05000         0          0    1    2 
 1   0.05000         1          2    1    7 
 2   0.05000         2          4    2    1 
 3   0.05000         3          6    2    3 
 4   0.05000         4          8    3    6 
 5   0.05000         5         10    3    7 
 6   0.15000         6         12    4    1 
 7   0.15000         7         14    4    3 
 8   0.05000         8         16    5    4 
 9   0.05000         9         18    5    5 
10   0.05000        10         20    6    4 
11   0.05000        11         22    6    5 
12   0.05000        12         24    6    6 
13   0.05000        13         26    6    7 
14   0.05000        14         28    7    2 
15   0.05000        15         30    7    7 
hap1   hap2
  5   12
  4   13
  7    2
  3    6
  0   15
  1   14
  7    6
  8   11
 10    9
END inside C program
[['1' '27']
 ['1' '62']
 ['2' '7']
 ['2' '44']
 ['4' '61']
 ['4' '62']
 ['7' '7']
 ['7' '44']
 ['8' '51']
 ['8' '55']
 ['11' '51']
 ['11' '55']
 ['11' '61']
 ['11' '62']
 ['13' '27']
 ['13' '62']]
[[ 1  6 13]
 [ 1  5 14]
 [ 2  8  3]
 [ 2  4  7]
 [ 3  1 16]
 [ 3  2 15]
 [ 4  8  7]
 [ 5  9 12]
 [ 5 11 10]]
FAILED

======================================================================== FAILURES ========================================================================
___________________________________________________________ test_Haplostats_PyPopStringMatrix ____________________________________________________________

    def test_Haplostats_PyPopStringMatrix():
        """
        This is the same numerical example as test_Haplostats_Simple()
        except we are setting up via PyPop StringMatrix, and letting the class
        handle all the translation into the low-level variables for the wrapper
        """

        from PyPop.Utils import StringMatrix
        from PyPop.Haplo import Haplostats
        import numpy
        import numpy.testing

        control = {'max_iter': 5000,
                   'min_posterior': 0.000000001,
                   'tol': 0.00001,
                   'insert_batch_size': 2,
                   'random_start': 0,
                   'verbose': 0,
                   'max_haps_limit': 10000 }

        geno = StringMatrix(5, ["DRB", "B"])
        geno[0, 'DRB'] = ('4', '11')
        geno[1, 'DRB'] = ('2', '7')
        geno[2, 'DRB'] = ('1', '13')
        geno[3, 'DRB'] = ('7', '7')
        geno[4, 'DRB'] = ('8', '11')
        geno[0, 'B'] = ('62', '61')
        geno[1, 'B'] = ('7', '44')
        geno[2, 'B'] = ('27', '62')
        geno[3, 'B'] = ('7', '44')
        geno[4, 'B'] = ('51', '55')

        haplo = Haplostats(geno)
        converge, lnlike, n_u_hap, n_hap_pairs, hap_prob, u_hap, u_hap_code, subj_id, post, hap1_code, hap2_code, haplotype = \
                  haplo.estHaplotypes(weight=None, control=control)

        assert converge == 1
        assert lnlike == -20.42316124449607
        assert n_u_hap == 16
        assert n_hap_pairs == 9

        assert hap_prob == [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.15, 0.15, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
        assert u_hap == [1, 2, 1, 7, 2, 1, 2, 3, 3, 6, 3, 7, 4, 1, 4, 3, 5, 4, 5, 5, 6, 4, 6, 5, 6, 6, 6, 7, 7, 2, 7, 7]
        assert u_hap_code == [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
        assert subj_id == [1, 1, 2, 2, 3, 3, 4, 5, 5]
        assert post == [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 0.5, 0.5]
>       assert hap1_code == [6, 5, 3, 4, 1, 2, 7, 9, 10]
E       assert [6, 5, 8, 4, 1, 2, ...] == [6, 5, 3, 4, 1, 2, ...]
E         At index 2 diff: 8 != 3
E         Full diff:
E         - [6, 5, 8, 4, 1, 2, 8, 9, 11]
E         ?        ^           ^      ^
E         + [6, 5, 3, 4, 1, 2, 7, 9, 10]
E         ?        ^           ^      ^

tests/test_Haplostats.py:175: AssertionError
=========================================================== 1 failed, 2 passed in 1.33 seconds ===========================================================

Here are the results of ./haplo-stats/haplo-test.py

unique_alleles: ['1', '2', '4', '7', '8', '11', '13']
unique_alleles: ['7', '27', '44', '51', '55', '61', '62']
START inside C program
n_loci: 2
i hap_prob[i]   u_hap_code[i]   k   u_hap[k]
 0   0.05000         0          0    1    2 
 1   0.05000         1          2    1    7 
 2   0.05000         2          4    2    1 
 3   0.05000         3          6    2    3 
 4   0.05000         4          8    3    6 
 5   0.05000         5         10    3    7 
 6   0.15000         6         12    4    1 
 7   0.15000         7         14    4    3 
 8   0.05000         8         16    5    4 
 9   0.05000         9         18    5    5 
10   0.05000        10         20    6    4 
11   0.05000        11         22    6    5 
12   0.05000        12         24    6    6 
13   0.05000        13         26    6    7 
14   0.05000        14         28    7    2 
15   0.05000        15         30    7    7 
hap1   hap2
  5   12
  4   13
  7    2
  3    6
  0   15
  1   14
  7    6
  8   11
 10    9
END inside C program
[['1' '27']
 ['1' '62']
 ['2' '7']
 ['2' '44']
 ['4' '61']
 ['4' '62']
 ['7' '7']
 ['7' '44']
 ['8' '51']
 ['8' '55']
 ['11' '51']
 ['11' '55']
 ['11' '61']
 ['11' '62']
 ['13' '27']
 ['13' '62']]
[[ 1  6 13]
 [ 1  5 14]
 [ 2  8  3]
 [ 2  4  7]
 [ 3  1 16]
 [ 3  2 15]
 [ 4  8  7]
 [ 5  9 12]
 [ 5 11 10]]
 converge: 1
 lnlike: -20.4231612445
 n_u_hap: 16
 n_hap_pairs: 9
 hap_prob: [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.15, 0.15, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
 u_hap: [1, 2, 1, 7, 2, 1, 2, 3, 3, 6, 3, 7, 4, 1, 4, 3, 5, 4, 5, 5, 6, 4, 6, 5, 6, 6, 6, 7, 7, 2, 7, 7]
 u_hap_code: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
 subj_id: [1, 1, 2, 2, 3, 3, 4, 5, 5]
 post: [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 0.5, 0.5]
 hap1_code: [6, 5, 8, 4, 1, 2, 8, 9, 11]
 hap2_code: [13, 14, 3, 7, 16, 15, 7, 12, 10]
hap_prob  u_hap_code u_hap(needs to be split for printing)
[[  0.05   1.  ]
 [  0.05   2.  ]
 [  0.05   3.  ]
 [  0.05   4.  ]
 [  0.05   5.  ]
 [  0.05   6.  ]
 [  0.15   7.  ]
 [  0.15   8.  ]
 [  0.05   9.  ]
 [  0.05  10.  ]
 [  0.05  11.  ]
 [  0.05  12.  ]
 [  0.05  13.  ]
 [  0.05  14.  ]
 [  0.05  15.  ]
 [  0.05  16.  ]]
subj_id  hap1_code  hap2_code
[[ 1  6 13]
 [ 1  5 14]
 [ 2  8  3]
 [ 2  4  7]
 [ 3  1 16]
 [ 3  2 15]
 [ 4  8  7]
 [ 5  9 12]
 [ 5 11 10]]

That looks the same as save.em in R

> save.em$haplotype
   DRB  B
1    1 27
2    1 62
3    2  7
4    2 44
5    4 61
6    4 62
7    7  7
8    7 44
9    8 51
10   8 55
11  11 51
12  11 55
13  11 61
14  11 62
15  13 27
16  13 62
> save.em$subj.id
[1] 1 1 2 2 3 3 4 5 5
> save.em$hap1code
[1]  6  5  8  4  1  2  8  9 11
> save.em$hap2code
[1] 13 14  3  7 16 15  7 12 10

alexlancaster commented 7 years ago

Interesting. It's consistent with between Python and R on the same platform. But it's not consistent cross-platform. For example on Linux, my output for those R commands matches my Python test case:

> save.em$haplotype
   DRB  B
1    1 27
2    1 62
3    2  7
4    2 44
5    4 61
6    4 62
7    7  7
8    7 44
9    8 51
10   8 55
11  11 51
12  11 55
13  11 61
14  11 62
15  13 27
16  13 62
> save.em$subj.id
[1] 1 1 2 2 3 3 4 5 5
> save.em$hap1code
[1]  6  5  3  4  1  2  7  9 10
> save.em$hap2code
[1] 13 14  8  7 16 15  8 12 11

It only affects the hap1code, hap2code, at least in this particular example. I suspect there might be some subtle platform-dependent effect, either from the random number seed or in how it handles integers.

In any case, it's an issue that affects the haplo.stats R package itself between platforms, so PyPop isn't introducing any additional problems. Maybe @rsingle will have some insights here. It's probably worth reporting upstream to the haplo.stats developers if they're not already aware of it.

sjmack commented 7 years ago

I should try installing PyPop on a Windows machine to see if we get still different results.

In my windows machine, with R haplo.stats I get these results:

> save.em$haplotype
   DRB  B
1    1 27
2    1 62
3    2  7
4    2 44
5    4 61
6    4 62
7    7  7
8    7 44
9    8 51
10   8 55
11  11 51
12  11 55
13  11 61
14  11 62
15  13 27
16  13 62
> save.em$subj.id
[1] 1 1 2 2 3 3 4 5 5
> save.em$hap1code
[1]  6 14  8  4  1 15  7 12 11
> save.em$hap2code
[1] 13  5  3  7 16  2  8  9 10

So here again are a different assignment ordering.

It isn't random; here's a re-run on my Mac install of R:

> save.em$haplotype
   DRB  B
1    1 27
2    1 62
3    2  7
4    2 44
5    4 61
6    4 62
7    7  7
8    7 44
9    8 51
10   8 55
11  11 51
12  11 55
13  11 61
14  11 62
15  13 27
16  13 62
> save.em$subj.id
[1] 1 1 2 2 3 3 4 5 5
> save.em$hap1code
[1]  6  5  8  4  1  2  8  9 11
> save.em$hap2code
[1] 13 14  3  7 16 15  7 12 10

alexlancaster commented 7 years ago

Yes, it definitely a platform-specific issue and it's in the R code itself, not PyPop per se. So I'm not going to spend too much more effort investigating it right now, and I'm going to continue on the integration (adding the .ini section, adding to the XML output etc), but will make a note of it. I suspect it may not matter much, so long as the actual haplotype frequencies and probabilities, etc. are consistent from platform to platform. It might make sense to remove hap1code and hap2code test for the moment.

alexlancaster commented 7 years ago

@sjmack OK, I've started converting the internal data structures into the XML format. Try running: ./haplo-stats/haplo-test.py You should see something like the following at the end of the output:

<group loci="DRB:B" showHaplo="yes" mode="all-pairwise-ld-no-permu">
<haplocount>16</haplocount>
<loglikelihood role="no-ld">-20.4231612445</loglikelihood>
<haplotypefreq>
<condition role="converged"></condition>
<haplotype name="1~27"><frequency>0.05</frequency></haplotype>
<haplotype name="1~62"><frequency>0.05</frequency></haplotype>
<haplotype name="2~7"><frequency>0.05</frequency></haplotype>
<haplotype name="2~44"><frequency>0.05</frequency></haplotype>
<haplotype name="4~61"><frequency>0.05</frequency></haplotype>
<haplotype name="4~62"><frequency>0.05</frequency></haplotype>
<haplotype name="7~7"><frequency>0.15</frequency></haplotype>
<haplotype name="7~44"><frequency>0.15</frequency></haplotype>
<haplotype name="8~51"><frequency>0.05</frequency></haplotype>
<haplotype name="8~55"><frequency>0.05</frequency></haplotype>
<haplotype name="11~51"><frequency>0.05</frequency></haplotype>
<haplotype name="11~55"><frequency>0.05</frequency></haplotype>
<haplotype name="11~61"><frequency>0.05</frequency></haplotype>
<haplotype name="11~62"><frequency>0.05</frequency></haplotype>
<haplotype name="13~27"><frequency>0.05</frequency></haplotype>
<haplotype name="13~62"><frequency>0.05</frequency></haplotype>
</haplotypefreq>
</group>

The idea is to make sure I'm mapping the correct values over and decide what is and isn't appropriate to port from emhaplofreq XML output. For comparison here is a sample XML from the UchiTelle population (I've trimmed some of the stuff in the various sections to keep it short):

<group mode="all-pairwise-ld-no-permu" loci="A:C" showHaplo="no">

<individcount role="before-filtering">10</individcount>
<individcount role="after-filtering">8</individcount>
<uniquepheno>8</uniquepheno>
<uniquegeno>14</uniquegeno>
<haplocount>26</haplocount>
<loglikelihood role="no-ld">-50.350627</loglikelihood>
<iterationsummary>
<![CDATA[
--- Iteration Summary for Original Data -------------------------------------------
Init. condition   0: Log likelihood after   8 iterations: -38.123095, error_flag: 0 

]]></iterationsummary>
<haplotypefreq>
<loginfo><![CDATA[
Percent of iterations with error_flag = 0: 100.000
Percent of iterations with error_flag = 2:   0.000

7: Log likelihood failed to converge in 400 iterations 
-----------------------------------------------------------------------------------

]]></loginfo>
<condition role="converged"/>
<iterConverged>22</iterConverged><loglikelihood>-36.043653</loglikelihood>
<haplotype name="0210~0102"><frequency>0.12500</frequency><numCopies>2.0</numCopies></haplotype>
<haplotype name="2501~0307"><frequency>0.12500</frequency><numCopies>2.0</numCopies></haplotype>
<haplotype name="03012~0712"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0101~0804"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0218~1202"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0201~1507"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="3204~1801"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="03012~0605"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="3204~1507"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="3204~0307"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0201~0102"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="6814~0712"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0201~02025"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0201~1202"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>

</haplotypefreq>

e.g do we want to capture all the the info in the CDATA sections (at least in the cases that there is a haplo.stats-equivalent)? For example, does the uniquegeno and the uniquepheno still make sense in this context?

sjmack commented 7 years ago

Here's what I get for./haplo-stats/haplo-test.py:

unique_alleles: ['1', '2', '4', '7', '8', '11', '13']
unique_alleles: ['7', '27', '44', '51', '55', '61', '62']
START inside C program
n_loci: 2
i hap_prob[i]   u_hap_code[i]   k   u_hap[k]
 0   0.05000         0          0    1    2 
 1   0.05000         1          2    1    7 
 2   0.05000         2          4    2    1 
 3   0.05000         3          6    2    3 
 4   0.05000         4          8    3    6 
 5   0.05000         5         10    3    7 
 6   0.15000         6         12    4    1 
 7   0.15000         7         14    4    3 
 8   0.05000         8         16    5    4 
 9   0.05000         9         18    5    5 
10   0.05000        10         20    6    4 
11   0.05000        11         22    6    5 
12   0.05000        12         24    6    6 
13   0.05000        13         26    6    7 
14   0.05000        14         28    7    2 
15   0.05000        15         30    7    7 
hap1   hap2
  5   12
  4   13
  7    2
  3    6
  0   15
  1   14
  7    6
  8   11
 10    9
END inside C program
[['1' '27']
 ['1' '62']
 ['2' '7']
 ['2' '44']
 ['4' '61']
 ['4' '62']
 ['7' '7']
 ['7' '44']
 ['8' '51']
 ['8' '55']
 ['11' '51']
 ['11' '55']
 ['11' '61']
 ['11' '62']
 ['13' '27']
 ['13' '62']]
[[ 1  6 13]
 [ 1  5 14]
 [ 2  8  3]
 [ 2  4  7]
 [ 3  1 16]
 [ 3  2 15]
 [ 4  8  7]
 [ 5  9 12]
 [ 5 11 10]]
 converge: 1
 lnlike: -20.4231612445
 n_u_hap: 16
 n_hap_pairs: 9
 hap_prob: [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.15, 0.15, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
 u_hap: [1, 2, 1, 7, 2, 1, 2, 3, 3, 6, 3, 7, 4, 1, 4, 3, 5, 4, 5, 5, 6, 4, 6, 5, 6, 6, 6, 7, 7, 2, 7, 7]
 u_hap_code: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
 subj_id: [1, 1, 2, 2, 3, 3, 4, 5, 5]
 post: [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 0.5, 0.5]
 hap1_code: [6, 5, 8, 4, 1, 2, 8, 9, 11]
 hap2_code: [13, 14, 3, 7, 16, 15, 7, 12, 10]
hap_prob  u_hap_code u_hap(needs to be split for printing)
[[  0.05   1.  ]
 [  0.05   2.  ]
 [  0.05   3.  ]
 [  0.05   4.  ]
 [  0.05   5.  ]
 [  0.05   6.  ]
 [  0.15   7.  ]
 [  0.15   8.  ]
 [  0.05   9.  ]
 [  0.05  10.  ]
 [  0.05  11.  ]
 [  0.05  12.  ]
 [  0.05  13.  ]
 [  0.05  14.  ]
 [  0.05  15.  ]
 [  0.05  16.  ]]
subj_id  hap1_code  hap2_code
[[ 1  6 13]
 [ 1  5 14]
 [ 2  8  3]
 [ 2  4  7]
 [ 3  1 16]
 [ 3  2 15]
 [ 4  8  7]
 [ 5  9 12]
 [ 5 11 10]]
sample XML output
<group loci="DRB:B" showHaplo="yes" mode="all-pairwise-ld-no-permu">
<uniquegeno>9</uniquegeno>
<haplocount>16</haplocount>
<loglikelihood role="no-ld">-20.4232</loglikelihood>
<haplotypefreq>
<condition role="converged"></condition>
<haplotype name="1~27"><frequency>0.05</frequency></haplotype>
<haplotype name="1~62"><frequency>0.05</frequency></haplotype>
<haplotype name="2~7"><frequency>0.05</frequency></haplotype>
<haplotype name="2~44"><frequency>0.05</frequency></haplotype>
<haplotype name="4~61"><frequency>0.05</frequency></haplotype>
<haplotype name="4~62"><frequency>0.05</frequency></haplotype>
<haplotype name="7~7"><frequency>0.15</frequency></haplotype>
<haplotype name="7~44"><frequency>0.15</frequency></haplotype>
<haplotype name="8~51"><frequency>0.05</frequency></haplotype>
<haplotype name="8~55"><frequency>0.05</frequency></haplotype>
<haplotype name="11~51"><frequency>0.05</frequency></haplotype>
<haplotype name="11~55"><frequency>0.05</frequency></haplotype>
<haplotype name="11~61"><frequency>0.05</frequency></haplotype>
<haplotype name="11~62"><frequency>0.05</frequency></haplotype>
<haplotype name="13~27"><frequency>0.05</frequency></haplotype>
<haplotype name="13~62"><frequency>0.05</frequency></haplotype>
</haplotypefreq>
</group>

I think the CDATA summaries are important -- we should keep them. I also think that keeping/reporting the per-subject haplotype assignments is important too, and would increase PyPop's utility.

alexlancaster commented 7 years ago

Great, excellent to see it's working for you.

For CDATA, just to be clear I am keeping them for the emhaplofreq case. For haplo-stats, the logging info it does is quite different, and there is not nearly as much of it, and due to the way we interact with the program, isn't as easy to export to the XML. So, at least at first, there will probably not be equivalent information.

On this subject, given the current resources, I think we should start outlining what should be in the haplo-stats output. I propose we consider three (or maybe more) categories:

minimal goals: this consists of the information needed for basic "minimal" functionality that can be achieved within the remaining project chunk. This would include the XML tags and their information. At the moment, I am working on completing all the information (where it makes sense) under the <haplotypefreq> XML section, with the haplotype frequencies as show in the example above. It would have the same ability to choose the loci to estimate haplotypes on, and the same .ini file semantics, i.e. lociToEstHaplo. Here we would also spell out the expected ini file commands.
stretch goals: additional output/information and functionality. Currently it's not clear to me how the LD information will be output by the haplo-stats code, and how this interacts with the haplotype frequency estimation. I think @rsingle mentioned it might needed to be added as fresh code (i.e there is no R code to follow as an example). So for the moment I would put the all the "LD" options such as allPairwiseLD into this category.
future goals: nice-to-have for the medium/long-term; e.g. extra metadata like the above CDATA and additional functionality that goes beyond mimicing the options in emhaplofreq would go here. Since these are more enhancement features, we would probably open up each as new github issue so we can keep track of them.

Also, currently the logic for how selecting specified loci, allPairwise options, and LD options all interact together are somewhat convoluted, both in terms of implementation, as well as user documentation. It would be good to revisit that logic and clean it up a bit. Since we're likely to end up deprecating the emhaplofreq module, it might make sense to start fresh on the names of the options (if they should change) and their semantics of the ini file for haplo-stats now, rather than attempting to mimic the entire existing .ini file syntax that is used for the emhaplofreq module. Classifying the features we need into some kind of typology like I gave above would be a good way to start this process.

sjmack commented 7 years ago

I was sort of hurrying when I posted my previous comment in the thread, and I wasn't looking as closely as I should have at what was in CDATA. What was focusing on is quoted below, which is actually before the CDATA.

<individcount role="before-filtering">10</individcount>
<individcount role="after-filtering">8</individcount>
<uniquepheno>8</uniquepheno>
<uniquegeno>14</uniquegeno>
<haplocount>26</haplocount>
<loglikelihood role="no-ld">-50.350627</loglikelihood>

So the emhaplofreq CDATA cognates are fine.

sjmack commented 7 years ago

For stretch goal 2, since @rsingle already is involved, and R dependency is already being built into PyPop, we may want to consider incorporating his assymLD R package (and future iterations that pull in more LD measures).

alexlancaster commented 7 years ago

OK, thanks for the clarification on the CDATA. Some of these just don't have haplo.stats cognates, but will port those that do.

As I mentioned, doing the assymLD probably not feasible just yet because we don't use R directly. haplo.stats only wraps the C module.

alexlancaster commented 7 years ago

I have been able to remove the max_haps = 18 hardcoded limit, and implemented the computation of the number of possible haplotype pairs in Python (this was in the R geno.count.pairs function). This should clear a lot of hurdles for using arbitrary data sizes. A lot of the complexity in the R version seems to be there because of the handling of missing data in the input genotype matrix. Currently rows with missing data (i.e. ****) in PyPop are removed before sending to emhaplofreq (as Rich points out, emhaplofreq doesn't understand missing data), so we should probably do the same here. This allows us to simplify the code considerably, since we don't need to handle all the complexities of missing data. I've put a note in the code to revisit this if necessary later on.

sjmack commented 7 years ago

Yes. Its very important to remove the missing data. For BIGDAWG (which uses haplo.stats) the table of all haplotype possibilities grows rapidly as the # missing increases, and we recommend # missing <=2 (preferably 0). Forcing it to be 0 for PyPop makes sense.

alexlancaster commented 7 years ago

@sjmack great, good to clarify that, it will make the module implementation considerably simpler.

alexlancaster commented 7 years ago

For all following, a preliminary pairwise LD estimation is now in haplo-stats branch. There are corresponding unit tests in test_Haplostats.py This info is not yet output to the XML/txt file. After that the main piece now is to get the logic in place to allow the estHaplotypes() function to be called on arbitrary subsets of the original input. Currently it only works on the entire input file, to do this requires some internal rejiggery of data structures.

alexlancaster commented 7 years ago

@sjmack and @rsingle: new version in haplo-stats branch to test that can be run end-to-end via the bin/pypop.py script. It's a little rough overall, but it works. This implements two .ini options (under the [Haplostats] section): lociToEstHaplo which works more-or-less as the version in Emhaplofreq, and allPairwise which does pairwise LD.

 git checkout haplo-stats
 ./setup.py test

To test a simple run, with a two-locus data set, try:

 ./bin/pypop.py -c tests/data/Test_Small_Haplostats.ini tests/data/Test_Small_Haplostats.pop

This will do both the full data set and allPairwise (which in this particular case, is the same thing). A 3-locus dataset try:

 ./bin/pypop.py -c tests/data/Test_Larger_Haplostats.ini tests/data/Test_Larger_Haplostats.pop

This will do all-pairs of loci (3 in this case) as well the haplotype estimation for the 3-locus haplotypes.

Caveats:

the output from the command-line is a mess, it sends lots of debugging stuff to the console, ignore for the moment
the text output works but there are duplicate entries in some of the output tables, this is because sometimes we are repeating the analyses, there is some complicated logic for the emhaplofreq which I'm loathe to duplicate here. The main thing to check is whether the values are all there.
this does not yet handle missing data by removing rows in any given submatrix that have missing data, this requires more StringMatrix jiggery-pokery, so for the moment, make sure no rows have missing data. This will be the next priority item after you've kicked the tires on this.
d_ij stuff is not yet integrated
probably a heap more stuff I'm forgetting

After 3) is done and looks OK, I'd like to merge to master and start testing this in earnest. It would be good if the Test_Larger_Haplostats.pop file could be modified to match a real example so we can do more unit testing on it. I just took the haplo.stats example we've been using (originally from the hla.demo package) and made up an extra C locus with bogus data.

rsingle commented 7 years ago

This looks good and output for LD matches from other programs. I've tried some larger files, but it is hard to compare exact LD stats. I think this is because there is only one starting condition being used currently (and other programs use multiple - thus slightly different estimates).

This is not an issue for the smaller datasets where there are probably no local maxima for the log likelihood.

Rich

On 8/17/2017 11:19 PM, Alex Lancaster wrote:

@sjmack https://github.com/sjmack and @rsingle https://github.com/rsingle: new version in |haplo-stats| branch to test that can be run end-to-end via the |bin/pypop.py| script. This implements two options: |lociToEstHaplo| which works more-or-less as the version in |Emhaplofreq|, and |allPairwise| which does pairwise LD.

|git checkout haplo-stats ./setup.py test |

To test a simple run, with a two-locus data set, try:

|./bin/pypop.py -c tests/data/Test_Small_Haplostats.ini tests/data/Test_Small_Haplostats.pop |

This will do both the full data set and allPairwise (which in this particular case, is the same thing). A 3-locus dataset try:

|./bin/pypop.py -c tests/data/Test_Larger_Haplostats.ini tests/data/Test_Larger_Haplostats.pop |

This will do all-pairs of loci (3 in this case) as well the haplotype estimation for the 3-locus haplotypes.

Caveats:

the output from the command-line is a mess, it sends lots of debugging stuff to the console, ignore for the moment

the text output works but there are duplicate entries in some of the output tables, this is because sometimes we are repeating the analyses, there is some complicated logic for the emhaplofreq which I'm loathe to duplicate here. The main thing to check is whether the values are all there.

this does /not/ yet handle missing data by removing rows in any given submatrix that have missing data, this requires more |StringMatrix| jiggery-pokery, so for the moment, make sure no rows have missing data. This will be the next priority item after you've kicked the tires on this.

|d_ij| stuff is not yet integrated

probably a heap more stuff I'm forgetting

After 3) is done and looks OK, I'd like to merge to master and start testing this in earnest. It would be good if the |Test_Larger_Haplostats.pop| file could be modified to match a real example so we can do more unit testing on it. I just took the |haplo.stats| example we've been using (originally from the |hla.demo| package) and made up an extra |C| locus with bogus data.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alexlancaster/pypop/issues/28#issuecomment-323249099, or mute the thread https://github.com/notifications/unsubscribe-auth/AFdJcnkKzqHlcPEoXTZa6-JzJbeHeqljks5sZQK_gaJpZM4OfvJ7.

--

Richard M. Single, Ph.D. Dept. of Mathematics and Statistics University of Vermont Burlington, VT 05405 802-656-8631, Richard.Single@uvm.edu

rsingle commented 7 years ago

As a follow-up ... below is the summary of iterations from one of the larger files I was running. You can see that the max (-76.729274) is achieved from a few starting conditions, but there are lots of local maxima. Comparing results from init.cond 0 to init.cond 4 for LD will give some lack of agreement (general trends will be similar though).

--- Iteration Summary for Original Data

Init. condition 0: LL after 15 iters: -85.047040, error_flag: 0 Init. condition 1: LL after 76 iters: -77.252522, error_flag: 0 Init. condition 2: LL after 15 iters: -80.025105, error_flag: 0 Init. condition 3: LL after 61 iters: -78.638816, error_flag: 0 Init. condition 4: LL after 12 iters: -76.729274, error_flag: 0 Init. condition 5: LL after 13 iters: -78.638816, error_flag: 0 Init. condition 6: LL after 12 iters: -84.707242, error_flag: 0 Init. condition 7: LL after 23 iters: -76.729274, error_flag: 0 Init. condition 8: LL after 17 iters: -80.888127, error_flag: 0 Init. condition 9: LL after 14 iters: -77.252522, error_flag: 0 Init. condition 10: LL after 15 iters: -78.638816, error_flag: 0 ...

On 8/18/2017 10:55 AM, Richard M. Single wrote:

This looks good and output for LD matches from other programs. I've tried some larger files, but it is hard to compare exact LD stats. I think this is because there is only one starting condition being used currently (and other programs use multiple - thus slightly different estimates).

This is not an issue for the smaller datasets where there are probably no local maxima for the log likelihood.

Rich

On 8/17/2017 11:19 PM, Alex Lancaster wrote:

@sjmack https://github.com/sjmack and @rsingle https://github.com/rsingle: new version in |haplo-stats| branch to test that can be run end-to-end via the |bin/pypop.py| script. This implements two options: |lociToEstHaplo| which works more-or-less as the version in |Emhaplofreq|, and |allPairwise| which does pairwise LD.

|git checkout haplo-stats ./setup.py test |

To test a simple run, with a two-locus data set, try:

|./bin/pypop.py -c tests/data/Test_Small_Haplostats.ini tests/data/Test_Small_Haplostats.pop |

This will do both the full data set and allPairwise (which in this particular case, is the same thing). A 3-locus dataset try:

|./bin/pypop.py -c tests/data/Test_Larger_Haplostats.ini tests/data/Test_Larger_Haplostats.pop |

This will do all-pairs of loci (3 in this case) as well the haplotype estimation for the 3-locus haplotypes.

Caveats:

the output from the command-line is a mess, it sends lots of debugging stuff to the console, ignore for the moment

the text output works but there are duplicate entries in some of the output tables, this is because sometimes we are repeating the analyses, there is some complicated logic for the emhaplofreq which I'm loathe to duplicate here. The main thing to check is whether the values are all there.

this does /not/ yet handle missing data by removing rows in any given submatrix that have missing data, this requires more |StringMatrix| jiggery-pokery, so for the moment, make sure no rows have missing data. This will be the next priority item after you've kicked the tires on this.

|d_ij| stuff is not yet integrated

probably a heap more stuff I'm forgetting

After 3) is done and looks OK, I'd like to merge to master and start testing this in earnest. It would be good if the |Test_Larger_Haplostats.pop| file could be modified to match a real example so we can do more unit testing on it. I just took the |haplo.stats| example we've been using (originally from the |hla.demo| package) and made up an extra |C| locus with bogus data.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alexlancaster/pypop/issues/28#issuecomment-323249099, or mute the thread https://github.com/notifications/unsubscribe-auth/AFdJcnkKzqHlcPEoXTZa6-JzJbeHeqljks5sZQK_gaJpZM4OfvJ7.

--

Richard M. Single, Ph.D. Dept. of Mathematics and Statistics University of Vermont Burlington, VT 05405 802-656-8631, Richard.Single@uvm.edu

alexlancaster commented 7 years ago

Yes, the number of iterations is set to 1 currently, since we've been mainly in testing mode. I will add the iterations as an .ini file option, but we should also have some kind of default, probably whatever haplo.stats uses for easy comparison.

alexlancaster commented 7 years ago

@rsingle OK, haplo.stats seems to use 10 as the default, so I set that as the global default. Try a git pull. I also added an .ini file option so the following should work

[Haplostats]
allPairwise=1
numInitCond=5

I output this in the <iterConverged>5</iterConverged> XML tag for the moment, although I know this isn't technically correct, I wanted to get it in there so we can check that it's done all the iterations, we can change it later. Does haplo.stats have an equivalent of the number of iterations until convergence? It wasn't clear to me from the code.

rsingle commented 7 years ago

Hi Alex,

That helps. I can match up the LD statistics now from running a test on Bluemoon. I do have some questions below.

Regarding iters, there is no record kept of the number of iters from haplostats. max.iter is passed as an argument to the .c program, but it is only used to bail out of iterations if they go on for too long. The actual # of iters is not returned.

I notice that there are 2 entries for LD stats (see output below) for the locus pair. I'm guessing that this corresponds to the first initial condition and then the results from the "best". Is that right? The 2nd line (D'=0.638008 Wn=0.602736) agree with emhaplofreq on Bluemoon.

Below the LD results is the screen output from running pypop. What are the 3 numbers printed for each row (there were 18 rows for this run) Are they the seeds for the random # generator maybe? I see the "found a better lnlikelihood! -1391.93797682" comment. That is the same LL as shown for the first row of LD stats (rounding to 2 decimal places). For testing it might help to have a few more decimal places. I'm guessing that the LLs are actually slightly different. There is a pretty big diff in Wn values for the 2.

Pairwise LD estimates

Locus pair D D' Wn ln(L_1) ln(L_0) S # permu p-value A:B *0.640559 0.554153 -1391.94 NaN -

A:B *0.638008 0.602736 -1391.94 NaN -

rich@ubuntu:~/Documents/github/pypop$ ./bin/pypop.py -c emhaplofreq/Test_LargerPopId_Haplostats.ini emhaplofreq/test-guarani.pop LOG: Data file has no header data block 23455 13636 17703 12837 26012 11956 11730 23343 26651 found a better lnlikelihood! -1391.93797682 21829 25797 29118 16869 14218 25712 17220 13891 12363 12825 13287 16863 23780 18907 11446 21203 24220 22737 23987 21573 28827 23225 19469 13839 25545 15151 26443 13074 20115 19772 20621 15708 15035 11370 24345 12727 29129 21281 28240 15900 23370 22237 10985 23201 11131

On 8/18/2017 5:05 PM, Alex Lancaster wrote:

@rsingle https://github.com/rsingle OK, |haplo.stats| seems to use 10 as the default, so I set that as the global default. Try a git pull. I also added an .ini file option so the following should work

|[Haplostats] allPairwise=1 numInitCond=5 |

I output this in the |5| XML tag for the moment, although I know this isn't technically correct, I wanted to get it in there so we can check that it's done all the iterations, we can change it later. Does |haplo.stats| have an equivalent of the number of iterations until convergence? It wasn't clear to be from the code.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alexlancaster/pypop/issues/28#issuecomment-323460574, or mute the thread https://github.com/notifications/unsubscribe-auth/AFdJcvGI6Up4wrMMrBhzYLG-DlKi8yxCks5sZfyLgaJpZM4OfvJ7.

--

Richard M. Single, Ph.D. Dept. of Mathematics and Statistics University of Vermont Burlington, VT 05405 802-656-8631, Richard.Single@uvm.edu

alexlancaster commented 7 years ago

The dupes are I think, because it's outputting info to the XML twice: once for the haplotypes and LD for the specified loci (which is "*" which resolves to the pairs), then again in all-pairwise mode for LD. This is what I was talking about in caveat 2 in https://github.com/alexlancaster/pypop/issues/28#issuecomment-323249099 they should be the same in principle, so I don't know why the stats are off so much. I'd run it several times without the all pairwise option just specifying the lociToEst for any specific pair.

If we actually need the iterations from the C code, we could add it to the list of returned values, if you can find the right location they are generated. Although it could be a big timesink to do that, so how important is it to have that information for now? The found a better likelihood was debugging I put in to check the loop, check the Haplo.py code there's a print statement there. In general you should have a look at the Haplo.py code yourself and see if it duplicates the logic of the R equivalent and feel free to add further debug statements to see if it is doing what you expect.

alexlancaster commented 7 years ago

Another thing, so the numInitCond (which is currently done in the Python code) is different to the number of iterations (done within the C), is that right? The iterations is for the convergence of the EM algorithm itself, and the numInitCond (n.tries in the R code) is for the number of times we restart the EM algorithm, is that right? If so, since there's going to be a different number of iterations for each condition, which one would we keep, just the last one that resulted in an improved log-likelihood ratio, I assume?

Looking in the C code it does record the the number of iterations reached in the local variable n_iter in haplo_em_pin(), we could return this value from the function as well if need be.

rsingle commented 7 years ago

You are right about the number of initial conditions (n.tries) vs. number of iters. If we did record the number of iters, we would want it to correspond to the best log likelihood. So it would mean keeping track of it for each initial condition and swapping the value each time you find a better log likelihood. That said, I think this is a lower priority, especially since it means changing the c code.

Rich

Sent from my iPad

On Aug 19, 2017, at 11:20 PM, Alex Lancaster notifications@github.com wrote:

Another thing, so the numInitCond (which is currently done in the Python code) is different to the number of iterations (done within the C), is that right? The iterations is for the convergence of the EM algorithm itself, and the numInitCond (n.tries in the R code) is for the number of times we restart the EM algorithm, is that right? If so, since there's going to be a different number of iterations for each condition, which one would we keep, just the last one that resulted in an improved log-likelihood ration?

Looking in the C code it does record the the number of iterations reached in the local variable n_iter in haplo_em_pin(), we could return this value from the function as well if need be.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

rsingle commented 7 years ago

That makes sense and I could follow it in the Haplo.py code. I bumped up the numInitCond to 15 and it took a lot of tries until I found substantially different LD results for the 2 versions (lociToEstHaplo=* and allPairwise=1).

Rich

On 8/19/2017 11:10 PM, Alex Lancaster wrote:

The dupes are I think, because it's outputting info to the XML twice: once for the haplotypes and LD for the specified loci (which is "*" which resolves to the pairs), then again in all-pairwise mode for LD. This is what I was talking about in caveat 2 in #28 (comment) https://github.com/alexlancaster/pypop/issues/28#issuecomment-323249099 they should be the same in principle, so I don't know why the stats are off so much. I'd run it several times without the all pairwise option just specifying the |lociToEst| for any specific pair.

If we actually need the iterations from the C code, we could add it to the list of returned values, if you can find the right location they are generated. Although it could be a big timesink to do that, so how important is it to have that information for now? The found a better likelihood was debugging I put in to check the loop, check the |Haplo.py| code there's a print statement there. In general you should have a look at the |Haplo.py| code yourself and see if it duplicates the logic of the R equivalent and feel free to add further debug statements to see if it is doing what you expect.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alexlancaster/pypop/issues/28#issuecomment-323560580, or mute the thread https://github.com/notifications/unsubscribe-auth/AFdJcsxmBtAr8qMaArJf05W6co4YW18Xks5sZ6OVgaJpZM4OfvJ7.

--

Richard M. Single, Ph.D. Dept. of Mathematics and Statistics University of Vermont Burlington, VT 05405 802-656-8631, Richard.Single@uvm.edu

alexlancaster commented 7 years ago

@rsingle OK, I've implement a testMode for calling Haplostats . The logic is: if calling in testMode (which is now used in the unit tests), we fix random number seeds for initial as well as subsequent calls of haplo_em_pin to be fixed series of integers otherwise we set random_start=0 for the first set of tries (which uses the input seed variables iseed1, iseed2, and iseed3) and then random_start=1 for the subsequent tries (this should match the R code, which doesn't have testMode).

You can use testMode from the PyPop command-line by adding -m as an option, e.g.:

 ./bin/pypop.py -m -c tests/data/Test_LargerPopId_Haplostats.ini emhaplofreq/test_nomiss.pop

In theory this means that all the output should be identical from run to run, if there are differences we should track them down to make sure we're running in a completely deterministic way.

sjmack commented 7 years ago

I just did a git pull in haplo-stats, ran ./bin/pypop.py -c tests/data/Test_Small_Haplostats.ini tests/data/Test_Small_Haplostats.pop, found a bunch of errors, and then switched back and forth between hallo-stats and the master, and discovered a new verison that fixed the errors! Nice!

alexlancaster commented 7 years ago

OK, try again on haplo-stats branch and re-run:

 ./setup.py test

I just updated unit tests a bit to check LD and ALD. I'm about to do a pull request to merge this version into master for testing and send an e-mail out. There will still be a fairly rough version needing more testing.

alexlancaster commented 7 years ago

I made another small adjustment to fix the unit tests, you might to do a new git pull.

Two major things of the new version:

doesn't remove rows with missing data yet
if you have locToEstHaplo=* in a two locus file, or lociToEstHaplo=locus1:locus2, don't use this together with allPairwise=1 in the same .ini file, otherwise it will run the analysis for the same loci set twice, and you may get different LD/haplotype values, which can introduce confusion. we'll figure out how to deal with this down the road

sjmack commented 7 years ago

I'm afraid that there is still a problem with this HaploStats implementation. The bug that I thought had been addressed yesterday is back in the new master branch.

When I run ./bin/pypop.py -c tests/data/Test_Small_Haplostats.ini tests/data/Test_Small_Haplostats.pop there are errors in the <haplostats> haplotype frequency estimates (and the haplotypes are still duplicated in the out.txt and out.xml files).

The more critical error is illustrated below (from the out.txt):

Haplotype frequency est. for loci: DRB:B
----------------------------------------
Unique genotypes: 6
Number of haplotypes: 10
Loglikelihood under linkage equilibrium [ln(L_0)]: 
Loglikelihood obtained via the EM algorithm [ln(L_1)]: -18.1738
Number of iterations before convergence: 

Haplotypes sorted by name             | Haplotypes sorted by frequency        
haplotype            frequency# copies| haplotype            frequency# copies
11~51                0.1              | 7~44                 0.1999994        
11~61                0.1              | 7~7                  0.1000005        
13~27                0.1              | 1~62                 0.1              
1~62                 0.1              | 4~62                 0.1              
2~44                 5.0074985        | 8~55                 0.1              
2~7                  0.0999994        | 11~51                0.1              
4~62                 0.1              | 11~61                0.1              
7~44                 0.1999994        | 13~27                0.1              
7~7                  0.1000005        | 2~7                  0.0999994        
8~55                 0.1              | 2~44                 5.0074985

First off, in the .pop file, there are only 10 possible haplotypes, because there are only 5 rows of genotype data.

DRB_1   DRB_2   B_1     B_2
4       11      62      61
2       7       7       44
1       13      27      62
7       7       7       44
8       11      51      55

However, there is only one DRB2 allele, so there should only be one DRB102~B haplotype, and yet there are two DRB*2 haplotypes reported (2~44 and 2~7).

More importantly, the frequency of 2~44 is 5.0074985, which is obviously an error.

Here are the <haplostats> covered haplotypes from the .xml:

  <haplostats>
<group loci="DRB:B" showHaplo="yes" mode="all-pairwise-ld-no-permu">
<uniquegeno>6</uniquegeno>
<haplocount>10</haplocount>
<haplotypefreq><numInitCond>10</numInitCond>
<loglikelihood role="no-ld">-18.1738</loglikelihood>

<condition role="converged"></condition>
<haplotype name="1~27"><frequency>0.1</frequency></haplotype>
<haplotype name="2~7"><frequency>5.5802122053e-07</frequency></haplotype>
<haplotype name="2~44"><frequency>0.0999994419788</frequency></haplotype>
<haplotype name="4~61"><frequency>0.1</frequency></haplotype>
<haplotype name="7~7"><frequency>0.199999441979</frequency></haplotype>
<haplotype name="7~44"><frequency>0.100000558021</frequency></haplotype>
<haplotype name="8~55"><frequency>0.1</frequency></haplotype>
<haplotype name="11~51"><frequency>0.1</frequency></haplotype>
<haplotype name="11~62"><frequency>0.1</frequency></haplotype>
<haplotype name="13~62"><frequency>0.1</frequency></haplotype>
</haplotypefreq>

Here are the <emhaplofreq> converged haplotypes from the out.xml

<condition role="converged"/>
<iterConverged>24</iterConverged><loglikelihood>-18.173821</loglikelihood>
<haplotype name="7~7"><frequency>0.20000</frequency><numCopies>2.0</numCopies></haplotype>
<haplotype name="7~44"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="11~62"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="4~61"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="13~27"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="1~62"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="11~51"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="8~55"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="2~44"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>

The <haplostats> frequency value for 2~44 of 5.0074985 is actually 5.0074985e-7, which when added to the frequency of 2~7 (0.0999994) sums to 0.1, which is the frequency reported in the converged <emhaplofreq> block for 2~7. Compare this to a similar (but not identical) issue for the frequencies of 7~7 and 7~44.

Also, while the out.xt file suggests that haplotype counts (# copies) are going to be presented, they aren't (because they don't seem to be present in the <haplostats> section of the .xml.

sjmack commented 7 years ago

Should have reopened.

alexlancaster commented 7 years ago

Hey Steve,

Can you open this up as a new bug? I'm closing this one because it's now integrated into master, so as a feature it's now part of PyPop, so this should be a new issue. And it's easier to manage new issues as individual discrete issues, rather than in this thread.

Thanks Alex

Closing.

On August 22, 2017 2:02:14 PM EDT, sjmack notifications@github.com wrote:

Should have reopened.

-- You are receiving this because you modified the open/close state. Reply to this email directly or view it on GitHub: https://github.com/alexlancaster/pypop/issues/28#issuecomment-324105461

rbpisupati commented 6 years ago

Hello, Is the missing information taken care of already? or we need to remove the rows which contain missing data?

sjmack commented 6 years ago

@rbpisupati, when missing data is defined in the .ini file (the default notation is "****"), it is automatically removed by the [Emhaplofreq] module. However, missing data is not currently removed when using the [Haplostats] module. Examples are provided in issue #41.