Closed alexlancaster closed 7 years ago
@sjmack the haplo-stats branch is ready for some initial testing:
git pull
git checkout haplo-stats
git pull
./setup.py build
You can run one of the tests:
py.test -s -v tests/test_Haplostats.py
This will print some diagnostic information with output etc.
Caveats: this is not yet integrated with the .ini file or the txt/XML output, and it is very far from finished. There is quite a bit of work to do to make sure all the measures line up and that we compute the correct haplotype names and frequency information. The way that haplo-stats calculates and represents haplotypes is quite different from emhaplofreq. But this prototype shows that, in principle, we can get all the information we need, at least for haplotype frequencies (LD is a different matter; @rsingle can maybe comment on that). Testing very large data files is yet another step beyond that.
Here's what I got. That looks like haplo.stats output!
================================================================== test session starts ===================================================================
platform darwin -- Python 2.7.13, pytest-3.1.2, py-1.4.34, pluggy-0.4.0 -- /opt/local/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
cachedir: .cache
rootdir: /Applications/PyPop/pypop, inifile:
collected 3 items
tests/test_Haplostats.py::test_Haplostats_Simple START inside C program
n_loci: 2
i hap_prob[i] u_hap_code[i] k u_hap[k]
0 0.05000 0 0 1 2
1 0.05000 1 2 1 7
2 0.05000 2 4 2 1
3 0.05000 3 6 2 3
4 0.05000 4 8 3 6
5 0.05000 5 10 3 7
6 0.15000 6 12 4 1
7 0.15000 7 14 4 3
8 0.05000 8 16 5 4
9 0.05000 9 18 5 5
10 0.05000 10 20 6 4
11 0.05000 11 22 6 5
12 0.05000 12 24 6 6
13 0.05000 13 26 6 7
14 0.05000 14 28 7 2
15 0.05000 15 30 7 7
hap1 hap2
5 12
4 13
7 2
3 6
0 15
1 14
7 6
8 11
10 9
END inside C program
PASSED
tests/test_Haplostats.py::test_Haplostats_Simple3 START inside C program
n_loci: 3
i hap_prob[i] u_hap_code[i] k u_hap[k]
0 0.11000 7 0 1 3 2
1 0.01000 11 3 1 3 9
2 0.01000 14 6 1 3 20
3 0.00500 17 9 1 5 3
4 0.04000 19 12 1 5 9
5 0.00500 20 15 1 5 14
6 0.01000 27 18 2 1 5
7 0.06000 37 21 2 4 9
8 0.02000 40 24 2 4 17
9 0.02000 42 27 2 5 1
10 0.01000 46 30 2 5 17
11 0.00250 47 33 2 6 12
12 0.00250 48 36 2 6 13
13 0.01000 50 39 2 6 18
14 0.01000 53 42 2 8 5
15 0.01000 54 45 2 8 6
16 0.02000 55 48 2 8 7
17 0.01000 57 51 2 8 11
18 0.00250 58 54 2 8 12
19 0.00250 59 57 2 8 13
20 0.01000 64 60 2 9 9
21 0.01000 73 63 2 10 19
22 0.01000 85 66 3 4 10
23 0.01000 86 69 3 4 12
24 0.01000 87 72 3 4 13
25 0.05000 88 75 3 4 16
26 0.01000 94 78 3 8 18
27 0.00500 99 81 4 5 3
28 0.00500 100 84 4 5 14
29 0.01000 103 87 4 10 15
30 0.00500 106 90 5 4 6
31 0.00500 107 93 5 4 16
32 0.00000 108 96 5 6 1
33 0.01000 109 99 5 6 4
34 0.00500 110 102 5 6 6
35 0.00250 111 105 5 6 12
36 0.00250 112 108 5 6 13
37 0.00500 113 111 5 6 16
38 0.00250 114 114 5 8 12
39 0.00250 115 117 5 8 13
40 0.01000 119 120 6 1 5
41 0.04000 120 123 6 1 6
42 0.01000 121 126 6 1 12
43 0.01000 122 129 6 1 15
44 0.01000 123 132 6 1 16
45 0.01000 132 135 6 7 7
46 0.01000 136 138 7 2 5
47 0.01000 143 141 8 10 13
48 0.01000 144 144 9 8 9
49 0.09000 149 147 10 2 1
50 0.03000 151 150 10 2 4
51 0.01000 152 153 10 2 5
52 0.01000 153 156 10 2 6
53 0.02000 155 159 10 2 9
54 0.01000 177 162 10 6 6
55 0.01000 184 165 10 9 12
56 0.01000 194 168 11 2 6
57 0.01000 205 171 11 9 1
58 0.01000 208 174 11 9 8
59 0.03000 209 177 11 9 9
60 0.02000 211 180 11 9 13
61 0.03000 214 183 11 9 18
62 0.00000 215 186 11 9 20
63 0.01000 218 189 12 2 1
64 0.01000 227 192 12 9 1
65 0.01000 229 195 12 9 14
66 0.01000 230 198 12 9 16
67 0.01000 232 201 12 9 19
hap1 hap2
40 94
149 19
27 214
42 19
47 115
58 112
48 114
59 111
7 19
232 149
144 64
149 194
119 120
37 88
152 87
214 209
88 149
205 121
120 184
46 211
73 149
149 109
108 151
40 230
136 37
7 11
7 123
106 113
110 107
218 86
227 120
214 55
229 153
53 7
57 149
42 103
208 88
120 151
177 7
54 55
88 211
149 132
209 37
37 37
209 14
11 215
155 149
17 100
20 99
7 7
85 155
50 88
19 151
7 7
7 122
151 7
37 143
END inside C program
PASSED
tests/test_Haplostats.py::test_Haplostats_PyPopStringMatrix START inside C program
n_loci: 2
i hap_prob[i] u_hap_code[i] k u_hap[k]
0 0.05000 0 0 1 2
1 0.05000 1 2 1 7
2 0.05000 2 4 2 1
3 0.05000 3 6 2 3
4 0.05000 4 8 3 6
5 0.05000 5 10 3 7
6 0.15000 6 12 4 1
7 0.15000 7 14 4 3
8 0.05000 8 16 5 4
9 0.05000 9 18 5 5
10 0.05000 10 20 6 4
11 0.05000 11 22 6 5
12 0.05000 12 24 6 6
13 0.05000 13 26 6 7
14 0.05000 14 28 7 2
15 0.05000 15 30 7 7
hap1 hap2
5 12
4 13
7 2
3 6
0 15
1 14
7 6
8 11
10 9
END inside C program
PASSED
================================================================ 3 passed in 1.78 seconds ================================================================
Excellent, so we know now it works on MacOS as well.
@sjmack try a new git pull in the haplo-stats branch and try the same test. You can also try ./haplo-stats/haplo-test.py which more or less does the same thing. We can now reconstruct the haplotypes using the original identifiers.
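As a rough illustration of that reconstruction (a sketch only, not the actual PyPop code), the integer allele codes printed in the u_hap column can be mapped back to allele names through the per-locus allele lists. The helper name decode_haplotypes is hypothetical; the allele lists and codes are those of the two-locus DRB/B test example.

```python
def decode_haplotypes(u_hap, unique_alleles):
    """Map a flat list of 1-based allele codes back to allele names.

    u_hap holds n_loci codes per haplotype, one haplotype after another.
    unique_alleles[j] lists the allele names for locus j in the order
    that defines the codes (code 1 is the first name, and so on).
    """
    n_loci = len(unique_alleles)
    haplotypes = []
    for start in range(0, len(u_hap), n_loci):
        codes = u_hap[start:start + n_loci]
        haplotypes.append(
            [unique_alleles[j][code - 1] for j, code in enumerate(codes)]
        )
    return haplotypes


# Allele lists and codes taken from the DRB/B test example.
unique_alleles = [
    ['1', '2', '4', '7', '8', '11', '13'],      # DRB
    ['7', '27', '44', '51', '55', '61', '62'],  # B
]
u_hap = [1, 2, 1, 7, 2, 1, 2, 3, 3, 6, 3, 7, 4, 1, 4, 3,
         5, 4, 5, 5, 6, 4, 6, 5, 6, 6, 6, 7, 7, 2, 7, 7]

haplotypes = decode_haplotypes(u_hap, unique_alleles)
print(haplotypes[0], haplotypes[6])  # ['1', '27'] ['7', '7']
```

The decoded rows line up with the save.em$haplotype table that R reports for the same data.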
If you have the R haplo.stats package installed, you should be able to reproduce the same values with the following commands:
library(haplo.stats)
control = haplo.em.control(n.try=1)
data(hla.demo)
attach(hla.demo)
geno = hla.demo[1:5,c(21:24)]
label <-c("DRB","B")
save.em <- haplo.em(geno=geno, locus.label=label, control=control)
The haplotypes are in the save.em$haplotype variable (and other similar variables).
Okay, here are the results of py.test -s -v tests/test_Haplostats.py
================================================================== test session starts ===================================================================
platform darwin -- Python 2.7.13, pytest-3.1.2, py-1.4.34, pluggy-0.4.0 -- /opt/local/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
cachedir: .cache
rootdir: /Applications/PyPop/pypop, inifile:
collected 3 items
tests/test_Haplostats.py::test_Haplostats_Simple START inside C program
n_loci: 2
i hap_prob[i] u_hap_code[i] k u_hap[k]
0 0.05000 0 0 1 2
1 0.05000 1 2 1 7
2 0.05000 2 4 2 1
3 0.05000 3 6 2 3
4 0.05000 4 8 3 6
5 0.05000 5 10 3 7
6 0.15000 6 12 4 1
7 0.15000 7 14 4 3
8 0.05000 8 16 5 4
9 0.05000 9 18 5 5
10 0.05000 10 20 6 4
11 0.05000 11 22 6 5
12 0.05000 12 24 6 6
13 0.05000 13 26 6 7
14 0.05000 14 28 7 2
15 0.05000 15 30 7 7
hap1 hap2
5 12
4 13
7 2
3 6
0 15
1 14
7 6
8 11
10 9
END inside C program
PASSED
tests/test_Haplostats.py::test_Haplostats_Simple3 START inside C program
n_loci: 3
i hap_prob[i] u_hap_code[i] k u_hap[k]
0 0.11000 7 0 1 3 2
1 0.01000 11 3 1 3 9
2 0.01000 14 6 1 3 20
3 0.00500 17 9 1 5 3
4 0.04000 19 12 1 5 9
5 0.00500 20 15 1 5 14
6 0.01000 27 18 2 1 5
7 0.06000 37 21 2 4 9
8 0.02000 40 24 2 4 17
9 0.02000 42 27 2 5 1
10 0.01000 46 30 2 5 17
11 0.00250 47 33 2 6 12
12 0.00250 48 36 2 6 13
13 0.01000 50 39 2 6 18
14 0.01000 53 42 2 8 5
15 0.01000 54 45 2 8 6
16 0.02000 55 48 2 8 7
17 0.01000 57 51 2 8 11
18 0.00250 58 54 2 8 12
19 0.00250 59 57 2 8 13
20 0.01000 64 60 2 9 9
21 0.01000 73 63 2 10 19
22 0.01000 85 66 3 4 10
23 0.01000 86 69 3 4 12
24 0.01000 87 72 3 4 13
25 0.05000 88 75 3 4 16
26 0.01000 94 78 3 8 18
27 0.00500 99 81 4 5 3
28 0.00500 100 84 4 5 14
29 0.01000 103 87 4 10 15
30 0.00500 106 90 5 4 6
31 0.00500 107 93 5 4 16
32 0.00000 108 96 5 6 1
33 0.01000 109 99 5 6 4
34 0.00500 110 102 5 6 6
35 0.00250 111 105 5 6 12
36 0.00250 112 108 5 6 13
37 0.00500 113 111 5 6 16
38 0.00250 114 114 5 8 12
39 0.00250 115 117 5 8 13
40 0.01000 119 120 6 1 5
41 0.04000 120 123 6 1 6
42 0.01000 121 126 6 1 12
43 0.01000 122 129 6 1 15
44 0.01000 123 132 6 1 16
45 0.01000 132 135 6 7 7
46 0.01000 136 138 7 2 5
47 0.01000 143 141 8 10 13
48 0.01000 144 144 9 8 9
49 0.09000 149 147 10 2 1
50 0.03000 151 150 10 2 4
51 0.01000 152 153 10 2 5
52 0.01000 153 156 10 2 6
53 0.02000 155 159 10 2 9
54 0.01000 177 162 10 6 6
55 0.01000 184 165 10 9 12
56 0.01000 194 168 11 2 6
57 0.01000 205 171 11 9 1
58 0.01000 208 174 11 9 8
59 0.03000 209 177 11 9 9
60 0.02000 211 180 11 9 13
61 0.03000 214 183 11 9 18
62 0.00000 215 186 11 9 20
63 0.01000 218 189 12 2 1
64 0.01000 227 192 12 9 1
65 0.01000 229 195 12 9 14
66 0.01000 230 198 12 9 16
67 0.01000 232 201 12 9 19
hap1 hap2
40 94
149 19
27 214
42 19
47 115
58 112
48 114
59 111
7 19
232 149
144 64
149 194
119 120
37 88
152 87
214 209
88 149
205 121
120 184
46 211
73 149
149 109
108 151
40 230
136 37
7 11
7 123
106 113
110 107
218 86
227 120
214 55
229 153
53 7
57 149
42 103
208 88
120 151
177 7
54 55
88 211
149 132
209 37
37 37
209 14
11 215
155 149
17 100
20 99
7 7
85 155
50 88
19 151
7 7
7 122
151 7
37 143
END inside C program
PASSED
tests/test_Haplostats.py::test_Haplostats_PyPopStringMatrix unique_alleles: ['1', '2', '4', '7', '8', '11', '13']
unique_alleles: ['7', '27', '44', '51', '55', '61', '62']
START inside C program
n_loci: 2
i hap_prob[i] u_hap_code[i] k u_hap[k]
0 0.05000 0 0 1 2
1 0.05000 1 2 1 7
2 0.05000 2 4 2 1
3 0.05000 3 6 2 3
4 0.05000 4 8 3 6
5 0.05000 5 10 3 7
6 0.15000 6 12 4 1
7 0.15000 7 14 4 3
8 0.05000 8 16 5 4
9 0.05000 9 18 5 5
10 0.05000 10 20 6 4
11 0.05000 11 22 6 5
12 0.05000 12 24 6 6
13 0.05000 13 26 6 7
14 0.05000 14 28 7 2
15 0.05000 15 30 7 7
hap1 hap2
5 12
4 13
7 2
3 6
0 15
1 14
7 6
8 11
10 9
END inside C program
[['1' '27']
['1' '62']
['2' '7']
['2' '44']
['4' '61']
['4' '62']
['7' '7']
['7' '44']
['8' '51']
['8' '55']
['11' '51']
['11' '55']
['11' '61']
['11' '62']
['13' '27']
['13' '62']]
[[ 1 6 13]
[ 1 5 14]
[ 2 8 3]
[ 2 4 7]
[ 3 1 16]
[ 3 2 15]
[ 4 8 7]
[ 5 9 12]
[ 5 11 10]]
FAILED
======================================================================== FAILURES ========================================================================
___________________________________________________________ test_Haplostats_PyPopStringMatrix ____________________________________________________________
    def test_Haplostats_PyPopStringMatrix():
        """
        This is the same numerical example as test_Haplostats_Simple()
        except we are setting up via PyPop StringMatrix, and letting the class
        handle all the translation into the low-level variables for the wrapper
        """
        from PyPop.Utils import StringMatrix
        from PyPop.Haplo import Haplostats
        import numpy
        import numpy.testing
        control = {'max_iter': 5000,
                   'min_posterior': 0.000000001,
                   'tol': 0.00001,
                   'insert_batch_size': 2,
                   'random_start': 0,
                   'verbose': 0,
                   'max_haps_limit': 10000 }
        geno = StringMatrix(5, ["DRB", "B"])
        geno[0, 'DRB'] = ('4', '11')
        geno[1, 'DRB'] = ('2', '7')
        geno[2, 'DRB'] = ('1', '13')
        geno[3, 'DRB'] = ('7', '7')
        geno[4, 'DRB'] = ('8', '11')
        geno[0, 'B'] = ('62', '61')
        geno[1, 'B'] = ('7', '44')
        geno[2, 'B'] = ('27', '62')
        geno[3, 'B'] = ('7', '44')
        geno[4, 'B'] = ('51', '55')
        haplo = Haplostats(geno)
        converge, lnlike, n_u_hap, n_hap_pairs, hap_prob, u_hap, u_hap_code, subj_id, post, hap1_code, hap2_code, haplotype = \
            haplo.estHaplotypes(weight=None, control=control)
        assert converge == 1
        assert lnlike == -20.42316124449607
        assert n_u_hap == 16
        assert n_hap_pairs == 9
        assert hap_prob == [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.15, 0.15, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
        assert u_hap == [1, 2, 1, 7, 2, 1, 2, 3, 3, 6, 3, 7, 4, 1, 4, 3, 5, 4, 5, 5, 6, 4, 6, 5, 6, 6, 6, 7, 7, 2, 7, 7]
        assert u_hap_code == [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
        assert subj_id == [1, 1, 2, 2, 3, 3, 4, 5, 5]
        assert post == [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 0.5, 0.5]
>       assert hap1_code == [6, 5, 3, 4, 1, 2, 7, 9, 10]
E       assert [6, 5, 8, 4, 1, 2, ...] == [6, 5, 3, 4, 1, 2, ...]
E         At index 2 diff: 8 != 3
E         Full diff:
E         - [6, 5, 8, 4, 1, 2, 8, 9, 11]
E         ?        ^           ^     ^
E         + [6, 5, 3, 4, 1, 2, 7, 9, 10]
E         ?        ^           ^     ^
tests/test_Haplostats.py:175: AssertionError
=========================================================== 1 failed, 2 passed in 1.33 seconds ===========================================================
Here are the results of ./haplo-stats/haplo-test.py
unique_alleles: ['1', '2', '4', '7', '8', '11', '13']
unique_alleles: ['7', '27', '44', '51', '55', '61', '62']
START inside C program
n_loci: 2
i hap_prob[i] u_hap_code[i] k u_hap[k]
0 0.05000 0 0 1 2
1 0.05000 1 2 1 7
2 0.05000 2 4 2 1
3 0.05000 3 6 2 3
4 0.05000 4 8 3 6
5 0.05000 5 10 3 7
6 0.15000 6 12 4 1
7 0.15000 7 14 4 3
8 0.05000 8 16 5 4
9 0.05000 9 18 5 5
10 0.05000 10 20 6 4
11 0.05000 11 22 6 5
12 0.05000 12 24 6 6
13 0.05000 13 26 6 7
14 0.05000 14 28 7 2
15 0.05000 15 30 7 7
hap1 hap2
5 12
4 13
7 2
3 6
0 15
1 14
7 6
8 11
10 9
END inside C program
[['1' '27']
['1' '62']
['2' '7']
['2' '44']
['4' '61']
['4' '62']
['7' '7']
['7' '44']
['8' '51']
['8' '55']
['11' '51']
['11' '55']
['11' '61']
['11' '62']
['13' '27']
['13' '62']]
[[ 1 6 13]
[ 1 5 14]
[ 2 8 3]
[ 2 4 7]
[ 3 1 16]
[ 3 2 15]
[ 4 8 7]
[ 5 9 12]
[ 5 11 10]]
converge: 1
lnlike: -20.4231612445
n_u_hap: 16
n_hap_pairs: 9
hap_prob: [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.15, 0.15, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
u_hap: [1, 2, 1, 7, 2, 1, 2, 3, 3, 6, 3, 7, 4, 1, 4, 3, 5, 4, 5, 5, 6, 4, 6, 5, 6, 6, 6, 7, 7, 2, 7, 7]
u_hap_code: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
subj_id: [1, 1, 2, 2, 3, 3, 4, 5, 5]
post: [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 0.5, 0.5]
hap1_code: [6, 5, 8, 4, 1, 2, 8, 9, 11]
hap2_code: [13, 14, 3, 7, 16, 15, 7, 12, 10]
hap_prob u_hap_code u_hap(needs to be split for printing)
[[ 0.05 1. ]
[ 0.05 2. ]
[ 0.05 3. ]
[ 0.05 4. ]
[ 0.05 5. ]
[ 0.05 6. ]
[ 0.15 7. ]
[ 0.15 8. ]
[ 0.05 9. ]
[ 0.05 10. ]
[ 0.05 11. ]
[ 0.05 12. ]
[ 0.05 13. ]
[ 0.05 14. ]
[ 0.05 15. ]
[ 0.05 16. ]]
subj_id hap1_code hap2_code
[[ 1 6 13]
[ 1 5 14]
[ 2 8 3]
[ 2 4 7]
[ 3 1 16]
[ 3 2 15]
[ 4 8 7]
[ 5 9 12]
[ 5 11 10]]
That looks the same as save.em in R
> save.em$haplotype
DRB B
1 1 27
2 1 62
3 2 7
4 2 44
5 4 61
6 4 62
7 7 7
8 7 44
9 8 51
10 8 55
11 11 51
12 11 55
13 11 61
14 11 62
15 13 27
16 13 62
> save.em$subj.id
[1] 1 1 2 2 3 3 4 5 5
> save.em$hap1code
[1] 6 5 8 4 1 2 8 9 11
> save.em$hap2code
[1] 13 14 3 7 16 15 7 12 10
Interesting. It's consistent between Python and R on the same platform, but it's not consistent across platforms. For example, on Linux my output for those R commands matches my Python test case:
> save.em$haplotype
DRB B
1 1 27
2 1 62
3 2 7
4 2 44
5 4 61
6 4 62
7 7 7
8 7 44
9 8 51
10 8 55
11 11 51
12 11 55
13 11 61
14 11 62
15 13 27
16 13 62
> save.em$subj.id
[1] 1 1 2 2 3 3 4 5 5
> save.em$hap1code
[1] 6 5 3 4 1 2 7 9 10
> save.em$hap2code
[1] 13 14 8 7 16 15 8 12 11
It only affects hap1code and hap2code, at least in this particular example. I suspect there might be some subtle platform-dependent effect, either from the random number seed or in how it handles integers.
In any case, it's an issue that affects the haplo.stats R package itself between platforms, so PyPop isn't introducing any additional problems. Maybe @rsingle will have some insights here. It's probably worth reporting upstream to the haplo.stats developers if they're not already aware of it.
I should try installing PyPop on a Windows machine to see if we get still different results.
On my Windows machine, with R haplo.stats, I get these results:
> save.em$haplotype
DRB B
1 1 27
2 1 62
3 2 7
4 2 44
5 4 61
6 4 62
7 7 7
8 7 44
9 8 51
10 8 55
11 11 51
12 11 55
13 11 61
14 11 62
15 13 27
16 13 62
> save.em$subj.id
[1] 1 1 2 2 3 3 4 5 5
> save.em$hap1code
[1] 6 14 8 4 1 15 7 12 11
> save.em$hap2code
[1] 13 5 3 7 16 2 8 9 10
So here again is a different assignment ordering.
It isn't random; here's a re-run on my Mac install of R:
> save.em$haplotype
DRB B
1 1 27
2 1 62
3 2 7
4 2 44
5 4 61
6 4 62
7 7 7
8 7 44
9 8 51
10 8 55
11 11 51
12 11 55
13 11 61
14 11 62
15 13 27
16 13 62
> save.em$subj.id
[1] 1 1 2 2 3 3 4 5 5
> save.em$hap1code
[1] 6 5 8 4 1 2 8 9 11
> save.em$hap2code
[1] 13 14 3 7 16 15 7 12 10
Yes, it's definitely a platform-specific issue and it's in the R code itself, not PyPop per se. So I'm not going to spend too much more effort investigating it right now, and I'm going to continue with the integration (adding the .ini section, adding to the XML output, etc.), but I will make a note of it. I suspect it may not matter much, so long as the actual haplotype frequencies, probabilities, etc. are consistent from platform to platform. It might make sense to remove the hap1code and hap2code tests for the moment.
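One possible alternative to dropping those assertions, sketched here as an assumption rather than as what the test suite does: in the three outputs quoted in this thread (Mac, Linux, Windows), each subject row carries the same two haplotype codes, and only which one is reported as hap1 versus hap2 differs. Comparing order-normalized pairs would therefore pass on all three. The helper name normalized_pairs is hypothetical.

```python
def normalized_pairs(subj_id, hap1_code, hap2_code):
    """Return sorted (subject, (low_code, high_code)) tuples.

    Sorting the two codes within each pair removes any dependence on
    which haplotype the EM code happens to label hap1 versus hap2.
    """
    return sorted(
        (subj, tuple(sorted((h1, h2))))
        for subj, h1, h2 in zip(subj_id, hap1_code, hap2_code)
    )


subj_id = [1, 1, 2, 2, 3, 3, 4, 5, 5]

# hap1code / hap2code values as reported in this thread.
mac = normalized_pairs(subj_id,
                       [6, 5, 8, 4, 1, 2, 8, 9, 11],
                       [13, 14, 3, 7, 16, 15, 7, 12, 10])
linux = normalized_pairs(subj_id,
                         [6, 5, 3, 4, 1, 2, 7, 9, 10],
                         [13, 14, 8, 7, 16, 15, 8, 12, 11])
windows = normalized_pairs(subj_id,
                           [6, 14, 8, 4, 1, 15, 7, 12, 11],
                           [13, 5, 3, 7, 16, 2, 8, 9, 10])

print(mac == linux == windows)  # True
```

This only checks the example in this thread; whether the within-pair ordering is the sole source of cross-platform differences on larger data sets is untested here.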
@sjmack OK, I've started converting the internal data structures into the XML format. Try running: ./haplo-stats/haplo-test.py
You should see something like the following at the end of the output:
<group loci="DRB:B" showHaplo="yes" mode="all-pairwise-ld-no-permu">
<haplocount>16</haplocount>
<loglikelihood role="no-ld">-20.4231612445</loglikelihood>
<haplotypefreq>
<condition role="converged"></condition>
<haplotype name="1~27"><frequency>0.05</frequency></haplotype>
<haplotype name="1~62"><frequency>0.05</frequency></haplotype>
<haplotype name="2~7"><frequency>0.05</frequency></haplotype>
<haplotype name="2~44"><frequency>0.05</frequency></haplotype>
<haplotype name="4~61"><frequency>0.05</frequency></haplotype>
<haplotype name="4~62"><frequency>0.05</frequency></haplotype>
<haplotype name="7~7"><frequency>0.15</frequency></haplotype>
<haplotype name="7~44"><frequency>0.15</frequency></haplotype>
<haplotype name="8~51"><frequency>0.05</frequency></haplotype>
<haplotype name="8~55"><frequency>0.05</frequency></haplotype>
<haplotype name="11~51"><frequency>0.05</frequency></haplotype>
<haplotype name="11~55"><frequency>0.05</frequency></haplotype>
<haplotype name="11~61"><frequency>0.05</frequency></haplotype>
<haplotype name="11~62"><frequency>0.05</frequency></haplotype>
<haplotype name="13~27"><frequency>0.05</frequency></haplotype>
<haplotype name="13~62"><frequency>0.05</frequency></haplotype>
</haplotypefreq>
</group>
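For readers following along, here is a minimal sketch of how output of this shape could be assembled from haplotype names and frequencies using the standard-library ElementTree module. It is not the PyPop implementation; the function name haplotype_freq_group is hypothetical, and only three of the sixteen haplotypes are passed in to keep the example short.

```python
import xml.etree.ElementTree as ET


def haplotype_freq_group(loci, haplotypes, hap_prob, lnlike):
    """Build a <group> element holding one <haplotype> per estimated haplotype."""
    group = ET.Element("group", {"loci": ":".join(loci),
                                 "showHaplo": "yes",
                                 "mode": "all-pairwise-ld-no-permu"})
    ET.SubElement(group, "haplocount").text = str(len(haplotypes))
    ET.SubElement(group, "loglikelihood", {"role": "no-ld"}).text = str(lnlike)
    freq = ET.SubElement(group, "haplotypefreq")
    ET.SubElement(freq, "condition", {"role": "converged"})
    for alleles, prob in zip(haplotypes, hap_prob):
        # Haplotype names join the per-locus allele names with "~".
        hap = ET.SubElement(freq, "haplotype", {"name": "~".join(alleles)})
        ET.SubElement(hap, "frequency").text = str(prob)
    return group


# A three-haplotype subset of the DRB/B example, for illustration only.
group = haplotype_freq_group(
    loci=["DRB", "B"],
    haplotypes=[["1", "27"], ["7", "7"], ["13", "62"]],
    hap_prob=[0.05, 0.15, 0.05],
    lnlike=-20.4231612445,
)
xml_text = ET.tostring(group, encoding="unicode")
print(xml_text)
```

Frequencies are written with str() here; a real exporter would probably fix the number of decimal places, as the emhaplofreq output does.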
The idea is to make sure I'm mapping the correct values over and decide what is and isn't appropriate to port from emhaplofreq XML output. For comparison here is a sample XML from the UchiTelle population (I've trimmed some of the stuff in the various sections to keep it short):
<group mode="all-pairwise-ld-no-permu" loci="A:C" showHaplo="no">
<individcount role="before-filtering">10</individcount>
<individcount role="after-filtering">8</individcount>
<uniquepheno>8</uniquepheno>
<uniquegeno>14</uniquegeno>
<haplocount>26</haplocount>
<loglikelihood role="no-ld">-50.350627</loglikelihood>
<iterationsummary>
<![CDATA[
--- Iteration Summary for Original Data -------------------------------------------
Init. condition 0: Log likelihood after 8 iterations: -38.123095, error_flag: 0
]]></iterationsummary>
<haplotypefreq>
<loginfo><![CDATA[
Percent of iterations with error_flag = 0: 100.000
Percent of iterations with error_flag = 2: 0.000
7: Log likelihood failed to converge in 400 iterations
-----------------------------------------------------------------------------------
]]></loginfo>
<condition role="converged"/>
<iterConverged>22</iterConverged><loglikelihood>-36.043653</loglikelihood>
<haplotype name="0210~0102"><frequency>0.12500</frequency><numCopies>2.0</numCopies></haplotype>
<haplotype name="2501~0307"><frequency>0.12500</frequency><numCopies>2.0</numCopies></haplotype>
<haplotype name="03012~0712"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0101~0804"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0218~1202"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0201~1507"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="3204~1801"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="03012~0605"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="3204~1507"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="3204~0307"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0201~0102"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="6814~0712"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0201~02025"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="0201~1202"><frequency>0.06250</frequency><numCopies>1.0</numCopies></haplotype>
</haplotypefreq>
E.g., do we want to capture all the info in the CDATA sections (at least in the cases where there is a haplo.stats equivalent)? For example, do uniquegeno and uniquepheno still make sense in this context?
Here's what I get for ./haplo-stats/haplo-test.py:
unique_alleles: ['1', '2', '4', '7', '8', '11', '13']
unique_alleles: ['7', '27', '44', '51', '55', '61', '62']
START inside C program
n_loci: 2
i hap_prob[i] u_hap_code[i] k u_hap[k]
0 0.05000 0 0 1 2
1 0.05000 1 2 1 7
2 0.05000 2 4 2 1
3 0.05000 3 6 2 3
4 0.05000 4 8 3 6
5 0.05000 5 10 3 7
6 0.15000 6 12 4 1
7 0.15000 7 14 4 3
8 0.05000 8 16 5 4
9 0.05000 9 18 5 5
10 0.05000 10 20 6 4
11 0.05000 11 22 6 5
12 0.05000 12 24 6 6
13 0.05000 13 26 6 7
14 0.05000 14 28 7 2
15 0.05000 15 30 7 7
hap1 hap2
5 12
4 13
7 2
3 6
0 15
1 14
7 6
8 11
10 9
END inside C program
[['1' '27']
['1' '62']
['2' '7']
['2' '44']
['4' '61']
['4' '62']
['7' '7']
['7' '44']
['8' '51']
['8' '55']
['11' '51']
['11' '55']
['11' '61']
['11' '62']
['13' '27']
['13' '62']]
[[ 1 6 13]
[ 1 5 14]
[ 2 8 3]
[ 2 4 7]
[ 3 1 16]
[ 3 2 15]
[ 4 8 7]
[ 5 9 12]
[ 5 11 10]]
converge: 1
lnlike: -20.4231612445
n_u_hap: 16
n_hap_pairs: 9
hap_prob: [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.15, 0.15, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
u_hap: [1, 2, 1, 7, 2, 1, 2, 3, 3, 6, 3, 7, 4, 1, 4, 3, 5, 4, 5, 5, 6, 4, 6, 5, 6, 6, 6, 7, 7, 2, 7, 7]
u_hap_code: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
subj_id: [1, 1, 2, 2, 3, 3, 4, 5, 5]
post: [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 0.5, 0.5]
hap1_code: [6, 5, 8, 4, 1, 2, 8, 9, 11]
hap2_code: [13, 14, 3, 7, 16, 15, 7, 12, 10]
hap_prob u_hap_code u_hap(needs to be split for printing)
[[ 0.05 1. ]
[ 0.05 2. ]
[ 0.05 3. ]
[ 0.05 4. ]
[ 0.05 5. ]
[ 0.05 6. ]
[ 0.15 7. ]
[ 0.15 8. ]
[ 0.05 9. ]
[ 0.05 10. ]
[ 0.05 11. ]
[ 0.05 12. ]
[ 0.05 13. ]
[ 0.05 14. ]
[ 0.05 15. ]
[ 0.05 16. ]]
subj_id hap1_code hap2_code
[[ 1 6 13]
[ 1 5 14]
[ 2 8 3]
[ 2 4 7]
[ 3 1 16]
[ 3 2 15]
[ 4 8 7]
[ 5 9 12]
[ 5 11 10]]
sample XML output
<group loci="DRB:B" showHaplo="yes" mode="all-pairwise-ld-no-permu">
<uniquegeno>9</uniquegeno>
<haplocount>16</haplocount>
<loglikelihood role="no-ld">-20.4232</loglikelihood>
<haplotypefreq>
<condition role="converged"></condition>
<haplotype name="1~27"><frequency>0.05</frequency></haplotype>
<haplotype name="1~62"><frequency>0.05</frequency></haplotype>
<haplotype name="2~7"><frequency>0.05</frequency></haplotype>
<haplotype name="2~44"><frequency>0.05</frequency></haplotype>
<haplotype name="4~61"><frequency>0.05</frequency></haplotype>
<haplotype name="4~62"><frequency>0.05</frequency></haplotype>
<haplotype name="7~7"><frequency>0.15</frequency></haplotype>
<haplotype name="7~44"><frequency>0.15</frequency></haplotype>
<haplotype name="8~51"><frequency>0.05</frequency></haplotype>
<haplotype name="8~55"><frequency>0.05</frequency></haplotype>
<haplotype name="11~51"><frequency>0.05</frequency></haplotype>
<haplotype name="11~55"><frequency>0.05</frequency></haplotype>
<haplotype name="11~61"><frequency>0.05</frequency></haplotype>
<haplotype name="11~62"><frequency>0.05</frequency></haplotype>
<haplotype name="13~27"><frequency>0.05</frequency></haplotype>
<haplotype name="13~62"><frequency>0.05</frequency></haplotype>
</haplotypefreq>
</group>
I think the CDATA summaries are important -- we should keep them. I also think that keeping/reporting the per-subject haplotype assignments is important too, and would increase PyPop's utility.
Great, excellent to see it's working for you.
For CDATA, just to be clear, I am keeping them for the emhaplofreq case. For haplo-stats, the logging info is quite different, there is not nearly as much of it, and, because of the way we interact with the program, it isn't as easy to export to the XML. So, at least at first, there will probably not be equivalent information.
On this subject, given the current resources, I think we should start outlining what should be in the haplo-stats output. I propose we consider three (or maybe more) categories:
1. minimal goals: this consists of the information needed for basic "minimal" functionality that can be achieved within the remaining project chunk. This would include the XML tags and their information. At the moment, I am working on completing all the information (where it makes sense) under the <haplotypefreq> XML section, with the haplotype frequencies as shown in the example above. It would have the same ability to choose the loci to estimate haplotypes on, and the same .ini file semantics, i.e. lociToEstHaplo. Here we would also spell out the expected .ini file commands.
2. stretch goals: additional output/information and functionality. Currently it's not clear to me how the LD information will be output by the haplo-stats code, and how this interacts with the haplotype frequency estimation. I think @rsingle mentioned it might need to be added as fresh code (i.e. there is no R code to follow as an example). So for the moment I would put all the "LD" options, such as allPairwiseLD, into this category.
3. future goals: nice-to-have for the medium/long term; e.g. extra metadata like the above CDATA and additional functionality that goes beyond mimicking the options in emhaplofreq would go here. Since these are more enhancement features, we would probably open each as a new GitHub issue so we can keep track of them.
Also, the logic for how selecting specified loci, the allPairwise options, and the LD options all interact is currently somewhat convoluted, both in implementation and in user documentation. It would be good to revisit that logic and clean it up a bit. Since we're likely to end up deprecating the emhaplofreq module, it might make sense to start fresh now on the names of the options (if they should change) and their .ini file semantics for haplo-stats, rather than attempting to mimic the entire existing .ini file syntax used for the emhaplofreq module. Classifying the features we need into some kind of typology like the one I gave above would be a good way to start this process.
I was sort of hurrying when I posted my previous comment in the thread, and I wasn't looking as closely as I should have at what was in CDATA. What I was focusing on is quoted below, which is actually before the CDATA.
<individcount role="before-filtering">10</individcount>
<individcount role="after-filtering">8</individcount>
<uniquepheno>8</uniquepheno>
<uniquegeno>14</uniquegeno>
<haplocount>26</haplocount>
<loglikelihood role="no-ld">-50.350627</loglikelihood>
So the emhaplofreq CDATA cognates are fine.
For stretch goal 2, since @rsingle already is involved, and R dependency is already being built into PyPop, we may want to consider incorporating his assymLD R package (and future iterations that pull in more LD measures).
OK, thanks for the clarification on the CDATA. Some of these just don't have haplo.stats cognates, but I will port those that do.
As I mentioned, doing the assymLD integration is probably not feasible just yet because we don't use R directly. haplo.stats only wraps the C module.
I have been able to remove the hardcoded max_haps = 18 limit, and have implemented the computation of the number of possible haplotype pairs in Python (this was in the R geno.count.pairs function). This should clear a lot of hurdles for using arbitrary data sizes. A lot of the complexity in the R version seems to be there because of the handling of missing data in the input genotype matrix. Currently, rows with missing data (i.e. ****) in PyPop are removed before sending to emhaplofreq (as Rich points out, emhaplofreq doesn't understand missing data), so we should probably do the same here. This allows us to simplify the code considerably, since we don't need to handle all the complexities of missing data. I've put a note in the code to revisit this if necessary later on.
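To make the simplification concrete, here is a sketch under stated assumptions, not the code in the branch. Once rows with missing data are dropped, a genotype that is heterozygous at h loci is compatible with 2^(h-1) unordered haplotype pairs (or exactly 1 when h is 0). The function names and the sixth, partly missing row are hypothetical; the first five rows are the DRB/B genotypes from the test case.

```python
MISSING = "****"  # missing-allele marker mentioned above


def drop_missing_rows(genotypes):
    """Keep only rows in which no locus contains a missing allele."""
    return [row for row in genotypes
            if all(MISSING not in locus for locus in row)]


def count_hap_pairs(row):
    """Number of unordered haplotype pairs compatible with one genotype.

    With no missing data, only heterozygous loci create phase ambiguity:
    h heterozygous loci give 2**(h - 1) pairs, and 1 pair if h == 0.
    """
    n_het = sum(1 for a1, a2 in row if a1 != a2)
    return 2 ** (n_het - 1) if n_het > 0 else 1


# Each row is one subject: [(DRB allele 1, DRB allele 2), (B allele 1, B allele 2)]
genotypes = [
    [('4', '11'), ('62', '61')],
    [('2', '7'), ('7', '44')],
    [('1', '13'), ('27', '62')],
    [('7', '7'), ('7', '44')],
    [('8', '11'), ('51', '55')],
    [('1', MISSING), ('7', '44')],  # hypothetical row, removed before counting
]

complete = drop_missing_rows(genotypes)
pair_counts = [count_hap_pairs(row) for row in complete]
print(pair_counts, sum(pair_counts))  # [2, 2, 2, 1, 2] 9
```

The total of 9 agrees with the n_hap_pairs value reported for this example earlier in the thread.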
Yes. It's very important to remove the missing data. For BIGDAWG (which uses haplo.stats), the table of all haplotype possibilities grows rapidly as the number of missing values increases, and we recommend no more than 2 missing (preferably 0). Forcing it to be 0 for PyPop makes sense.
@sjmack great, good to clarify that, it will make the module implementation considerably simpler.
For everyone following along, a preliminary pairwise LD estimation is now in the haplo-stats branch. There are corresponding unit tests in test_Haplostats.py. This info is not yet output to the XML/txt file. After that, the main remaining piece is to get the logic in place to allow the estHaplotypes() function to be called on arbitrary subsets of the original input. Currently it only works on the entire input file; supporting subsets requires some internal rejiggering of data structures.
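As background for what a pairwise LD estimate builds on, the sketch below computes the basic disequilibrium coefficients D_ij = h_ij - p_i * q_j from two-locus haplotype frequencies, where p_i and q_j are the allele frequencies obtained as marginals. This is the textbook definition only; which summary measures the branch actually reports is not stated here, and the function name is hypothetical. The frequencies are the DRB/B estimates shown earlier.

```python
from collections import defaultdict


def disequilibrium_coefficients(hap_freqs):
    """Return D_ij = h_ij - p_i * q_j for every listed two-locus haplotype.

    hap_freqs maps (allele_at_locus_1, allele_at_locus_2) to its
    estimated frequency. Allele frequencies are the row/column sums.
    Allele pairs that are not listed (frequency 0) are not returned.
    """
    p = defaultdict(float)  # allele frequencies at locus 1
    q = defaultdict(float)  # allele frequencies at locus 2
    for (a, b), freq in hap_freqs.items():
        p[a] += freq
        q[b] += freq
    return {(a, b): freq - p[a] * q[b] for (a, b), freq in hap_freqs.items()}


# DRB~B haplotype frequencies from the example output in this thread.
hap_freqs = {
    ('1', '27'): 0.05, ('1', '62'): 0.05, ('2', '7'): 0.05, ('2', '44'): 0.05,
    ('4', '61'): 0.05, ('4', '62'): 0.05, ('7', '7'): 0.15, ('7', '44'): 0.15,
    ('8', '51'): 0.05, ('8', '55'): 0.05, ('11', '51'): 0.05, ('11', '55'): 0.05,
    ('11', '61'): 0.05, ('11', '62'): 0.05, ('13', '27'): 0.05, ('13', '62'): 0.05,
}

d = disequilibrium_coefficients(hap_freqs)
print(round(d[('7', '7')], 4))  # 0.09
```

For the 7~7 haplotype, the DRB allele 7 has frequency 0.30 and the B allele 7 has frequency 0.20, so D is 0.15 - 0.06 = 0.09.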
@sjmack and @rsingle: there is a new version in the haplo-stats branch to test that can be run end-to-end via the bin/pypop.py script. It's a little rough overall, but it works. This implements two .ini options (under the [Haplostats] section): lociToEstHaplo, which works more or less like the version in Emhaplofreq, and allPairwise, which does pairwise LD.
git checkout haplo-stats
./setup.py test
To test a simple run, with a two-locus data set, try:
./bin/pypop.py -c tests/data/Test_Small_Haplostats.ini tests/data/Test_Small_Haplostats.pop
This will do both the full data set and allPairwise (which, in this particular case, is the same thing). For a 3-locus dataset, try:
./bin/pypop.py -c tests/data/Test_Larger_Haplostats.ini tests/data/Test_Larger_Haplostats.pop
This will do all pairs of loci (3 in this case) as well as the haplotype estimation for the 3-locus haplotypes.
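A small sketch of why three loci give three pairwise analyses, using the standard-library itertools module. The locus names follow the larger test file described in this thread (DRB, B and an added C locus); joining names with a colon mirrors the loci attribute in the XML shown earlier, but the order in which PyPop itself visits the pairs is an assumption.

```python
from itertools import combinations

loci = ["DRB", "B", "C"]

# Every unordered pair of loci, in the order the loci were listed.
locus_pairs = [":".join(pair) for pair in combinations(loci, 2)]
print(locus_pairs)  # ['DRB:B', 'DRB:C', 'B:C']
```

In general n loci give n * (n - 1) / 2 pairs, so the number of pairwise runs grows quadratically with the number of loci.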
Caveats:
1. the output from the command line is a mess: it sends lots of debugging stuff to the console; ignore it for the moment
2. the text output works but there are duplicate entries in some of the output tables; this is because sometimes we are repeating the analyses. There is some complicated logic for emhaplofreq which I'm loath to duplicate here. The main thing to check is whether the values are all there.
3. this does not yet handle missing data by removing rows in any given submatrix that have missing data; this requires more StringMatrix jiggery-pokery, so for the moment, make sure no rows have missing data. This will be the next priority item after you've kicked the tires on this.
4. d_ij stuff is not yet integrated
5. probably a heap more stuff I'm forgetting
After 3) is done and looks OK, I'd like to merge to master and start testing this in earnest. It would be good if the Test_Larger_Haplostats.pop file could be modified to match a real example so we can do more unit testing on it. I just took the haplo.stats example we've been using (originally from the hla.demo package) and made up an extra C locus with bogus data.
This looks good, and the LD output matches that from other programs. I've tried some larger files, but it is hard to compare exact LD stats. I think this is because there is only one starting condition being used currently (and other programs use multiple, thus giving slightly different estimates).
This is not an issue for the smaller datasets where there are probably no local maxima for the log likelihood.
Rich
As a follow-up ... below is the summary of iterations from one of the larger files I was running. You can see that the max (-76.729274) is achieved from a few starting conditions, but there are lots of local maxima. Comparing results from init.cond 0 to init.cond 4 for LD will give some lack of agreement (general trends will be similar though).
Init. condition 0: LL after 15 iters: -85.047040, error_flag: 0
Init. condition 1: LL after 76 iters: -77.252522, error_flag: 0
Init. condition 2: LL after 15 iters: -80.025105, error_flag: 0
Init. condition 3: LL after 61 iters: -78.638816, error_flag: 0
Init. condition 4: LL after 12 iters: -76.729274, error_flag: 0
Init. condition 5: LL after 13 iters: -78.638816, error_flag: 0
Init. condition 6: LL after 12 iters: -84.707242, error_flag: 0
Init. condition 7: LL after 23 iters: -76.729274, error_flag: 0
Init. condition 8: LL after 17 iters: -80.888127, error_flag: 0
Init. condition 9: LL after 14 iters: -77.252522, error_flag: 0
Init. condition 10: LL after 15 iters: -78.638816, error_flag: 0
...
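To make the point about local maxima concrete, here is a small sketch (not part of PyPop) that parses the summary lines quoted above and reports which initial conditions reach the best log likelihood. The regular expression and variable names are only illustrative and assume the line format shown in this comment.

```python
import re

# Summary lines copied from the comment above (first 11 initial conditions).
LOG = """\
Init. condition 0: LL after 15 iters: -85.047040, error_flag: 0
Init. condition 1: LL after 76 iters: -77.252522, error_flag: 0
Init. condition 2: LL after 15 iters: -80.025105, error_flag: 0
Init. condition 3: LL after 61 iters: -78.638816, error_flag: 0
Init. condition 4: LL after 12 iters: -76.729274, error_flag: 0
Init. condition 5: LL after 13 iters: -78.638816, error_flag: 0
Init. condition 6: LL after 12 iters: -84.707242, error_flag: 0
Init. condition 7: LL after 23 iters: -76.729274, error_flag: 0
Init. condition 8: LL after 17 iters: -80.888127, error_flag: 0
Init. condition 9: LL after 14 iters: -77.252522, error_flag: 0
Init. condition 10: LL after 15 iters: -78.638816, error_flag: 0
"""

# Assumed line format: condition index, iteration count, log likelihood, error flag.
PATTERN = re.compile(
    r"Init\. condition (\d+): LL after (\d+) iters: (-?\d+\.\d+), error_flag: (\d+)"
)

# One (condition, iterations, log likelihood) tuple per parsed line.
runs = [(int(c), int(n), float(ll)) for c, n, ll, _flag in PATTERN.findall(LOG)]

best_ll = max(ll for _, _, ll in runs)
best_conditions = [c for c, _, ll in runs if ll == best_ll]
distinct_maxima = sorted({ll for _, _, ll in runs}, reverse=True)

print("best LL:", best_ll)
print("reached by init. conditions:", best_conditions)
print("distinct end points:", len(distinct_maxima))
```

With a single starting condition (condition 0 here), the run would have stopped at -85.047040, well short of the best value found by conditions 4 and 7.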
Yes, the number of iterations is set to 1 currently, since we've been mainly in testing mode. I will add the iterations as an .ini file option, but we should also have some kind of default, probably whatever haplo.stats uses for easy comparison.
@rsingle OK, haplo.stats seems to use 10 as the default, so I set that as the global default. Try a git pull. I also added an .ini file option, so the following should work:
[Haplostats]
allPairwise=1
numInitCond=5
For the moment I output this in the <iterConverged>5</iterConverged> XML tag. I know this isn't technically correct, but I wanted to get it in there so we can check that it has done all the iterations; we can change it later. Does haplo.stats have an equivalent of the number of iterations until convergence? It wasn't clear to me from the code.
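As a rough illustration of how an option like numInitCond can fall back to a global default of 10, here is a minimal sketch using Python 3's standard configparser. It is not the actual PyPop parsing code; the function name read_haplostats_options and the constant are hypothetical.

```python
import configparser

# The .ini fragment from the comment above.
INI_TEXT = """\
[Haplostats]
allPairwise=1
numInitCond=5
"""

# Default mentioned in the comment above (matching haplo.stats).
DEFAULT_NUM_INIT_COND = 10


def read_haplostats_options(ini_text):
    """Return (allPairwise, numInitCond) from an .ini string (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive, e.g. numInitCond
    parser.read_string(ini_text)
    all_pairwise = parser.getboolean("Haplostats", "allPairwise", fallback=False)
    num_init_cond = parser.getint(
        "Haplostats", "numInitCond", fallback=DEFAULT_NUM_INIT_COND
    )
    return all_pairwise, num_init_cond


print(read_haplostats_options(INI_TEXT))
print(read_haplostats_options("[Haplostats]\nallPairwise=1\n"))
```

The second call omits numInitCond, so the fallback value is used.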
Hi Alex,
That helps. I can match up the LD statistics now from running a test on Bluemoon. I do have some questions below.
Regarding iters, there is no record kept of the number of iters from haplostats. max.iter is passed as an argument to the .c program, but it is only used to bail out of iterations if they go on for too long. The actual # of iters is not returned.
I notice that there are 2 entries for LD stats (see output below) for the locus pair. I'm guessing that this corresponds to the first initial condition and then the results from the "best". Is that right? The 2nd line (D'=0.638008 Wn=0.602736) agrees with emhaplofreq on Bluemoon.
Below the LD results is the screen output from running pypop. What are the 3 numbers printed for each row (there were 18 rows for this run)? Are they the seeds for the random # generator maybe? I see the "found a better lnlikelihood! -1391.93797682" comment. That is the same LL as shown for the first row of LD stats (rounding to 2 decimal places). For testing it might help to have a few more decimal places. I'm guessing that the LLs are actually slightly different. There is a pretty big diff in Wn values for the 2.
rich@ubuntu:~/Documents/github/pypop$ ./bin/pypop.py -c emhaplofreq/Test_LargerPopId_Haplostats.ini emhaplofreq/test-guarani.pop
LOG: Data file has no header data block
23455 13636 17703
12837 26012 11956
11730 23343 26651
found a better lnlikelihood! -1391.93797682
21829 25797 29118
16869 14218 25712
17220 13891 12363
12825 13287 16863
23780 18907 11446
21203 24220 22737
23987 21573 28827
23225 19469 13839
25545 15151 26443
13074 20115 19772
20621 15708 15035
11370 24345 12727
29129 21281 28240
15900 23370 22237
10985 23201 11131
The dupes are, I think, because it's outputting info to the XML twice: once for the haplotypes and LD for the specified loci (which is "*", which resolves to the pairs), then again in all-pairwise mode for LD. This is what I was talking about in caveat 2 in https://github.com/alexlancaster/pypop/issues/28#issuecomment-323249099. They should be the same in principle, so I don't know why the stats are off so much. I'd run it several times without the all pairwise option, just specifying the lociToEst for any specific pair.
If we actually need the iterations from the C code, we could add it to the list of returned values, if you can find the right location where they are generated. It could be a big timesink to do that, though, so how important is it to have that information for now? The "found a better likelihood" message was debugging I put in to check the loop; there's a print statement for it in the Haplo.py code. In general you should have a look at the Haplo.py code yourself and see if it duplicates the logic of the R equivalent, and feel free to add further debug statements to see if it is doing what you expect.
Another thing, so the numInitCond (which is currently done in the Python code) is different to the number of iterations (done within the C), is that right? The iterations is for the convergence of the EM algorithm itself, and the numInitCond (n.tries in the R code) is for the number of times we restart the EM algorithm, is that right? If so, since there's going to be a different number of iterations for each condition, which one would we keep, just the last one that resulted in an improved log-likelihood ratio, I assume?
Looking in the C code it does record the number of iterations reached in the local variable n_iter in haplo_em_pin(), we could return this value from the function as well if need be.
You are right about the number of initial conditions (n.tries) vs. number of iters. If we did record the number of iters, we would want it to correspond to the best log likelihood. So it would mean keeping track of it for each initial condition and swapping the value each time you find a better log likelihood. That said, I think this is a lower priority, especially since it means changing the c code.
Rich
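The bookkeeping described here (keep the iteration count that belongs to the best log likelihood, swapping it whenever a better one is found) could look roughly like the sketch below. This is not the Haplo.py code: run_em is a hypothetical stand-in for one restart of the EM (in PyPop that would be a call into haplo_em_pin), and here it simply replays the (log likelihood, iterations) pairs from the summary posted earlier in this thread.

```python
# (log likelihood, iterations) per initial condition, copied from the
# iteration summary posted earlier in this thread.
RECORDED = [
    (-85.047040, 15), (-77.252522, 76), (-80.025105, 15), (-78.638816, 61),
    (-76.729274, 12), (-78.638816, 13), (-84.707242, 12), (-76.729274, 23),
    (-80.888127, 17), (-77.252522, 14), (-78.638816, 15),
]


def run_em(init_cond):
    """Hypothetical stand-in for one EM restart; returns (loglik, n_iter)."""
    return RECORDED[init_cond]


def best_of_restarts(num_init_cond):
    """Run num_init_cond restarts and keep the result with the highest log likelihood."""
    best = None
    for init_cond in range(num_init_cond):
        loglik, n_iter = run_em(init_cond)
        # Swap loglik and n_iter together, only on a strict improvement.
        if best is None or loglik > best["loglik"]:
            best = {"init_cond": init_cond, "loglik": loglik, "n_iter": n_iter}
    return best


print(best_of_restarts(1))
print(best_of_restarts(len(RECORDED)))
```

Because the comparison is strict, a later restart that only ties the best value (condition 7 here) does not replace the stored iteration count.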
That makes sense and I could follow it in the Haplo.py code. I bumped up the numInitCond to 15 and it took a lot of tries until I found substantially different LD results for the 2 versions (lociToEstHaplo=* and allPairwise=1).
Rich
@rsingle OK, I've implemented a testMode for calling Haplostats. The logic is: if calling in testMode (which is now used in the unit tests), we fix the random number seeds for the initial as well as subsequent calls of haplo_em_pin to be a fixed series of integers; otherwise we set random_start=0 for the first set of tries (which uses the input seed variables iseed1, iseed2, and iseed3) and then random_start=1 for the subsequent tries (this should match the R code, which doesn't have testMode).
You can use testMode from the PyPop command-line by adding -m as an option, e.g.:
./bin/pypop.py -m -c tests/data/Test_LargerPopId_Haplostats.ini emhaplofreq/test_nomiss.pop
In theory this means that all the output should be identical from run to run. If there are differences, we should track them down to make sure we're running in a completely deterministic way.
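A rough sketch of the seeding logic described above, for readers who want to see it laid out. This is not the Haplo.py implementation: plan_tries, the default seed values, and the way test-mode seeds are derived from the try index are all assumptions made for illustration, as is the choice of random_start=0 in test mode.

```python
def plan_tries(num_init_cond, test_mode, first_seeds=(1, 2, 3)):
    """Return one dict per try with the random_start flag and seeds to pass on.

    first_seeds stands in for iseed1, iseed2, iseed3 (values are hypothetical).
    """
    plan = []
    for attempt in range(num_init_cond):
        if test_mode:
            # Assumption: a fixed, reproducible series of integers for every try.
            seeds = tuple(seed + 3 * attempt for seed in first_seeds)
            plan.append({"random_start": 0, "seeds": seeds})
        elif attempt == 0:
            # First try uses the input seed variables.
            plan.append({"random_start": 0, "seeds": first_seeds})
        else:
            # Later tries let the C code choose a random start.
            plan.append({"random_start": 1, "seeds": None})
    return plan


for entry in plan_tries(3, test_mode=True):
    print(entry)
for entry in plan_tries(3, test_mode=False):
    print(entry)
```

In test mode two calls with the same arguments give the same plan, which is the property the unit tests rely on.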
I just did a git pull in haplo-stats, ran ./bin/pypop.py -c tests/data/Test_Small_Haplostats.ini tests/data/Test_Small_Haplostats.pop, found a bunch of errors, and then switched back and forth between haplo-stats and master, and discovered a new version that fixed the errors! Nice!
OK, try again on the haplo-stats branch and re-run:
./setup.py test
I just updated the unit tests a bit to check LD and ALD. I'm about to do a pull request to merge this version into master for testing and send an e-mail out. It will still be a fairly rough version needing more testing.
I made another small adjustment to fix the unit tests, so you might want to do a new git pull.
Two major things of the new version:
lociToEstHaplo=* in a two locus file, or lociToEstHaplo=locus1:locus2, don't use this together with allPairwise=1 in the same .ini file, otherwise it will run the analysis for the same loci set twice, and you may get different LD/haplotype values, which can introduce confusion. We'll figure out how to deal with this down the road.
I'm afraid that there is still a problem with this HaploStats implementation. The bug that I thought had been addressed yesterday is back in the new master branch.
When I run ./bin/pypop.py -c tests/data/Test_Small_Haplostats.ini tests/data/Test_Small_Haplostats.pop there are errors in the <haplostats> haplotype frequency estimates (and the haplotypes are still duplicated in the out.txt and out.xml files).
The more critical error is illustrated below (from the out.txt):
Haplotype frequency est. for loci: DRB:B
----------------------------------------
Unique genotypes: 6
Number of haplotypes: 10
Loglikelihood under linkage equilibrium [ln(L_0)]:
Loglikelihood obtained via the EM algorithm [ln(L_1)]: -18.1738
Number of iterations before convergence:
Haplotypes sorted by name | Haplotypes sorted by frequency
haplotype frequency# copies| haplotype frequency# copies
11~51 0.1 | 7~44 0.1999994
11~61 0.1 | 7~7 0.1000005
13~27 0.1 | 1~62 0.1
1~62 0.1 | 4~62 0.1
2~44 5.0074985 | 8~55 0.1
2~7 0.0999994 | 11~51 0.1
4~62 0.1 | 11~61 0.1
7~44 0.1999994 | 13~27 0.1
7~7 0.1000005 | 2~7 0.0999994
8~55 0.1 | 2~44 5.0074985
First off, in the .pop file, there are only 10 possible haplotypes, because there are only 5 rows of genotype data.
DRB_1 DRB_2 B_1 B_2
4 11 62 61
2 7 7 44
1 13 27 62
7 7 7 44
8 11 51 55
However, there is only one DRB*2 allele, so there should only be one DRB1*02~B haplotype, and yet there are two DRB*2 haplotypes reported (2~44 and 2~7).
More importantly, the frequency of 2~44 is 5.0074985, which is obviously an error.
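The counting argument above can be checked directly from the five genotype rows. The sketch below is only illustrative and is not PyPop code; it counts chromosomes and DRB allele copies, which gives the total that the frequencies of all haplotypes carrying a given DRB allele should add up to.

```python
from collections import Counter

# Genotype rows copied from the .pop excerpt above.
ROWS = """\
DRB_1 DRB_2 B_1 B_2
4 11 62 61
2 7 7 44
1 13 27 62
7 7 7 44
8 11 51 55
"""

lines = ROWS.strip().splitlines()
header = lines[0].split()
genotypes = [dict(zip(header, line.split())) for line in lines[1:]]

# Each individual contributes two haplotypes.
n_haplotypes = 2 * len(genotypes)

# Count copies of each DRB allele across both columns.
drb_counts = Counter(
    allele for g in genotypes for allele in (g["DRB_1"], g["DRB_2"])
)

# Expected total frequency of all haplotypes carrying DRB allele 2.
drb2_marginal = drb_counts["2"] / n_haplotypes

print(n_haplotypes, drb_counts["2"], drb2_marginal)
```

So whatever haplotypes are reported for DRB allele 2, their frequencies together should come to 0.1, and no single frequency can exceed 1.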
Here are the <haplostats> converged haplotypes from the .xml:
<haplostats>
<group loci="DRB:B" showHaplo="yes" mode="all-pairwise-ld-no-permu">
<uniquegeno>6</uniquegeno>
<haplocount>10</haplocount>
<haplotypefreq><numInitCond>10</numInitCond>
<loglikelihood role="no-ld">-18.1738</loglikelihood>
<condition role="converged"></condition>
<haplotype name="1~27"><frequency>0.1</frequency></haplotype>
<haplotype name="2~7"><frequency>5.5802122053e-07</frequency></haplotype>
<haplotype name="2~44"><frequency>0.0999994419788</frequency></haplotype>
<haplotype name="4~61"><frequency>0.1</frequency></haplotype>
<haplotype name="7~7"><frequency>0.199999441979</frequency></haplotype>
<haplotype name="7~44"><frequency>0.100000558021</frequency></haplotype>
<haplotype name="8~55"><frequency>0.1</frequency></haplotype>
<haplotype name="11~51"><frequency>0.1</frequency></haplotype>
<haplotype name="11~62"><frequency>0.1</frequency></haplotype>
<haplotype name="13~62"><frequency>0.1</frequency></haplotype>
</haplotypefreq>
Here are the <emhaplofreq> converged haplotypes from the out.xml:
<condition role="converged"/>
<iterConverged>24</iterConverged><loglikelihood>-18.173821</loglikelihood>
<haplotype name="7~7"><frequency>0.20000</frequency><numCopies>2.0</numCopies></haplotype>
<haplotype name="7~44"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="11~62"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="4~61"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="13~27"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="1~62"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="11~51"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="8~55"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
<haplotype name="2~44"><frequency>0.10000</frequency><numCopies>1.0</numCopies></haplotype>
The <haplostats> frequency value for 2~44 of 5.0074985 is actually 5.0074985e-7, which when added to the frequency of 2~7 (0.0999994) sums to 0.1, which is the frequency reported in the converged <emhaplofreq> block for 2~44. Compare this to a similar (but not identical) issue for the frequencies of 7~7 and 7~44.
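A quick way to sanity-check a block like the <haplostats> one above is to confirm that every frequency lies in [0, 1] and that they sum to 1. The sketch below does that for the ten haplotype entries quoted above, wrapped in a single root element so they parse on their own. The last few lines show one hypothetical way a value such as 5.0074985e-07 could end up displayed as 5.0074985, namely cutting the string at a fixed width; this is an assumption about the text formatting, not something confirmed from the PyPop code.

```python
import xml.etree.ElementTree as ET

# Haplotype entries copied from the <haplostats> excerpt above.
XML = """<haplotypefreq>
<haplotype name="1~27"><frequency>0.1</frequency></haplotype>
<haplotype name="2~7"><frequency>5.5802122053e-07</frequency></haplotype>
<haplotype name="2~44"><frequency>0.0999994419788</frequency></haplotype>
<haplotype name="4~61"><frequency>0.1</frequency></haplotype>
<haplotype name="7~7"><frequency>0.199999441979</frequency></haplotype>
<haplotype name="7~44"><frequency>0.100000558021</frequency></haplotype>
<haplotype name="8~55"><frequency>0.1</frequency></haplotype>
<haplotype name="11~51"><frequency>0.1</frequency></haplotype>
<haplotype name="11~62"><frequency>0.1</frequency></haplotype>
<haplotype name="13~62"><frequency>0.1</frequency></haplotype>
</haplotypefreq>"""

root = ET.fromstring(XML)
freqs = {h.get("name"): float(h.findtext("frequency")) for h in root.iter("haplotype")}

total = sum(freqs.values())
out_of_range = [name for name, f in freqs.items() if not 0.0 <= f <= 1.0]
print("sum of frequencies:", total)
print("out of range:", out_of_range)

# Hypothetical illustration: truncating exponent notation drops the "e-07".
tiny = 5.0074985e-07
truncated = repr(tiny)[:9]   # fixed-width cut of the string form
fixed = "%.7f" % tiny        # fixed-point formatting keeps the magnitude
print(truncated, fixed)
```

The XML values themselves pass both checks, which is consistent with the problem being in how the text output is formatted rather than in the estimates.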
Also, while the out.txt file suggests that haplotype counts (# copies) are going to be presented, they aren't (because they don't seem to be present in the <haplostats> section of the .xml).
Should have reopened.
Hey Steve,
Can you open this up as a new bug? I'm closing this one because it's now integrated into master, so as a feature it's now part of PyPop, so this should be a new issue. And it's easier to manage new issues as individual discrete issues, rather than in this thread.
Thanks Alex
Closing.
Hello, is the missing data handled already, or do we need to remove the rows which contain missing data?
@rbpisupati, when missing data is defined in the .ini file (the default notation is "****"), it is automatically removed by the [Emhaplofreq] module. However, missing data is not currently removed when using the [Haplostats] module. Examples are provided in issue #41.
As an alternative to emhaplofreq, which has limitations on the number of loci and number of individuals, @rsingle and I are working on the haplo-stats branch: https://github.com/alexlancaster/pypop/tree/haplo-stats to wrap the C module haplo_em_pin for estimating haplotypes (originally part of the R package haplo.stats and licensed under the GPL). @sjmack has previously tested haplo.stats outside the PyPop context.