bug in hypergeometric likelihood / choose

fau-klue / pandas-association-measures

Statistical association measures for Python pandas

MIT License

8 stars 2 forks source link

bug in hypergeometric likelihood / choose #12

Closed ausgerechnet closed 4 years ago

ausgerechnet commented 4 years ago

hypergeometric likelihood seems reasonably implemented but only yields Inf values. Is there a bug in the choose function?

martialblog commented 4 years ago

I don't think so, but we could extend the tests for it. Currently:

def test_choose():
    assert bi.choose(9, 3) == 84
    assert bi.choose(5, 2) == 10
    assert bi.choose(2, 2) == 1
    assert bi.choose(0, 0) == 1
    assert bi.choose(-3, 5) == 0
    assert bi.choose(-2, -2) == 0
    assert bi.choose(10, -2) == 0

Can you show me how to recreate the issue?

ausgerechnet commented 4 years ago

bi.choose(100000, 15000) already returns negative values; for very large numbers return value is always 0. This is obviously a problem with the int declaration.

martialblog commented 4 years ago

Jup, type conversion issue. But I don't think the solution is a second Python implementation, since the Cython code be fixed and is way faster.

cpdef double cchoose(int n, int k):
    if k < 0:
        return 0
    if k == 0:
        return 1
    if n < k:
        return 0

    cdef double p = 1
    cdef int N = min(k, n - k) + 1
    cdef int i
    for i in range(1, N):
        p *= n
        p = p / i
        n -= 1
    return p

%time
# Python
for i in range(10000):
    choose(i*2, i)
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.68 µs

%time
# Cython
for i in range(10000):
    cchoose(i*2, i)
CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.81 µs

ausgerechnet commented 4 years ago

I'd suggest to leave out the hypergeometric likelihood altogether. It's just way too slow to calculate. btw: python 3.8 ships with its own version of the binomial coefficient.

ausgerechnet commented 4 years ago

and I never suggested that the python implementation I hacked in would be a reasonable alternative – it was just for checking the correct result …