OrsonMM opened 7 years ago
Can you provide a reproducible example? What data are you using?
I'm having the same problem. I've got shotgun reads with count data structured identically to the BCI example (which runs without problems). The only difference is that I've got a few million counts per row instead of hundreds.
Any ideas? Happy to share a subset.
Hi all, I'm sorry about the lack of responses or maintenance of the package. I started a new faculty job last year and my time to work on what is essentially a "free side project" dropped to 0. I've been working on a mechanism to rectify that, but it might be a bit before I can really devote time to getting EcoPy back up to speed.
Well, when you are ready, to give you an assist (and in case anyone else needs to see this): the core problem is that the combination terms involve factorials of very large numbers. You can solve part of it via
import numpy as np
from scipy.special import comb

def rareCurve_Func(i, Sn, n, x):
    # expected richness in a subsample of size i, from per-species counts x (total n, observed richness Sn)
    sBar = Sn - np.sum(comb(n - x, i)) / comb(n, min(i, n - i), exact=False)
    return sBar
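For instance, it could be called per row of a count table like this (made-up numbers, assuming the imports above):

x = np.array([120, 45, 3, 1, 980])    # made-up per-species counts for one sample
n = x.sum()                           # total individuals in the sample
Sn = np.count_nonzero(x)              # observed richness
print(rareCurve_Func(100, Sn, n, x))  # expected richness in a subsample of 100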
But there is still an upper limit: once the arguments get large enough, comb(n-x, i) can only return inf, because the coefficients grow beyond what a float can hold. We can get partially around it with
def combination(N, k):
    # exact binomial coefficient using Python's arbitrary-precision integers
    # (modified from an old scipy.comb() implementation)
    if (k > N) or (N < 0) or (k < 0):
        return 0
    N, k = map(int, (N, k))
    # numerator: N * (N-1) * ... * (N-k+1)
    top = N
    val = 1
    while top > (N - k):
        val *= top
        top -= 1
    # divide by k! one factor at a time; the running product stays an integer
    n = 1
    while n < k + 1:
        val //= n
        n += 1
    return val
(Stack Overflow is my friend.) But if you run it on n = 204852.5 and i = 184551.8 (fractional values that come out of shotgun DNA runs), you get a number so large that it's obvious why np.sum(comb(n-x, i)) cannot return a finite float. The ultimate answer, imo, is that the estimation approach has a practical upper limit.
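To put numbers on that, here is a rough check (my own sketch, not part of ecopy; I've rounded the counts above to whole numbers) that measures the size of the coefficient via log-gamma instead of forming it:

import math
from scipy.special import comb

n, k = 204852, 184551  # counts quoted above, rounded to integers for illustration
print(comb(n, k, exact=False))  # inf: the true value exceeds the float64 maximum (~1.8e308)

# magnitude of C(n, k) computed from log-gamma, without ever forming the coefficient
log10_C = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(10)
print(round(log10_C))  # on the order of 28,700, i.e. an integer with roughly 29,000 digits

So the coefficient is tens of thousands of orders of magnitude past the largest double, which is why comb() can only hand back inf.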
Hi, I have the same problem. I am able to reproduce the BCI example, but when I run the "rarefy" function with my data, I get an error:
RuntimeWarning: invalid value encountered in double_scalars sBar = Sn - np.sum(comb(n-x, i))/comb(n, i)
I am able to use the same data with an R package for rarefaction curves and it works. I think the problem in EcoPy (v0.1.2.2) lies in the handling of very large or very small numbers, as other people in this issue have suggested. There are functions in the scipy.special library to deal with that. In my case, I think the problem comes from the high species richness in my data, which almost reaches saturation.
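One way to use those scipy.special functions (a minimal sketch of my own, not EcoPy's code, reusing the variable meanings from rarefy.py: i = subsample size, x = per-species counts, n = total count, Sn = observed richness) is to evaluate the comb ratio in log space with gammaln:

import numpy as np
from scipy.special import gammaln

def log_comb(n, k):
    # log of the binomial coefficient C(n, k); stable even when C(n, k) itself overflows
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def rarefy_sbar(i, Sn, n, x):
    # expected richness in a subsample of size i, with comb(n-x, i)/comb(n, i) taken in log space;
    # species with fewer than i "other" individuals contribute 0 rather than nan
    x = np.asarray(x, dtype=float)
    log_ratio = np.where(n - x >= i, log_comb(n - x, i) - log_comb(n, i), -np.inf)
    return Sn - np.sum(np.exp(log_ratio))

Because gammaln works with logarithms, neither binomial coefficient is ever materialised, so the ratio stays finite even when the counts run into the millions.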
Hi ecopy team,
I used the function ep.rarefy(data.frame, 'rarefy') and it returned this error:
ecopy/diversity/rarefy.py:137: RuntimeWarning: invalid value encountered in divide rare_calc = np.sum(1 - comb(diff, size)/comb(N, size))
And using ep.rarefy(data.frame, 'rarecurve') returns this error:
ecopy/diversity/rarefy.py:147: RuntimeWarning: invalid value encountered in double_scalars sBar = Sn - np.sum(comb(n-x, i))/comb(n, i)
What is your recommendation, please?