estimate_observation_richness.py performance issues

jairideout commented 10 years ago

A user on the QIIME forum reported performance issues with estimate_observation_richness.py when using a small OTU table (tens of samples, hundreds of observations). The script was consuming a lot of memory and also running very slowly.

I was able to reproduce this on my end. The performance issues arise from the standard error calculation for extrapolation estimates (all other calcs should be fine). This algorithm is O(n**2) for both time and space, where n is the number of sequences in a sample (i.e. the sum of a column in the BIOM table). The equation in the original paper (Colwell et al. 2012, equation 10) is also O(n**2).

This serious performance issue slipped through when I originally wrote the script because I tested with tables that didn't have very high sampling depths (though the tables had many more samples and observations than the one provided on the forum). This was a Very Bad Thing :tm: and a huge mess-up on my part.

Fortunately, the script is marked as beta/experimental in the ChangeLog, and I don't think it's an in-demand QIIME script. It was originally written for use in a paper but we ended up not using the script. I'm pretty sure that no one has been using it since it was added several months ago, otherwise this performance issue would have arisen before the release.

I'm not sure what the best way forward is. There are alternative standard error estimators, such as bootstrapping (this is what iNEXT uses instead of the original equation, code here). I'm not sure how the performance will compare to the current approach, but it's worth looking into.

The original software that implements these equations (Robert Colwell's EstimateS) is closed-source, but from emailing the author, the algorithm is also O(n**2).

There are ways to make the existing estimate_observation_richness.py code more efficient, though I'm not sure how to solve the overall growth rate offhand.

Assigning to myself, though likely pretty low priority for now.

antgonza commented 10 years ago

Just out of curiosity, could you send on which line(s) of the script this is happening?

jairideout commented 10 years ago

Here's where the covariance matrix is computed:

https://github.com/qiime/qiime/blob/master/qiime/estimate_observation_richness.py#L437-L454

And here's where it's used to compute the stderr:

https://github.com/qiime/qiime/blob/master/qiime/estimate_observation_richness.py#L300-L332

antgonza commented 10 years ago

Cool, do you know if these will work for you? http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.gradient.html

http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html

wasade commented 10 years ago

where the covariance matrix is computed:

suggest doing the i==j case external to the nested loop to avoid branching logic inside the loops. also, suggest using arange instead of range as arange returns numpy types and will reduce python/numpy battles. also, cast s_est to whatever the appropriate type is outside the first loop, declare result with the desired type, and specify the appropriate dtype when using arange.

i have a suspect this can actually be done in numpy with broadcast and avoid the loops but I could be thinking about this incorrectly

On Tue, Jan 7, 2014 at 11:47 AM, Jai Ram Rideout notifications@github.comwrote:

Here's where the covariance matrix is computed:

https://github.com/qiime/qiime/blob/master/qiime/estimate_observation_richness.py#L437-L454

And here's where it's used to compute the stderr:

https://github.com/qiime/qiime/blob/master/qiime/estimate_observation_richness.py#L300-L332

— Reply to this email directly or view it on GitHubhttps://github.com/qiime/qiime/issues/1314#issuecomment-31766172 .

wdwvt1 commented 10 years ago

numpy has a covariance function called cov which i imagine is well optimized.

On Tue, Jan 7, 2014 at 11:47 AM, Jai Ram Rideout notifications@github.comwrote:

Here's where the covariance matrix is computed:

https://github.com/qiime/qiime/blob/master/qiime/estimate_observation_richness.py#L437-L454

And here's where it's used to compute the stderr:

https://github.com/qiime/qiime/blob/master/qiime/estimate_observation_richness.py#L300-L332

— Reply to this email directly or view it on GitHubhttps://github.com/qiime/qiime/issues/1314#issuecomment-31766172 .

wasade commented 10 years ago

good call @wdwvt1 and @antgonza. I think np.cov can be used here and that would definitely be ideal if possible

rob-knight commented 10 years ago

Thanks for catching this. np.cov is highly optimized and uses atlas/blas where available. There are also parallel implementations of cov available in standard scientific libraries if it comes to that.

On Jan 7, 2014, at 11:57 AM, Daniel McDonald notifications@github.com<mailto:notifications@github.com> wrote:

good call @wdwvt1https://github.com/wdwvt1 and @antgonzahttps://github.com/antgonza. I think np.cov can be used here and that would definitely be ideal if possible

— Reply to this email directly or view it on GitHubhttps://github.com/qiime/qiime/issues/1314#issuecomment-31767180.

jairideout commented 10 years ago

Thanks for the help everyone! I'll look into this and post updates here.

biocore / qiime

estimate_observation_richness.py performance issues #1314