Open jairideout opened 10 years ago
Just out of curiosity, could you send on which line(s) of the script this is happening?
Here's where the covariance matrix is computed:
https://github.com/qiime/qiime/blob/master/qiime/estimate_observation_richness.py#L437-L454
And here's where it's used to compute the stderr:
https://github.com/qiime/qiime/blob/master/qiime/estimate_observation_richness.py#L300-L332
Cool, do you know if these will work for you? http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.gradient.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
> Here's where the covariance matrix is computed:
Suggest doing the i == j case external to the nested loop to avoid branching logic inside the loops. Also, suggest using arange instead of range, as arange returns NumPy types and will reduce Python/NumPy battles. Also, cast s_est to whatever the appropriate type is outside the first loop, declare result with the desired dtype, and specify the appropriate dtype when using arange.
I suspect this can actually be done in NumPy with broadcasting, avoiding the loops entirely, but I could be thinking about this incorrectly.
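To illustrate the suggestions above, here is a minimal sketch. The covariance formula used is purely illustrative (not QIIME's actual equation); the point is the structure: cast `s_est` once, declare `result` with an explicit dtype, handle the diagonal outside the nested loop, and show a broadcast equivalent with no Python-level loops.

```python
import numpy as np

def cov_matrix_loop(counts, s_est):
    # Illustrative pairwise fill; the formula here is a placeholder,
    # not the one in estimate_observation_richness.py.
    s_est = float(s_est)                         # cast once, outside the loops
    n = len(counts)
    result = np.empty((n, n), dtype=np.float64)  # declare with desired dtype
    for i in np.arange(n, dtype=np.int64):       # arange with explicit dtype
        for j in np.arange(n, dtype=np.int64):
            result[i, j] = -counts[i] * counts[j] / s_est
    # Diagonal handled once, outside the nested loop -- no branching inside.
    result[np.diag_indices(n)] = counts * (1 - counts / s_est)
    return result

def cov_matrix_broadcast(counts, s_est):
    # Same matrix via broadcasting: outer product replaces both loops.
    s_est = float(s_est)
    result = -np.outer(counts, counts) / s_est
    result[np.diag_indices(len(counts))] = counts * (1 - counts / s_est)
    return result
```

Both functions produce the same matrix; the broadcast version just pushes the work into optimized NumPy routines.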
NumPy has a covariance function, np.cov, which I imagine is well optimized.
Good call @wdwvt1 and @antgonza. I think np.cov can be used here, and that would definitely be ideal if possible.
Thanks for catching this. np.cov is highly optimized and uses ATLAS/BLAS where available. There are also parallel implementations of cov available in standard scientific libraries if it comes to that.
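For reference, a quick sketch of what np.cov gives you: it estimates the sample covariance of rows (variables) across columns (observations). Whether that matches the analytical covariance matrix in Colwell et al.'s equations is a separate question that would need checking; this just shows the call itself.

```python
import numpy as np

# Two variables (rows), four observations (columns).
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0]])

# np.cov returns the 2x2 sample covariance matrix (ddof=1 by default).
c = np.cov(data)

# Sanity check against the definition for the first variable:
manual_var = data[0].var(ddof=1)
```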
Thanks for the help everyone! I'll look into this and post updates here.
A user on the QIIME forum reported performance issues with estimate_observation_richness.py when using a small OTU table (tens of samples, hundreds of observations). The script was consuming a lot of memory and also running very slowly.
I was able to reproduce this on my end. The performance issues arise from the standard error calculation for extrapolation estimates (all other calcs should be fine). This algorithm is O(n**2) in both time and space, where n is the number of sequences in a sample (i.e., the sum of a column in the BIOM table). The equation in the original paper (Colwell et al. 2012, equation 10) is also O(n**2).
This serious performance issue slipped through when I originally wrote the script because I tested with tables that didn't have very high sampling depths (though the tables had many more samples and observations than the one provided on the forum). This was a Very Bad Thing :tm: and a huge mess-up on my part.
Fortunately, the script is marked as beta/experimental in the ChangeLog, and I don't think it's an in-demand QIIME script. It was originally written for use in a paper but we ended up not using the script. I'm pretty sure that no one has been using it since it was added several months ago, otherwise this performance issue would have arisen before the release.
I'm not sure what the best way forward is. There are alternative standard error estimators, such as bootstrapping (this is what iNEXT uses instead of the original equation, code here). I'm not sure how the performance will compare to the current approach, but it's worth looking into.
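For a sense of what the bootstrap alternative looks like, here is a hedged sketch (not iNEXT's actual implementation): resample n sequences with replacement according to the observed relative abundances, re-run the richness estimator on each replicate, and take the standard deviation of the replicate estimates. The `estimator` callable and `n_boot` default are illustrative placeholders.

```python
import numpy as np

def bootstrap_stderr(counts, estimator, n_boot=200, seed=0):
    """Bootstrap standard error of a richness estimator.

    counts    : observed counts per OTU for one sample
    estimator : any callable mapping a counts vector to a number
    """
    rng = np.random.RandomState(seed)
    counts = np.asarray(counts, dtype=np.int64)
    n = counts.sum()               # total sequences in the sample
    probs = counts / float(n)      # observed relative abundances
    reps = np.empty(n_boot)
    for b in range(n_boot):
        # Resample n sequences with replacement, then re-estimate.
        resampled = rng.multinomial(n, probs)
        reps[b] = estimator(resampled)
    return reps.std(ddof=1)

# Example with a trivial estimator: observed richness (nonzero OTUs).
se = bootstrap_stderr(np.array([10, 5, 1, 1, 1]),
                      lambda c: int((c > 0).sum()))
```

Each bootstrap replicate is O(n) in the number of OTUs rather than O(n**2) in sequencing depth, so the cost is governed by `n_boot` instead of the sample's depth, though how this compares in practice to the analytical calculation would still need benchmarking.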
The original software that implements these equations (Robert Colwell's EstimateS) is closed-source, but from emailing the author, the algorithm there is also O(n**2).
There are ways to make the existing estimate_observation_richness.py code more efficient, though I'm not sure how to solve the overall growth rate offhand.
Assigning to myself, though likely pretty low priority for now.