MoBiodiv / mobr

Tools for analyzing changes in diversity across scales

rarefaction breaks due to memory issues when community is very large #166

Closed dmcglinn closed 6 years ago

dmcglinn commented 6 years ago

Specifically, the call rarefaction(sad, 'indiv') returns an error like "Error: cannot allocate vector of size 24.7 Gb" when the sad is as large as those often seen in microbial communities (S > 20K and N > 100K). This happens because the effort argument is optional, and when it is not specified rarefied richness is computed for every value of n from 1 to N (here, more than 100K values). This can be easily fixed by specifying reasonable values for the effort, such as:

rarefaction(sad, 'indiv', effort = c(2^seq(0, log2(sum(sad))), sum(sad)))

which effectively evaluates the curve only at log2 increments of n between 1 and N, plus N itself. Although the argument log_scale is implemented in the function get_delta_stats in an attempt to employ this kind of effort increment, every single abundance is still used when the delta curves for the individual rarefaction result are computed. Essentially I'm proposing to make two changes to the code base:

  1. Provide a log_scale argument in the function rarefaction so that log increments in effort can be more easily computed. I think it would also be worthwhile to make log-binned increments the default for individual-based rarefaction, whose curve generally has a very well-behaved shape. I do not think binning rules need to be imposed for the spatial rarefaction curve because there shouldn't be a large computational cost for computing these sample-based curves for all possible sample sizes.

  2. get_delta_stats should be more careful to avoid computing individual rarefaction curves for all values of n from 1 to N. In other words, the arguments passed to get_delta_stats by the user should actually be honored by the downstream functions.
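
To make the idea of log-spaced effort concrete, here is a rough base R sketch. It does not use mobr: log2_effort and rarefy_indiv are hypothetical helper names invented for this illustration, the rarefaction itself is the textbook hypergeometric expectation rather than the package's actual code, and the sad is simulated.

```r
# Hypothetical helper (not part of mobr): log2-spaced effort values
# from 1 up to N, always including N itself.
log2_effort <- function(N) {
  unique(c(2^seq(0, floor(log2(N))), N))
}

# Textbook individual-based rarefaction (hypergeometric expectation),
# evaluated only at the requested effort values. This is an assumption
# for illustration, not mobr's implementation.
rarefy_indiv <- function(sad, effort) {
  sad <- sad[sad > 0]
  N <- sum(sad)
  sapply(effort, function(n) {
    # log P(species i is absent from n individuals drawn without replacement)
    log_p_absent <- lchoose(N - sad, n) - lchoose(N, n)
    sum(1 - exp(log_p_absent))
  })
}

set.seed(1)
sad <- rgeom(2000, prob = 0.02) + 1   # simulated abundances, all >= 1
effort <- log2_effort(sum(sad))       # a few dozen values at most, not 1:N
S_n <- rarefy_indiv(sad, effort)
```

The memory needed then scales with the number of effort values (roughly log2(N)) times S rather than with N times S, which is the point of the proposed log_scale default.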

Let me know if you have any thoughts here. I'm glad I learned of this bug because, in general, implementing these changes will make the package faster and, of course, extendable to larger microbial community datasets.

rueuntal commented 6 years ago

That all sounds great to me. Thanks @dmcglinn for catching this!

dmcglinn commented 6 years ago

This same issue is also at play in #204. The best solution is to specify log_scale = TRUE and/or to specify the argument inds.