biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org
Other
89 stars 95 forks source link

PERF: use sparse matrix for table.collapse(..., one_to_many=True) #884

Closed wasade closed 1 year ago

wasade commented 1 year ago

Avoid dense representation on one_to_many. This code path was implicated in woltka collapse.

Note this is a WIP, I am asserting the memory reduction in woltka collapse right now. However, the high memory requirement is nearly certainly due to the dense memory representation previously used.

cc @antgonza @qiyunzhu

wasade commented 1 year ago

Test failure is due to doc build with Sphinx on py3.7, attempting to work through it.

wasade commented 1 year ago

...okay, now waiting on resources to test this.

@qiyunzhu, Antonio is out-of-office for a while. Is there any chance you could do a review of this PR? The change is relatively straightforward: instead of aggregating data in a dense numpy matrix, we aggregate into a "dict of keys" sparse matrix followed by conversion to compressed sparse column.

wasade commented 1 year ago

The current method supports divide, and no problem for BIOM. Good call on float64, just pushed a change for that.

The divide may trigger an underflow rather than an overflow. In which case, numpy would warn:

In [2]: v = np.float64(1e1000)

In [3]: v /= 1e1000
<ipython-input-3-4eac3ab25e8a>:1: RuntimeWarning: invalid value encountered in double_scalars
  v /= 1e1000

In [4]: v
Out[4]: nan
wasade commented 1 year ago

Thanks! My test is still running so will hold off merge for the time being