PeteHaitch / DelayedMatrixStats

A port of the matrixStats API to work with DelayedMatrix objects from the DelayedArray package

Use BLOCK_colVars/BLOCK_rowVars workhorses for colVars/rowVars methods #94

Closed hpages closed 1 year ago

hpages commented 1 year ago

Prior to DelayedMatrixStats 1.23.1, the colVars() and rowVars() methods for DelayedMatrix objects used colblock_APPLY() and rowblock_APPLY() internally to handle block processing. These utilities use blocks made of full columns and full rows, respectively, regardless of the physical layout of the data on disk. However, this doesn't "play well" with some physical layouts. For example, loading full rows into memory is extremely inefficient for a TENxMatrix object (from the HDF5Array package), because it triggers the loading of the entire dataset!
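To make the cost concrete, here is a toy model in Python (not R, and not code from DelayedArray or HDF5Array). It assumes a hypothetical dense matrix stored on disk as a grid of fixed-size chunks, and counts how many chunks must be read when the matrix is walked block by block. The function names and the chunk geometry are illustrative assumptions; real TENxMatrix storage differs in detail, but the effect of a mismatch between block geometry and layout is the same in spirit.

```python
def chunks_touched(row_range, col_range, chunk_nrow, chunk_ncol):
    """Count stored chunks overlapping one block (half-open index ranges)."""
    r0, r1 = row_range
    c0, c1 = col_range
    row_chunks = (r1 - 1) // chunk_nrow - r0 // chunk_nrow + 1
    col_chunks = (c1 - 1) // chunk_ncol - c0 // chunk_ncol + 1
    return row_chunks * col_chunks


def total_chunk_reads(nrow, ncol, block_nrow, block_ncol, chunk_nrow, chunk_ncol):
    """Total chunk reads needed to visit every block of the matrix once."""
    total = 0
    for r0 in range(0, nrow, block_nrow):
        for c0 in range(0, ncol, block_ncol):
            rows = (r0, min(r0 + block_nrow, nrow))
            cols = (c0, min(c0 + block_ncol, ncol))
            total += chunks_touched(rows, cols, chunk_nrow, chunk_ncol)
    return total


# Hypothetical 1000 x 1000 matrix stored as 100 column-oriented chunks
# of 1000 rows x 10 columns each.
full_row_blocks = total_chunk_reads(1000, 1000, 10, 1000, 1000, 10)
full_col_blocks = total_chunk_reads(1000, 1000, 1000, 10, 1000, 10)
```

In this model, every full-row block overlaps all 100 stored chunks, so the whole dataset is effectively re-read once per block, whereas full-column blocks read each chunk exactly once.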

The new BLOCK_colVars() and BLOCK_rowVars() internal helpers implemented in the DelayedArray package address this by trying to choose a block geometry that "plays well" with the physical layout. By delegating the work to these functions, the colVars() and rowVars() methods for DelayedMatrix objects can be 3x to 10x faster (or more) for datasets with a "difficult" physical layout, while also consuming a lot less memory.
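The reason the block geometry can be chosen freely is that per-column variances can be assembled from partial summaries of arbitrary blocks. The sketch below illustrates that idea in plain Python; it is not the actual BLOCK_colVars() implementation, whose internals are not described here. The function names are hypothetical, and the merge step uses the standard pairwise formula for combining (count, mean, sum of squared deviations) summaries, which is an assumption about one reasonable way to do it.

```python
def summarize(values):
    """Return (count, mean, M2), where M2 is the sum of squared deviations."""
    n = len(values)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return (n, mean, m2)


def merge_stats(a, b):
    """Combine the summaries of two disjoint sets of values."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


def block_col_vars(matrix, block_nrow, block_ncol):
    """Column variances of a list-of-rows matrix, visited block by block."""
    nrow = len(matrix)
    ncol = len(matrix[0]) if nrow else 0
    stats = [(0, 0.0, 0.0)] * ncol
    for r0 in range(0, nrow, block_nrow):
        rows = matrix[r0:r0 + block_nrow]
        for c0 in range(0, ncol, block_ncol):
            # Fold this block's contribution into each column it covers.
            for j in range(c0, min(c0 + block_ncol, ncol)):
                chunk = [row[j] for row in rows]
                stats[j] = merge_stats(stats[j], summarize(chunk))
    # Sample variance (n - 1 denominator), as in matrixStats::colVars().
    return [m2 / (n - 1) if n > 1 else float("nan") for n, _, m2 in stats]
```

Because the result does not depend on `block_nrow` or `block_ncol` (up to floating-point rounding), a block-processing helper is free to pick whichever geometry is cheapest to read for the underlying storage.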

Note that the other matrixStats methods defined in DelayedMatrixStats also use colblock_APPLY() and rowblock_APPLY() internally, so they will need to be modified in a similar way.

PeteHaitch commented 1 year ago

Thanks, Hervé. I'll incorporate once I re-sync with matrixStats v1.0 in devel.

PeteHaitch commented 1 year ago

Moving to https://github.com/PeteHaitch/DelayedMatrixStats/pull/97