catalystneuro / nwb-conversion-tools

Create NWB files by converting and combining neural data in proprietary formats and adding essential metadata.
https://nwb-conversion-tools.readthedocs.io/en/main/
BSD 3-Clause "New" or "Revised" License

[Refactoring/Enhancement] Pad buffering in GenericDataChunkIterator #542

Closed CodyCBakerPhD closed 2 years ago

CodyCBakerPhD commented 2 years ago

It has been known since the original implementation of the GenericDataChunkIterator that the default buffer shape method (https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/data_utils.py#L310-L329), given a target buffer size, is a tad inefficient for extremely lopsided distributions of axis lengths (i.e., when one axis is many orders of magnitude longer than the others). The main justification for accepting this was that we allow manual specification of the buffer_shape, which gives the user full control over calculating such things.

So far this has not been too severe; statistically speaking, over the set of 2D array shapes ranging from 1,000-10,000 x 1,000-100,000, the default method specifies a buffer shape within at most 5% error of the target buffer_gb size (so at most ~50 MB off a 1 GB target). However, for 2D arrays with one axis around ~1,000 and the other around ~100,000,000, the resulting buffer shapes tend to vastly underestimate the target.
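To make the failure mode concrete, here is a deliberately simplified sketch of a 'scale every axis by the same factor' default (illustrative chunk shape and numbers; not the exact hdmf implementation): the short axis caps out at its full length, and the leftover budget is never redistributed to the long axis, so the buffer lands far below the target.

```python
import math

def naive_default_buffer_shape(maxshape, chunk_shape, itemsize, buffer_gb):
    """Simplified sketch of a uniform per-axis scaling default
    (illustrative only; not the exact hdmf code)."""
    chunk_bytes = math.prod(chunk_shape) * itemsize
    # One scaling factor k so that k**ndim chunks roughly fill the target
    k = math.floor((buffer_gb * 1e9 / chunk_bytes) ** (1 / len(chunk_shape)))
    # Each axis is capped at its full length; the unused budget from a
    # short, capped axis is never redistributed to the long axis
    return tuple(min(k * c, m) for c, m in zip(chunk_shape, maxshape))

# Lopsided 2D array: ~1,000 x ~100,000,000, int16 data, 1 GB target
shape, chunks, itemsize = (1_000, 100_000_000), (64, 64), 2
buffer_shape = naive_default_buffer_shape(shape, chunks, itemsize, buffer_gb=1.0)
buffer_mb = math.prod(buffer_shape) * itemsize / 1e6
print(buffer_shape, f"{buffer_mb:.1f} MB")  # far below the 1,000 MB target
```

With these (hypothetical) parameters the short axis caps at 1,000 while the long axis stays at 349 chunks, filling only ~45 MB of the 1 GB budget.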

With respect to our most common use case of this iterator (ecephys traces), this essentially means that single-probe SpikeGLX recordings longer than about 30 minutes in duration will start to experience decreased speeds with iterative write (unless, of course, a hard-coded buffer shape was originally specified).

Anyway, this PR introduces some heuristics for buffer 'padding' that produce equivalent performance in the previous axis-scaling regimes but fix the issue for very disproportionate ratios by filling the buffer with as many contiguous chunks along an axis as possible.
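A minimal sketch of such a padding pass (hypothetical code, not the PR's exact heuristics): start from a uniformly scaled estimate, then grow each axis by whole contiguous chunks until the next chunk would overflow the target.

```python
import math

def padded_buffer_shape(maxshape, chunk_shape, itemsize, buffer_gb):
    """Illustrative chunk-padding sketch (hypothetical, not the PR's exact
    code): begin with a uniformly scaled estimate, then pad axes one at a
    time with whole chunks while they still fit in the byte budget."""
    target = buffer_gb * 1e9
    chunk_bytes = math.prod(chunk_shape) * itemsize
    k = math.floor((target / chunk_bytes) ** (1 / len(chunk_shape)))
    buffer = [min(k * c, m) for c, m in zip(chunk_shape, maxshape)]
    # Pad: along each axis in turn, add contiguous chunks while they fit
    for axis, (c, m) in enumerate(zip(chunk_shape, maxshape)):
        while buffer[axis] + c <= m:
            trial = list(buffer)
            trial[axis] += c
            if math.prod(trial) * itemsize > target:
                break
            buffer = trial
    return tuple(buffer)

# Same hypothetical lopsided case as above: the long axis now absorbs the
# budget the capped short axis could not use
shape, chunks, itemsize = (1_000, 100_000_000), (64, 64), 2
padded = padded_buffer_shape(shape, chunks, itemsize, buffer_gb=1.0)
print(padded, math.prod(padded) * itemsize / 1e6, "MB")  # ~999.9 MB of 1 GB
```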

A follow-up to this improvement, allowing partial manual specification of the shape parameters (requested in catalystneuro/neuroconv#17), will also help in these kinds of cases by letting the user specify the semantic equivalent of 'load as much of the axes I've set to None as possible'.
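For illustration, one hypothetical semantics for that follow-up (the function name and parameters here are assumptions, not an existing API): honor fixed axes as given and expand each None axis to as many whole chunks as fit in the remaining byte budget.

```python
import math

def resolve_partial_buffer_shape(partial, maxshape, chunk_shape, itemsize, buffer_gb):
    """Hypothetical resolver for a partially specified buffer_shape (a sketch
    of the neuroconv#17 request, not an existing API): fixed axes are honored
    as given; each None axis takes an even share of the remaining element
    budget, rounded down to whole chunks and capped at the full axis length."""
    resolved = list(partial)
    fixed_bytes = itemsize * math.prod(v for v in partial if v is not None)
    remaining = buffer_gb * 1e9 / fixed_bytes  # elements left for None axes
    none_axes = [i for i, v in enumerate(partial) if v is None]
    n_left = len(none_axes)
    for axis in none_axes:
        share = remaining ** (1 / n_left)  # even split of what is left
        n_chunks = max(1, math.floor(share / chunk_shape[axis]))
        resolved[axis] = min(n_chunks * chunk_shape[axis], maxshape[axis])
        remaining /= resolved[axis]
        n_left -= 1
    return tuple(resolved)

# Hypothetical usage: pin the channel axis to 64, load as many frames as fit
shape, chunks = (100_000_000, 384), (64, 64)
print(resolve_partial_buffer_shape((None, 64), shape, chunks, itemsize=2, buffer_gb=1.0))
```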

CodyCBakerPhD commented 2 years ago

@h-mayorquin Any idea why all your review comments got duplicated 2-4 times?

h-mayorquin commented 2 years ago

> @h-mayorquin Any idea why all your review comments got duplicated 2-4 times?

Not really sure. Maybe it is because I have more than one tab opened when I submit the review.

CodyCBakerPhD commented 2 years ago

@h-mayorquin Check out latest changes reflecting your suggestions.

I also added back the original one-shot (unpadded) calculation, which can produce better results for more squarish shapes. The method now uses whichever shape produces a closer fill of the target buffer size, while still taking only a few iterations to perform the estimation.
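That selection step could be sketched like this (hypothetical helper and example numbers; the real method computes the candidate shapes itself): among the padded and unpadded candidates, keep the one whose byte size lands closest to the target.

```python
import math

def closer_fill(candidates, itemsize, buffer_gb):
    """Sketch of the selection step described above (hypothetical helper):
    given candidate buffer shapes, e.g. the padded and the original one-shot
    estimates, keep the one whose byte size is closest to the target."""
    target = buffer_gb * 1e9
    return min(candidates, key=lambda s: abs(math.prod(s) * itemsize - target))

# e.g. an unpadded ~45 MB estimate vs a padded ~1000 MB estimate, 1 GB target
print(closer_fill([(1000, 22336), (1000, 499968)], itemsize=2, buffer_gb=1.0))
# -> (1000, 499968), the shape that fills the buffer almost exactly
```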

CodyCBakerPhD commented 2 years ago

> and I don't know why I should expect those values

Based on the combinations of conditions that lead into the different blocks of the method:

- test_buffer_padding_long_shape with shape (10**7, 20) -> results in shape (68482, 20) by using the entire second axis and a partial fill on the first (both steps of the new method)
- test_buffer_padding_mixed_shape with shape (20, 40, 2401) -> results in shape (16, 32, 1920) by using the original method of a 'similar rectangle of smaller size'. This is probably also already tested on the hdmf side, so this test probably won't propagate.
- test_min_axis_too_large with shape (1000, 100) but an artificially low chunk-to-buffer size and ratio -> results in shape (20, 20) by using the new condition for when no consecutive slice of chunks along an axis fits in the buffer (the original method would have collapsed the buffer shape to the chunk shape)

h-mayorquin commented 2 years ago

> and I don't know why I should expect those values
>
> Based on the combinations of conditions that lead into the different blocks of the method:
>
> - test_buffer_padding_long_shape with shape (10**7, 20) -> results in shape (68482, 20) by using the entire second axis and a partial fill on the first (both steps of the new method)
> - test_buffer_padding_mixed_shape with shape (20, 40, 2401) -> results in shape (16, 32, 1920) by using the original method of a 'similar rectangle of smaller size'. This is probably also already tested on the hdmf side, so this test probably won't propagate.
> - test_min_axis_too_large with shape (1000, 100) but an artificially low chunk-to-buffer size and ratio -> results in shape (20, 20) by using the new condition for when no consecutive slice of chunks along an axis fits in the buffer (the original method would have collapsed the buffer shape to the chunk shape)

Thanks for describing them. Maybe it would be a good idea to add them to the test docstrings? As this is moving to hdmf, I don't know what standards they have over there.
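As a purely illustrative sketch of that suggestion (the test name and expected shape come from the comment above; the body is a hypothetical stand-in, since the real iterator lives in hdmf), the docstring convention could look like:

```python
def test_buffer_padding_long_shape():
    """Very lopsided shape (10**7, 20): the padded method should use the
    entire second axis and partially fill the first, exercising both steps
    of the new method; the expected buffer shape is (68482, 20).

    NOTE: hypothetical stand-in body -- the real test would instantiate the
    iterator and compare against its computed buffer_shape.
    """
    expected = (68482, 20)
    assert expected[1] == 20    # full second axis is consumed
    assert expected[0] < 10**7  # first axis is only partially filled

test_buffer_padding_long_shape()
```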

codecov[bot] commented 2 years ago

Codecov Report

Merging #542 (30aec2c) into main (9d7f549) will increase coverage by 0.20%. The diff coverage is 100.00%.

:exclamation: Current head 30aec2c differs from pull request most recent head d1e4773. Consider uploading reports for the commit d1e4773 to get more accurate results


@@            Coverage Diff             @@
##             main     catalystneuro/nwb-conversion-tools#542      +/-   ##
==========================================
+ Coverage   87.87%   88.08%   +0.20%     
==========================================
  Files          59       59              
  Lines        3068     3105      +37     
==========================================
+ Hits         2696     2735      +39     
+ Misses        372      370       -2     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 88.08% <100.00%> (+0.20%) :arrow_up: |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| src/nwb_conversion_tools/tools/hdmf.py | 100.00% <100.00%> (ø) |
| ...version_tools/tools/roiextractors/roiextractors.py | 79.83% <0.00%> (+0.15%) :arrow_up: |