Closed · CodyCBakerPhD closed 2 years ago
@h-mayorquin Any idea why all your review comments got duplicated 2-4 times?
Not really sure. Maybe it is because I had more than one tab open when I submitted the review.
@h-mayorquin Check out the latest changes reflecting your suggestions.
I also added the original one-shot calculation (unpadded), which can produce better results for more squarish shapes. The method now uses whichever shape produces a closer fill of the target buffer size, while still keeping the estimation to only a few iterations.
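A minimal sketch of what that selection step could look like, assuming hypothetical helper names and a byte-based target (illustrative only, not the actual implementation):

```python
import math
from typing import Tuple

# Hypothetical sketch of the selection step: keep whichever candidate buffer
# shape comes closer to the target size without exceeding it.
def fill_fraction(shape: Tuple[int, ...], itemsize: int, target_bytes: int) -> float:
    """Fraction of the target buffer size occupied by a candidate shape."""
    return math.prod(shape) * itemsize / target_bytes

def select_buffer_shape(
    padded: Tuple[int, ...],
    unpadded: Tuple[int, ...],
    itemsize: int,
    target_bytes: int,
) -> Tuple[int, ...]:
    # Assumes at least one candidate fits within the target.
    fitting = [s for s in (padded, unpadded) if fill_fraction(s, itemsize, target_bytes) <= 1.0]
    return max(fitting, key=lambda s: fill_fraction(s, itemsize, target_bytes))
```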
> and I don't know why I should expect those values

Based on the combinations of conditions that lead into the different branches of the method (a hedged test sketch follows the list):

- `test_buffer_padding_long_shape` with shape `(10**7, 20)` -> results in shape `(68482, 20)` by using the entire second axis and a partial fill on the first (both steps of the new method).
- `test_buffer_padding_mixed_shape` with shape `(20, 40, 2401)` -> results in shape `(16, 32, 1920)` by using the original method of a 'similar rectangle of smaller size'. This is also probably currently tested on the hdmf side, so this test probably won't propagate.
- `test_min_axis_too_large` with shape `(1000, 100)` but an artificially low chunk-to-buffer size and ratio -> results in shape `(20, 20)` by using the new condition where no consecutive axis slice fits in the buffer (the original method would have collapsed the buffer shape to the chunk shape).
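For context, a hedged sketch of how one of these could be exercised against `hdmf`'s `GenericDataChunkIterator`; the iterator subclass and the exact `buffer_gb`/`chunk_mb` settings of the real tests are not shown here, so the assertion below only checks the size invariant rather than the exact `(68482, 20)` result:

```python
import numpy as np
from hdmf.data_utils import GenericDataChunkIterator

class ArrayIterator(GenericDataChunkIterator):
    """Minimal in-memory iterator for exercising buffer-shape estimation
    (illustrative only; not the fixture used in the actual tests)."""

    def __init__(self, array: np.ndarray, **kwargs):
        self.array = array
        super().__init__(**kwargs)

    def _get_data(self, selection):
        return self.array[selection]

    def _get_maxshape(self):
        return self.array.shape

    def _get_dtype(self):
        return self.array.dtype

def test_buffer_padding_long_shape():
    """Shape (10**7, 20): the new method uses the entire second axis and a
    partial fill on the first; the estimated buffer must stay within the
    requested buffer_gb."""
    iterator = ArrayIterator(array=np.empty(shape=(10**7, 20), dtype="int16"), buffer_gb=0.1)
    assert np.prod(iterator.buffer_shape) * iterator.array.dtype.itemsize <= 0.1 * 1e9
```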
Thanks for describing them. Maybe it would be a good idea to add them to the test docstrings? As this is moving to hdmf,
I don't know what standards they have over there.
Merging #542 (30aec2c) into main (9d7f549) will increase coverage by 0.20%. The diff coverage is 100.00%.

:exclamation: Current head 30aec2c differs from pull request most recent head d1e4773. Consider uploading reports for the commit d1e4773 to get more accurate results.

```diff
@@            Coverage Diff             @@
##             main     #542      +/-   ##
==========================================
+ Coverage   87.87%   88.08%   +0.20%
==========================================
  Files          59       59
  Lines        3068     3105      +37
==========================================
+ Hits         2696     2735      +39
+ Misses        372      370       -2
```

| Flag | Coverage Δ |
|---|---|
| unittests | 88.08% <100.00%> (+0.20%) :arrow_up: |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| src/nwb_conversion_tools/tools/hdmf.py | 100.00% <100.00%> (ø) |
| ...version_tools/tools/roiextractors/roiextractors.py | 79.83% <0.00%> (+0.15%) :arrow_up: |
It has been known since the original implementation of the `GenericDataChunkIterator` that the default buffer shape method (https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/data_utils.py#L310-L329), given a target buffer size, was a tad inefficient for extremely lopsided distributions of axis lengths (i.e., when one axis is many orders of magnitude longer than the others). The main justification for being OK with this was that we allow manual specification of the `buffer_shape`, which gives the user full control over calculating such things.

So far this has not been too severe; statistically speaking, over the set of permutations of 2D arrays ranging from `1,000-10,000 x 1,000-100,000`, the default method is able to specify a buffer shape within at most a 5% error of the target `buffer_gb` size (so only 50 MB off). However, for 2D arrays with one axis in the range of `~1,000` and the other around `~100,000,000`, the resulting buffer shapes tend to vastly underestimate the target.

With respect to our most common use case of this iterator (`ecephys` traces), this essentially means that single-probe SpikeGLX recordings of more than 30 minutes in duration will start to experience decreased speeds with iterative write (unless, of course, a hard-coded buffer shape was originally specified).

Anyway, this PR introduces some heuristics for buffer 'padding' that produce equivalent performance in the previous axis-scaling regimes but fix the issue for very disproportionate ratios by filling the buffer with as many contiguous chunks along an axis as possible (sketched below).
A follow-up to this improvement that allows partial manual specification of the shape parameters (requested by catalystneuro/neuroconv#17) will also help in these kinds of cases by letting the user specify the semantic equivalent of 'load as much of the axes I've set to `None` as possible'.
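To illustrate those semantics with a hypothetical helper (no such API exists yet), a `None` entry could be resolved against the remaining buffer budget after honoring the explicitly fixed axes:

```python
import math
from typing import Optional, Tuple

# Hypothetical sketch of partial buffer-shape specification: axes set to
# None are filled as far as the remaining byte budget allows, while
# explicitly set axes are honored verbatim.
def resolve_partial_buffer_shape(
    requested: Tuple[Optional[int], ...],
    maxshape: Tuple[int, ...],
    itemsize: int,
    buffer_bytes: int,
) -> Tuple[int, ...]:
    resolved = [r if r is not None else 1 for r in requested]
    for axis, request in enumerate(requested):
        if request is not None:
            continue  # axis was fixed by the user
        elements_on_other_axes = math.prod(resolved) // resolved[axis]
        budget = buffer_bytes // (itemsize * elements_on_other_axes)
        resolved[axis] = max(1, min(budget, maxshape[axis]))
    return tuple(resolved)

# e.g. 'load as many frames as fit, 64 channels at a time':
# resolve_partial_buffer_shape(
#     requested=(None, 64), maxshape=(10**7, 128), itemsize=2, buffer_bytes=10**9
# ) -> (7812500, 64)
```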