Closed: rly closed this issue 6 months ago
I'm glad you found that alternative method! I had been looking for a solution to this for a while (even delving into the hdf5 source code), but was not able to discover that alternative.
> If we assume it is linear, then to get chunk info for ALL chunks would take about 56 hours. Using the new method, getting the chunk info for ALL chunks takes about 1-6 seconds. The variation between 1 and 6 seconds might depend on hdf5 caching.
Incredible.
Okay so this should be high priority. Do you want me to take a crack at it?
I started taking a crack at it. I'll let you know my progress in the morning. Might need some eyes on refactoring.
When working with datasets that have a large number of chunks and requesting that their chunk information be cached in the LINDI file (e.g., by setting `num_dataset_chunks_threshold` to a very large number or None), the performance of h5py/HDF5 `get_chunk_info` for getting chunk locations and offsets is terrible. This was reported here: https://github.com/h5py/h5py/issues/2117

Since then, h5py 3.8 was released, which adds the method `h5py.h5d.DatasetID.chunk_iter()`, which is significantly faster at retrieving chunk information. This method works only for HDF5 1.12.3 and above. Unfortunately, the latest release of h5py on PyPI (3.11.0) for Mac bundles HDF5 1.12.2, while the pre-built packages for Linux and Windows bundle HDF5 1.14.2. To use this faster method on Mac, we have to install the latest h5py from conda-forge or build it from source against HDF5 1.12.3 or later.

Use case: this nwb file in dandiset 000717 has just over 1 million chunks. Using the old method, getting the chunk info for the first 100 chunks takes about 15-23 seconds, and the time appears to be roughly, but not exactly, linear in the number of chunks requested. If we assume it is linear, then getting chunk info for ALL chunks would take about 56 hours. Using the new method, getting the chunk info for ALL chunks takes about 1-6 seconds. The variation between 1 and 6 seconds might depend on hdf5 caching.
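Whether `chunk_iter` is usable can be detected at runtime. A minimal sketch, assuming the 1.12.3 cutoff stated above (the helper names here are hypothetical, not part of h5py or LINDI):

```python
def hdf5_supports_chunk_iter(libversion):
    """Return True if the bundled HDF5 library is new enough for
    H5Dchunk_iter (1.12.3 and above, per the h5py 3.8 release)."""
    return tuple(libversion) >= (1, 12, 3)


def chunk_iter_available():
    """Hypothetical helper: chunk_iter needs both h5py >= 3.8 (which
    defines the method) and a recent enough bundled HDF5 library."""
    import h5py
    return (
        hasattr(h5py.h5d.DatasetID, "chunk_iter")
        and hdf5_supports_chunk_iter(h5py.h5.get_libversion())
    )
```

Since h5py only compiles `chunk_iter` when built against a supporting HDF5, the `hasattr` check alone may suffice in practice; the explicit version check is belt-and-suspenders.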
I suggest we use `chunk_iter` if available and fall back to `get_chunk_info` with a warning if there are a large number of chunks. I was going to suggest using `tqdm` to monitor getting the chunks, but given the 1-6 second speed of the new method, I think that is not necessary.
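That fallback could look roughly like the sketch below. The function name and the warning threshold are hypothetical choices for illustration; the `chunk_offset`, `byte_offset`, and `size` attributes are the ones on the `StoreInfo` objects that both `get_chunk_info` and the `chunk_iter` callback provide:

```python
import warnings

CHUNK_WARN_THRESHOLD = 10_000  # hypothetical cutoff for warning


def collect_chunk_info(dsid, warn_threshold=CHUNK_WARN_THRESHOLD):
    """Gather (chunk_offset, byte_offset, size) for every stored chunk
    of an h5py.h5d.DatasetID-like object.

    Prefers the fast chunk_iter() (h5py >= 3.8 built against
    HDF5 >= 1.12.3) and falls back to per-chunk get_chunk_info(),
    warning when the dataset has many chunks.
    """
    chunks = []
    if hasattr(dsid, "chunk_iter"):
        # chunk_iter invokes the callback once per stored chunk
        dsid.chunk_iter(
            lambda info: chunks.append(
                (info.chunk_offset, info.byte_offset, info.size)
            )
        )
    else:
        n = dsid.get_num_chunks()
        if n > warn_threshold:
            warnings.warn(
                f"chunk_iter is unavailable; reading info for {n} chunks "
                "with get_chunk_info may be very slow"
            )
        for i in range(n):
            info = dsid.get_chunk_info(i)
            chunks.append((info.chunk_offset, info.byte_offset, info.size))
    return chunks
```

Ordering may differ between the two paths (`chunk_iter` yields chunks in their on-disk order), so callers should not rely on a particular chunk order.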