NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

Use chunk_iter to avoid slow performance for files with many chunks #67

Closed rly closed 4 months ago

rly commented 4 months ago

When working with datasets that have a large number of chunks and requesting that their chunk information be cached in the LINDI file (e.g., by setting num_dataset_chunks_threshold to a very large number or None), the performance of h5py/HDF5's get_chunk_info for retrieving chunk locations and offsets is terrible. This was reported here: https://github.com/h5py/h5py/issues/2117

Since then, h5py 3.8 was released, which adds the method h5py.h5d.DatasetID.chunk_iter(); it is significantly faster at retrieving chunk information. This method works only with HDF5 1.12.3 and above. Unfortunately, the latest release of h5py on PyPI (3.11.0) for Mac bundles HDF5 1.12.2, while the pre-built packages for Linux and Windows bundle HDF5 1.14.2. To use the faster method on Mac, we have to install the latest h5py from conda-forge or build it from source against HDF5 1.12.3 or later.
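
For reference, here is a minimal way to check which h5py and bundled HDF5 versions an environment has and whether chunk_iter is exposed (a small sketch; the 1.12.3 threshold is the one mentioned above, and checking for the attribute is simply a practical availability test):

import h5py

# Report the h5py release and the HDF5 library it was built against.
print("h5py:", h5py.version.version)
print("HDF5:", h5py.version.hdf5_version)

# chunk_iter needs HDF5 1.12.3+ (see above); on builds against older HDF5
# the method is simply not present on DatasetID, so hasattr works as a check.
print("chunk_iter available:", hasattr(h5py.h5d.DatasetID, "chunk_iter"))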

Use case: this NWB file in dandiset 000717 has just over 1 million chunks. Using the old method, getting the chunk info for the first 100 chunks takes about 15-23 seconds, and the time appears to be roughly, but not exactly, linear in the number of chunks requested. If we assume it is linear, then getting chunk info for ALL chunks would take about 56 hours. Using the new method, getting the chunk info for ALL chunks takes about 1-6 seconds. The variation between 1 and 6 seconds might depend on HDF5 caching.
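
The 56-hour figure is just a linear extrapolation of the measured times; a back-of-the-envelope check, using the midpoint of the 15-23 second range and a round 1,000,000 chunks:

# Linear extrapolation of the old method's cost to all chunks.
seconds_per_100_chunks = 20      # midpoint of the measured 15-23 s
num_chunks = 1_000_000           # approximate chunk count of the dataset above
total_hours = seconds_per_100_chunks / 100 * num_chunks / 3600
print(total_hours)               # ~55.6, i.e. about 56 hours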

I suggest we use chunk_iter when it is available and fall back to get_chunk_info, with a warning if there are a large number of chunks (a sketch of that fallback follows the two benchmark scripts below). I was going to suggest using tqdm to monitor the chunk retrieval, but given the 1-6 second speed of the new method, I don't think it is necessary.

# Benchmark the old method: per-chunk get_chunk_info() on the first 100 chunks.
from tqdm import tqdm
import h5py
import timeit

url_or_path = "/Users/rly/Downloads/sub-R6_ses-20200206T210000_behavior+ophys.nwb"
with h5py.File(url_or_path, "r") as f:
    start_time = timeit.default_timer()
    h5_dataset = f["/acquisition/TwoPhotonSeries/data"]
    dsid = h5_dataset.id  # low-level h5py.h5d.DatasetID
    # for i in tqdm(range(100)):
    for i in range(100):
        chunk_info = dsid.get_chunk_info(i)  # one HDF5 call per chunk

    end_time = timeit.default_timer()
    elapsed_time = end_time - start_time
    print(f'Time elapsed: {elapsed_time} seconds')
# Benchmark the new method: a single chunk_iter() pass over ALL chunks.
import h5py
import timeit

url_or_path = "/Users/rly/Downloads/sub-R6_ses-20200206T210000_behavior+ophys.nwb"
with h5py.File(url_or_path, "r") as f:
    start_time = timeit.default_timer()
    h5_dataset = f["/acquisition/TwoPhotonSeries/data"]
    dsid = h5_dataset.id
    stinfo = list()
    dsid.chunk_iter(stinfo.append)  # invokes the callback once per chunk
    print(len(stinfo))
    print(stinfo[-1])

    end_time = timeit.default_timer()
    elapsed_time = end_time - start_time
    print(f'Time elapsed: {elapsed_time} seconds')
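
Here is a rough sketch of the proposed fallback logic (the function name get_all_chunk_info and the warning threshold are made up for illustration; the actual implementation in lindi may look different):

import warnings
import h5py

def get_all_chunk_info(h5_dataset, warn_threshold=1000):
    # Collect chunk info for every chunk of a chunked HDF5 dataset.
    dsid = h5_dataset.id
    if hasattr(dsid, "chunk_iter"):
        # Fast path: h5py >= 3.8 built against HDF5 >= 1.12.3 exposes chunk_iter,
        # which walks the chunk index in a single pass.
        chunk_infos = []
        dsid.chunk_iter(chunk_infos.append)
        return chunk_infos
    # Slow path: one HDF5 call per chunk; warn when this is likely to take long.
    num_chunks = dsid.get_num_chunks()
    if num_chunks > warn_threshold:
        warnings.warn(
            f"Dataset has {num_chunks} chunks and chunk_iter is not available; "
            "falling back to get_chunk_info, which may be very slow."
        )
    return [dsid.get_chunk_info(i) for i in range(num_chunks)]

# Example usage with the dataset from the benchmarks above:
# with h5py.File(url_or_path, "r") as f:
#     infos = get_all_chunk_info(f["/acquisition/TwoPhotonSeries/data"])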
magland commented 4 months ago

I'm glad you found that alternative method! I had been looking for a solution to this for a while (even delving into the HDF5 source code), but was not able to discover that alternative.

If we assume it is linear, then to get chunk info for ALL chunks would take about 56 hours. Using the new method, getting the chunk info for ALL chunks takes about 1-6 seconds. The variation between 1 and 6 seconds might depend on hdf5 caching.

Incredible.

Okay so this should be high priority. Do you want me to take a crack at it?

rly commented 4 months ago

I started taking a crack at it. I'll let you know my progress in the morning. Might need some eyes on refactoring.