NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

slow get_chunk_info => need link to hdf5 mechanism #6

Closed · magland closed this issue 5 months ago

magland commented 5 months ago

It appears that the low-level get_chunk_info() function in hdf5 is extremely inefficient when a dataset has a very large number of chunks, at least when the h5 file is remote. Here's a script that demonstrates it hanging when the number of chunks is ~1.3 million:

https://github.com/NeurodataWithoutBorders/lindi/blob/jfm/scratch/dev1/demonstrate_slow_get_chunk_info.py

It's surprising to me how inefficient that lookup is. I looked into the hdf5 C code and I can see that it iterates over all the chunks!
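
For context, here is a stripped-down sketch (not the linked script) of the kind of enumeration loop that hangs. It assumes the file is reachable over HTTPS via fsspec and uses h5py's low-level chunk API; the URL and dataset path are placeholders.

import time

import fsspec
import h5py

url = "https://example.org/some_remote_file.nwb"  # placeholder URL
dataset_path = "/acquisition/ElectricalSeries/data"  # placeholder path

# Open the remote file as a file-like object and hand it to h5py
remote = fsspec.filesystem("https").open(url, "rb")
with h5py.File(remote, "r") as f:
    dsid = f[dataset_path].id
    num_chunks = dsid.get_num_chunks()
    print(f"dataset has {num_chunks} chunks")

    t0 = time.time()
    for i in range(num_chunks):
        # Each call scans the chunk index internally, which is what makes
        # this loop blow up for millions of chunks on a remote file
        info = dsid.get_chunk_info(i)
        # info.byte_offset and info.size are what a per-chunk reference
        # would need for every one of those chunks
    print(f"enumerated chunk info in {time.time() - t0:.1f} s")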

This is another reason why it's impractical to create a chunk reference for a dataset with many chunks in a remote h5 file. Of course, this would be less of a problem if we had the file locally, but during development that's not feasible.

So as we discussed, this points to the need to link to the array in the remote file rather than enumerate its chunks, and we need a standardized way to express that link. Here's one possibility:

/path/to/dataset
  .zattrs
  .zarray
  .link
  # Notice that the chunk files are absent

And then .link would be a JSON file containing:

{
  "link_type": "hdf5_object",
  "url": "...",
  "name": "..."
}
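
For illustration, a reader encountering this layout might resolve it roughly as follows. This is only a sketch of the proposed mechanism, not lindi code; the helper name and the fsspec/h5py-based resolution are assumptions on my part.

import json

import fsspec
import h5py

def resolve_linked_dataset(store, dataset_key):
    # 'store' is any zarr-style mapping of keys to bytes;
    # 'dataset_key' is e.g. "path/to/dataset"
    link = json.loads(store[f"{dataset_key}/.link"].decode("utf-8"))
    if link["link_type"] != "hdf5_object":
        raise ValueError(f"unsupported link_type: {link['link_type']}")
    # Open the remote hdf5 file and return the referenced object, so chunk
    # reads go straight to the remote file instead of to local chunk files
    # (which are absent by design)
    remote = fsspec.filesystem("https").open(link["url"], "rb")
    h5f = h5py.File(remote, "r")
    return h5f[link["name"]]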

Open to revisions/suggestions

@rly @oruebel

magland commented 5 months ago

Thinking more about this, I don't think it's a good idea to add extra files that don't conform to the zarr specification. Maybe we should instead put a special attribute on the dataset.
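
A rough sketch of what that could look like, writing the link information into the dataset's zarr attributes so the store stays spec-conformant. The attribute name _EXTERNAL_ARRAY_LINK and the zarr-python calls here are illustrative, not a settled convention.

import zarr

# Placeholder local store; in practice this would be the lindi store
root = zarr.open_group("example.zarr", mode="a")

# The placeholder dataset mirrors the remote dataset's metadata; shape,
# chunks, and dtype here are dummies and would come from the hdf5 dataset
ds = root.create_dataset(
    "path/to/dataset", shape=(1000, 384), chunks=(100, 384), dtype="f4"
)

# Store the link as an ordinary attribute instead of a non-conforming .link file
ds.attrs["_EXTERNAL_ARRAY_LINK"] = {
    "link_type": "hdf5_object",
    "url": "https://example.org/remote_file.nwb",  # placeholder
    "name": "/acquisition/ElectricalSeries/data",  # placeholder
}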