It appears that the low-level `get_chunk_info()` function in HDF5 is extremely inefficient when a dataset has a very large number of chunks, at least in the case where the h5 file is remote. Here's a script that demonstrates it hanging when the number of chunks is ~1.3 million: https://github.com/NeurodataWithoutBorders/lindi/blob/jfm/scratch/dev1/demonstrate_slow_get_chunk_info.py
It's surprising to me how inefficient that lookup is. I looked into the C code of HDF5 and I can see it is iterating over all the chunks!
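As a rough illustration of why that hangs, here is a pure-Python toy model (not the actual HDF5 implementation): if each lookup by chunk index has to walk the chunk list from the start, then reading the info for all n chunks costs O(n²) comparisons in total.

```python
# Toy model of a per-call linear chunk lookup. This is illustrative only;
# the real HDF5 library has its own chunk index structures.

def make_chunk_list(n):
    # Each entry stands in for one chunk record (logical offset, byte offset).
    return [{"index": i, "byte_offset": i * 4096} for i in range(n)]

def get_chunk_info_linear(chunks, i):
    # Linear scan from the start, as if the library iterated all chunks
    # to reach the i-th one.
    for rec in chunks:
        if rec["index"] == i:
            return rec
    raise IndexError(i)

def read_all_chunk_info(chunks):
    # One linear lookup per chunk makes the total cost quadratic.
    return [get_chunk_info_linear(chunks, i) for i in range(len(chunks))]
```

At n ≈ 1.3 million, a quadratic scan means on the order of 10¹² comparisons, which is consistent with the observed hang, and the situation only gets worse when each step also involves remote I/O.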
This is another reason why it's impractical to create chunk references for a dataset with many chunks in a remote h5 file. This would be less of a problem if we had the file locally, but during development that's not feasible.
So, as we discussed, this points to the need for a way to link to an array in a remote file. We need to come up with a standardized way to do this. Here's one possibility:
```
/path/to/dataset
  .zattrs
  .zarray
  .link
```

Notice that the chunk files are absent.
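The issue later says that `.link` would be a JSON file. One hypothetical shape for its contents is sketched below; every field name, the URL, and the `type` value are purely illustrative, not part of any agreed-upon specification.

```python
import json

# Hypothetical contents of a .link file. All keys and values here are
# made up for illustration; nothing below is from an existing spec.
link = {
    "type": "external_array_link",
    "url": "https://example.com/remote_file.h5",  # assumed remote HDF5 file
    "path": "/path/to/dataset",                   # dataset path inside that file
}

link_json = json.dumps(link, indent=2)
print(link_json)
```

A reader of the zarr store would then resolve the dataset by opening the referenced remote file at the given internal path instead of looking for chunk files.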
And then `.link` is a JSON file with the link information.

Open to revisions/suggestions.

@rly @oruebel

Thinking more about this, I don't think it's a good idea to add additional files that don't conform to the zarr specification. Maybe we need to put a special attribute on the dataset instead.
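The special-attribute alternative mentioned above could be sketched like this; the attribute key `_EXTERNAL_ARRAY_LINK` and its fields are illustrative names only, not an agreed convention.

```python
import json

# Sketch: instead of a non-standard .link file, embed the link inside the
# dataset's .zattrs under a reserved attribute key. The key name
# "_EXTERNAL_ARRAY_LINK" and all values are hypothetical.
zattrs = {
    "description": "example existing attribute",
    "_EXTERNAL_ARRAY_LINK": {
        "url": "https://example.com/remote_file.h5",
        "path": "/path/to/dataset",
    },
}

zattrs_json = json.dumps(zattrs, indent=2)
print(zattrs_json)
```

This keeps the store a plain zarr hierarchy (only `.zattrs` and `.zarray` on disk), at the cost of readers needing to know to check for the reserved attribute before assuming chunk files exist.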