NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

Use chunk_iter when adding chunk refs #68

Closed rly closed 2 months ago

rly commented 2 months ago

Fix #67.

I added lindi.LindiH5ZarrStore._util._get_all_chunk_info.

LindiH5ZarrStore._get_chunk_file_bytes_data was being called in two places:

  1. in to_reference_file_system.process_dataset in a loop over all chunks to get the chunk info for each chunk to add to the refs dict
  2. by _get_chunk_file_bytes which is called by __getitem__ for a single key

Because the new chunk_iter method returns the chunk info for all chunks and takes 1-5 seconds, it is best to use that in use case 1 above. The old get_chunk_info method takes about 0.2 seconds. It is best to use that for use case 2 above when a user wants to access a single key.

TODO: if the user wants to access more than ~10 keys for a given dataset, e.g., they are requesting a large slice, then probably it would be best to use the chunk_iter method and cache the chunk info in LindiH5ZarrStore. Alternatively, we could just always run chunk_iter and cache the chunk info.
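The threshold-then-cache idea above could be sketched as follows. This is a hypothetical illustration, not lindi's actual API: `ChunkInfoCache`, `bulk_chunk_scan`, and `single_chunk_lookup` are placeholder names standing in for `chunk_iter`-style and `get_chunk_info`-style lookups.

```python
# Hypothetical sketch: use the fast per-chunk lookup (~0.2 s each) for a few
# keys, but switch to one bulk scan (~1-5 s total) and cache the result once a
# dataset sees more than ~10 requests.

BULK_THRESHOLD = 10  # roughly the crossover point described above


class ChunkInfoCache:
    def __init__(self, bulk_chunk_scan, single_chunk_lookup):
        self._bulk_scan = bulk_chunk_scan          # returns {key: chunk info} for all chunks
        self._single_lookup = single_chunk_lookup  # returns chunk info for one key
        self._cache = None
        self._lookups = 0

    def get(self, key):
        if self._cache is not None:
            return self._cache[key]
        self._lookups += 1
        if self._lookups > BULK_THRESHOLD:
            # Too many individual lookups: pay the one-time bulk cost and cache.
            self._cache = self._bulk_scan()
            return self._cache[key]
        return self._single_lookup(key)
```

The always-run-and-cache alternative mentioned above is the degenerate case `BULK_THRESHOLD = 0`.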

I tried putting a bunch of if/else branches in LindiH5ZarrStore._get_chunk_file_bytes_data to handle the two use cases above, but it got messy and complex. So I just created a new function _add_chunk_info_to_refs for use case 1 which has some of the same code as _get_chunk_file_bytes_data. It needs to be refactored.

This also adds tqdm as a dependency for showing a progress bar when iterating through many chunks. I think that is useful overall, but we could move that to an optional dependency if we think it is not so important.
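If tqdm were moved to an optional dependency, a common pattern is a no-op fallback so the chunk-iteration code stays unchanged either way (a sketch of the pattern, not lindi's code):

```python
# Optional-dependency pattern: fall back to plain iteration when tqdm is absent.
try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, **kwargs):
        # No progress bar available; just pass the iterable through.
        return iterable

# Iteration code can now use tqdm unconditionally.
total = sum(tqdm(range(1000), desc="chunks"))
```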

@magland before I continue, can you take a look at this and let me know if I am understanding the code correctly and if the approach looks good?

TODO:

codecov-commenter commented 2 months ago

Codecov Report

Attention: Patch coverage is 82.75862%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 79.53%. Comparing base (1fc62ef) to head (2897504).

| Files | Patch % | Lines |
| --- | --- | --- |
| lindi/LindiH5ZarrStore/_util.py | 63.15% | 7 Missing :warning: |
| lindi/LindiH5ZarrStore/LindiH5ZarrStore.py | 92.30% | 3 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #68      +/-   ##
==========================================
- Coverage   81.35%   79.53%   -1.83%
==========================================
  Files          30       30
  Lines        2194     2238      +44
==========================================
- Hits         1785     1780       -5
- Misses        409      458      +49
```

:umbrella: View full report in Codecov by Sentry.

magland commented 2 months ago

Pushed some small changes to keep the linter happy (pyright / flake8). Passes tests now.

... continuing my review...

magland commented 2 months ago

@rly This all looks great

rly commented 2 months ago

@magland great! thanks for the look over

In https://github.com/NeurodataWithoutBorders/lindi/blob/1fc62ef070b9cb545704f69a71dad26acbdb6f7f/lindi/LindiH5ZarrStore/LindiH5ZarrStore.py#L573 when would chunks[i] ever be 0?

When I try to create an h5py dataset with a chunk dimension length = 0, I get an error:

```python
with h5py.File("test.h5", "w") as f:
    f.create_dataset("data", shape=(2, 2, 2), dtype="i4", chunks=(1, 1, 0))
```

```
ValueError: All chunk dimensions must be positive (all chunk dimensions must be positive)
```
rly commented 2 months ago

Maybe related: the line above says chunks = zarray_dict.get("chunks", None) but the lines below assume that chunks is not None. Is chunks always set in the zarray dict? I think Zarr always sets chunks.

```python
>>> root = zarr.open("test.zarr", mode="w")
>>> data = root.create_dataset("data", shape=(0, 0), chunks=None)
>>> data.chunks
(1, 1)
```
rly commented 2 months ago

Though in Zarr, you can explicitly set the chunk dimension lengths to 0...

```python
>>> root = zarr.open("test.zarr", mode="w")
>>> data = root.create_dataset("data", shape=(0, 0), chunks=(0, 0))
>>> data.chunks
(0, 0)
```
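Since Zarr permits zero-length chunk dimensions while h5py rejects them, any code that derives a chunk count from `.zarray` metadata may want to guard the division. A minimal sketch of such a guard, assuming a standalone helper (not lindi's actual implementation):

```python
import math


def num_chunks(shape, chunks):
    # Chunk count per the Zarr model: ceil(shape / chunk) in each dimension.
    # Guard against chunk lengths of 0, which Zarr permits but h5py rejects;
    # treat a zero-length chunk dimension as contributing no chunks.
    n = 1
    for s, c in zip(shape, chunks):
        n *= math.ceil(s / c) if c > 0 else 0
    return n
```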
magland commented 2 months ago

> when would chunks[i] ever be 0?

I have the following comment elsewhere in the code (it should probably be pasted above this line as well):

"the shape could be zero -- for example dandiset 000559 - acquisition/depth_video/data has shape [0, 0, 0]"

I'm not sure how the shape got to be that... but you can check that dandiset.

rly commented 2 months ago

"the shape could be zero -- for example dandiset 000559 - acquisition/depth_video/data has shape [0, 0, 0]" This dataset has shape = (0,0,0) and the chunks = None.

I just removed that chunk of code and wanted to make sure I wasn't missing an unusual edge case. I think the modification would work for this dataset, but I'll add a test.

rly commented 2 months ago

I just updated lindi/LindiH5ZarrStore/LindiH5ZarrStore.py so that when writing all chunks in to_reference_file_system, we no longer pre-compute the names or indices of the chunks. This means _add_chunk_info_to_refs no longer needs a list of key names; instead we generate them based on the metadata of each chunk. This removes the need for a lot of consistency checks in _add_chunk_info_to_refs. The order is the same as before.

I think a progress bar is useful. Annoyingly, dsid.get_num_chunks() is quite slow, and this is very noticeable when the dataset is remote, so I added _get_max_num_chunks to replace it for the progress bar. This count includes unallocated chunks. Rarely do NWB files have unallocated chunks. But in general, it provides, at worst, an overestimate of the remaining progress.
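The upper bound described above can be computed from the shape and chunk metadata alone, with no per-chunk queries against the (possibly remote) file. A standalone illustration of the idea, not lindi's actual `_get_max_num_chunks`:

```python
import math


def max_num_chunks(shape, chunks):
    # Upper bound on the chunk count: ceil(shape / chunk) in each dimension.
    # This counts unallocated chunks too, so it can overestimate what
    # dsid.get_num_chunks() would report, but it needs no chunk-index reads,
    # which makes it cheap enough to feed a progress bar's `total`.
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
```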

magland commented 2 months ago

> I just updated lindi/LindiH5ZarrStore/LindiH5ZarrStore.py so that when writing all chunks in to_reference_file_system, we no longer pre-compute the names or indices of the chunks. This means _add_chunk_info_to_refs no longer needs a list of key names; instead we generate them based on the metadata of each chunk. This removes the need for a lot of consistency checks in _add_chunk_info_to_refs. The order is the same as before.
>
> I think a progress bar is useful. Annoyingly, dsid.get_num_chunks() is quite slow, and this is very noticeable when the dataset is remote, so I added _get_max_num_chunks to replace it for the progress bar. This count includes unallocated chunks. Rarely do NWB files have unallocated chunks. But in general, it provides, at worst, an overestimate of the remaining progress.

This all sounds good to me.

magland commented 2 months ago

@rly is this ready to be merged yet? Is there something I can help with?

rly commented 2 months ago

I wanted to add some tests but let's go ahead and merge this so that we can use it in the benchmarks for the proposal. We can add tests in a separate PR.

magland commented 2 months ago

Okay, I just published 0.3.5 from main branch.