Closed — @rly closed this 2 months ago.
Attention: Patch coverage is 82.75862%, with 10 lines in your changes missing coverage. Please review.
Project coverage is 79.53%. Comparing base (1fc62ef) to head (2897504).
| Files | Patch % | Lines |
|---|---|---|
| lindi/LindiH5ZarrStore/_util.py | 63.15% | 7 Missing :warning: |
| lindi/LindiH5ZarrStore/LindiH5ZarrStore.py | 92.30% | 3 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
Pushed some small changes to keep the linter happy (pyright / flake8). Passes tests now.
... continuing my review...
@rly This all looks great
@magland great! thanks for the look over
In https://github.com/NeurodataWithoutBorders/lindi/blob/1fc62ef070b9cb545704f69a71dad26acbdb6f7f/lindi/LindiH5ZarrStore/LindiH5ZarrStore.py#L573, when would `chunks[i]` ever be 0?

When I try to create an h5py dataset with a chunk dimension length of 0, I get an error:

```python
with h5py.File("test.h5", "w") as f:
    f.create_dataset("data", shape=(2, 2, 2), dtype="i4", chunks=(1, 1, 0))
# ValueError: All chunk dimensions must be positive (all chunk dimensions must be positive)
```
Maybe related: the line above says `chunks = zarray_dict.get("chunks", None)`, but the lines below assume that `chunks` is not None. Is `chunks` always set in the zarray dict? I think Zarr always sets `chunks`:

```python
>>> root = zarr.open("test.zarr", mode="w")
>>> data = root.create_dataset("data", shape=(0, 0), chunks=None)
>>> data.chunks
(1, 1)
```
Though in Zarr, you can explicitly set the chunk dimension lengths to 0...

```python
>>> root = zarr.open("test.zarr", mode="w")
>>> data = root.create_dataset("data", shape=(0, 0), chunks=(0, 0))
>>> data.chunks
(0, 0)
```
> when would `chunks[i]` ever be 0?
I have the following in a comment elsewhere in the code (it should probably be pasted above this line as well):

> "the shape could be zero -- for example dandiset 000559 - acquisition/depth_video/data has shape [0, 0, 0]"

I'm not sure how the shape got to be that... but you can check that dandiset.
> "the shape could be zero -- for example dandiset 000559 - acquisition/depth_video/data has shape [0, 0, 0]"

This dataset has shape `(0, 0, 0)` and chunks `None`.
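For reference, here is a minimal sketch (assuming h5py is available; the filename is arbitrary) showing that an HDF5 dataset created with a zero-length shape and no explicit chunking reports `chunks` as `None`, matching the depth_video case:

```python
import h5py

# A dataset with a zero-length shape, mirroring the
# acquisition/depth_video/data case: no chunking is applied,
# so .chunks is None (the dataset is stored contiguously).
with h5py.File("zero_shape.h5", "w") as f:
    ds = f.create_dataset("data", shape=(0, 0, 0), dtype="i4")
    print(ds.shape)   # (0, 0, 0)
    print(ds.chunks)  # None
```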
I just removed that chunk of code and wanted to make sure I wasn't missing an unusual edge case. I think the modification would work for this dataset, but I'll add a test.
I just updated `lindi/LindiH5ZarrStore/LindiH5ZarrStore.py` so that when writing all chunks in `to_reference_file_system`, we no longer pre-compute the names or indices of the chunks. This means `_add_chunk_info_to_refs` no longer needs a list of key names; instead, we generate them from the metadata of each chunk. This removes the need for many consistency checks in `_add_chunk_info_to_refs`. The order is the same as before.
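As a rough sketch of that idea (the helper name and the Zarr v2 convention of dot-separated chunk coordinates are my assumptions here, not code from this PR), a reference key can be derived from each chunk's element offset and the chunk shape:

```python
# Hypothetical sketch: derive a Zarr v2-style chunk key from the chunk's
# element-offset coordinates reported by HDF5 and the dataset's chunk shape.
def chunk_ref_key(dataset_path: str, chunk_offset: tuple, chunks: tuple) -> str:
    # HDF5 reports each chunk's offset in element coordinates; dividing by
    # the chunk shape gives the chunk's indices in the chunk grid.
    coords = tuple(off // ch for off, ch in zip(chunk_offset, chunks))
    # Zarr v2 keys are dot-separated grid indices under the dataset path.
    return f"{dataset_path}/{'.'.join(str(c) for c in coords)}"

print(chunk_ref_key("acquisition/data", (4, 6, 0), (2, 3, 5)))
# acquisition/data/2.2.0
```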
I think a progress bar is useful. Annoyingly, `dsid.get_num_chunks()` is quite slow, and this is very noticeable when the dataset is remote, so I added `_get_max_num_chunks` to replace it for the progress bar. This count includes unallocated chunks. NWB files rarely have unallocated chunks, but in general it provides, at worst, an overestimate of the remaining progress.
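The overestimate amounts to a ceiling division per axis. A minimal sketch of what such a count could look like (the function name here is illustrative, not the actual implementation):

```python
import math

def max_num_chunks(shape: tuple, chunks: tuple) -> int:
    # Ceiling-divide each dimension by its chunk length to size the chunk
    # grid, counting chunks that may never have been allocated or written.
    return math.prod(-(-s // c) for s, c in zip(shape, chunks))

print(max_num_chunks((10, 10), (3, 4)))  # 12  (a 4 x 3 chunk grid)
```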
This all sounds good to me.
@rly is this ready to get merged in yet? Is there something I can help with?
I wanted to add some tests but let's go ahead and merge this so that we can use it in the benchmarks for the proposal. We can add tests in a separate PR.
Okay, I just published 0.3.5 from main branch.
Fix #67.
I added `lindi.LindiH5ZarrStore._util._get_all_chunk_info`.

`LindiH5ZarrStore._get_chunk_file_bytes_data` was being called in two places:

1. `to_reference_file_system.process_dataset`, in a loop over all chunks, to get the chunk info for each chunk to add to the refs dict
2. `_get_chunk_file_bytes`, which is called by `__getitem__` for a single key

Because the new `chunk_iter` method returns the chunk info for all chunks and takes 1-5 seconds, it is best to use that in use case 1 above. The old `get_chunk_info` method takes about 0.2 seconds, so it is best to use that for use case 2 above, when a user wants to access a single key.

TODO: if the user wants to access more than ~10 keys for a given dataset, e.g., they are requesting a large slice, then it would probably be best to use the `chunk_iter` method and cache the chunk info in `LindiH5ZarrStore`. Alternatively, we could just always run `chunk_iter` and cache the chunk info.

I tried putting a bunch of if/else branches in `LindiH5ZarrStore._get_chunk_file_bytes_data` to handle the two use cases above, but it got messy and complex. So I just created a new function, `_add_chunk_info_to_refs`, for use case 1, which shares some code with `_get_chunk_file_bytes_data`. It needs to be refactored.

This also adds `tqdm` as a dependency for showing a progress bar when iterating through many chunks. I think that is useful overall, but we could move it to an optional dependency if we think it is not so important.

@magland before I continue, can you take a look at this and let me know if I am understanding the code correctly and if the approach looks good?
TODO: