NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

Reading single chunk takes 10x longer than remfile #74

Open rly opened 1 month ago

rly commented 1 month ago

Using remfile as below:

import remfile
import h5py
import pynwb
import timeit

# URL to HDF5 NWB file
s3_url = "https://dandiarchive.s3.amazonaws.com/blobs/fec/8a6/fec8a690-2ece-4437-8877-8a002ff8bd8a"
byte_stream = remfile.File(url=s3_url)
file = h5py.File(name=byte_stream)
io = pynwb.NWBHDF5IO(file=file)
nwbfile = io.read()
data_to_slice = nwbfile.acquisition["ElectricalSeriesAp"].data

start = timeit.default_timer()
data_to_slice[0:10,0:384]
end = timeit.default_timer()
print(end - start)

Takes 0.2 seconds on my laptop.

Using lindi as below:

import lindi
import pynwb
import timeit

# URL to LINDI JSON of NWB file
s3_url = "https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/914/6aa/9146aa46-9c01-45be-9d2a-693e6a7bb778"
client = lindi.LindiH5pyFile.from_lindi_file(url_or_path=s3_url)
io = pynwb.NWBHDF5IO(file=client)
nwbfile = io.read()
data_to_slice = nwbfile.acquisition["ElectricalSeriesAp"].data

start = timeit.default_timer()
data_to_slice[0:10,0:384]
end = timeit.default_timer()
print(end - start)

Takes 2.4 seconds on my laptop.

The data chunk size is (13653, 384) with no compression. Nothing stands out in the LINDI JSON. I'm not sure if I am doing something wrong or if there is an inefficiency somewhere in the system.

I'll start looking into it. @magland, do you have any ideas about what might be going on?

magland commented 1 month ago

@rly

I think what's going on here is that h5py can read partial chunks (possible in this case because there is no compression), whereas lindi/zarr is set up to always read entire chunks.

According to the lindi.json file, the chunk size is [13653, 384].

Maybe this is a zarr limitation/constraint/feature?
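The difference can be sketched with a back-of-the-envelope byte count. The dtype isn't stated in the thread, so the 2-byte itemsize (int16) below is an assumption for illustration only:

```python
# Rough byte-count comparison for a [0:10, 0:384] slice of an
# uncompressed (13653, 384) chunk. The 2-byte itemsize (int16) is an
# assumption; the real dtype isn't given above.
itemsize = 2
chunk_shape = (13653, 384)
slice_shape = (10, 384)

# h5py can issue a byte-range request covering only the needed rows,
# because an uncompressed chunk has a predictable byte layout.
partial_bytes = slice_shape[0] * slice_shape[1] * itemsize

# lindi/zarr fetches the entire chunk before slicing it in memory.
full_chunk_bytes = chunk_shape[0] * chunk_shape[1] * itemsize

print(partial_bytes)                       # 7680
print(full_chunk_bytes)                    # 10485504
print(full_chunk_bytes // partial_bytes)   # 1365
```

So for this slice, whole-chunk reads transfer roughly three orders of magnitude more data, which is consistent with the 10x wall-clock gap once request latency is factored in.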

rly commented 1 month ago

Ah, that makes sense. After changing the slice size to equal the chunk size, lindi now takes only ~2x as long as remfile. Inspecting the execution, it looks like zarr requests the key acquisition/ElectricalSeriesAp/data/0.0 twice. I'm trying to figure out why.
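One way to confirm duplicate fetches is to wrap the store with a counting proxy before handing it to zarr. This is a hypothetical diagnostic helper, not part of lindi:

```python
# Hypothetical diagnostic wrapper (not part of lindi): counts how many
# times each key is requested from an underlying zarr-style store
# (any Mapping), to surface duplicate fetches like the one described.
from collections import Counter
from collections.abc import Mapping


class CountingStore(Mapping):
    def __init__(self, store):
        self._store = store
        self.counts = Counter()

    def __getitem__(self, key):
        self.counts[key] += 1
        return self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


# Usage sketch with a stand-in dict store; in practice you would wrap
# the real chunk store and then perform the slice.
store = CountingStore({"acquisition/ElectricalSeriesAp/data/0.0": b"..."})
_ = store["acquisition/ElectricalSeriesAp/data/0.0"]
_ = store["acquisition/ElectricalSeriesAp/data/0.0"]
duplicates = {k: n for k, n in store.counts.items() if n > 1}
print(duplicates)  # {'acquisition/ElectricalSeriesAp/data/0.0': 2}
```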

But also in digging through the Zarr code, I found that Zarr might be able to support partial reads: https://github.com/zarr-developers/zarr-python/blob/b1f4c509abaee1cb8dec18e3a973e1199226011a/src/zarr/v2/core.py#L2054-L2095

Right now, execution is going through the else because "get_partial_values" is not an attribute of LindiReferenceFileSystemStore.
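As a sketch of what that hook looks like: in zarr-python v2, a store that exposes get_partial_values(key_ranges) lets the array fetch byte sub-ranges of a chunk instead of whole chunks. The class below is illustrative only (not LindiReferenceFileSystemStore's actual code) and is backed by a plain dict; a remote store would translate each (start, length) range into an HTTP Range request instead:

```python
# Illustrative store (not lindi's actual code) implementing the zarr v2
# partial-read hook. key_ranges is an iterable of (key, (start, length))
# pairs; a remote store would turn each range into an HTTP Range request
# rather than slicing an in-memory blob.
class PartialReadStore:
    def __init__(self, data: dict):
        self._data = data

    def __getitem__(self, key):
        # Fallback path: whole-chunk read
        return self._data[key]

    def get_partial_values(self, key_ranges):
        # Return only the requested byte sub-ranges, one per pair
        out = []
        for key, (start, length) in key_ranges:
            blob = self._data[key]
            out.append(blob[start:start + length])
        return out


store = PartialReadStore({"data/0.0": bytes(range(16))})
# Read 4 bytes at offset 2 without transferring the rest of the chunk
parts = store.get_partial_values([("data/0.0", (2, 4))])
print(parts)  # [b'\x02\x03\x04\x05']
```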

magland commented 1 month ago

Ah. It will be good to figure out whether the duplicate request can be avoided... and/or whether we should implement some caching for this type of situation.

Do you think we should set the get_partial_values attribute somehow?

rly commented 1 month ago

> Do you think we should set the get_partial_values attribute somehow?

Yeah, I think that would be nice, but not urgent. For most large reads, I think it would not make a big difference, because the read will consist mostly of full chunks plus a partial chunk at the edge of each axis. And most big datasets are compressed, which rules out partial chunk reads anyway.
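To put a number on that: along each axis, only the chunk straddling the slice boundary is partial; everything interior must be fetched in full regardless. With a made-up slice length for illustration:

```python
# Illustrative arithmetic (hypothetical slice length) backing the point
# above: for a large read, only the boundary chunk on each axis is
# partial; interior chunks are read in full either way.
def chunk_counts(slice_len, chunk_len):
    """Return (full, partial) chunk counts along one axis for [0:slice_len]."""
    full = slice_len // chunk_len
    partial = 1 if slice_len % chunk_len else 0
    return full, partial


# e.g. reading 1,000,000 rows from a dataset chunked at 13653 rows
full, partial = chunk_counts(1_000_000, 13653)
print(full, partial)  # 73 1
```

So a partial-read path would save at most one chunk's worth of transfer per axis on a large read, which supports treating it as nice-to-have rather than urgent.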

If you have time, it would be great if you can take a look but no pressure. Otherwise, I'll try to take a look at it next week.

magland commented 1 month ago

Makes sense. I'm not going to work on it right now.