NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

[Idea] Experiment with multipart range requests for slicing uncompressed source data #72

Closed by rly 4 months ago

rly commented 4 months ago

When accessing a slice of a remote data array where the requested elements are not contiguous, I believe LINDI will make a bunch of small requests. For example, let's say we have a flat binary file containing a 2D array of float64 values (8 bytes each), shaped time by channel and ordered by time and then by channel, i.e., all data for time 0 is stored at the beginning of the file, then all data for time 1, etc. If we want to stream data for channel index c from time index i to time index j, then I believe LINDI/h5py would make j-i range requests of 8 bytes each, since this slice is not contiguous. If the data are chunked and the chunks are cached, this could result in fewer requests, but it would require downloading whole chunks, which might include data that was not requested. That is a tradeoff.
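To make the request count concrete, here is a minimal sketch of the byte ranges such a slice would touch. It assumes a headerless (or fixed-offset) row-major file and treats j as exclusive; the function name and parameters are hypothetical and are not part of LINDI or h5py.

```python
def column_slice_byte_ranges(n_channels, c, i, j, itemsize=8, base_offset=0):
    """Inclusive (start, end) byte ranges for array[i:j, c] in a
    row-major (time x channel) array of fixed-size elements.

    Assumption: the array starts at base_offset and is not compressed.
    """
    row_bytes = n_channels * itemsize  # bytes per time step
    ranges = []
    for t in range(i, j):
        start = base_offset + t * row_bytes + c * itemsize
        ranges.append((start, start + itemsize - 1))
    return ranges


# 4 channels, channel 1, time steps 2..4: one 8-byte range per time step
ranges = column_slice_byte_ranges(n_channels=4, c=1, i=2, j=5)
print(len(ranges), ranges)
```

Each tuple corresponds to one small range request if the ranges are fetched individually, which is the j-i count described above.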

Alternatively, LINDI could make a range request starting at [time i, channel 0] and ending at [time j, channel n-1], and then slice the data after download, but that means reading a lot of data unnecessarily. For uncompressed/unfiltered array data where the byte ranges for a given slice request should be computable, I wonder if we can do better.
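A rough sketch of this "one big range, slice locally" alternative, using only the standard library. It assumes little-endian float64 data, treats j as exclusive, and simulates the download with an in-memory byte string; the helper names are hypothetical.

```python
import struct


def bounding_byte_range(n_channels, i, j, itemsize=8, base_offset=0):
    """Inclusive byte range covering every channel for time steps i..j-1."""
    row_bytes = n_channels * itemsize
    return base_offset + i * row_bytes, base_offset + j * row_bytes - 1


def extract_channel(block, n_channels, c):
    """Pick channel c out of a downloaded block of whole rows (float64, little-endian)."""
    n_values = len(block) // 8
    values = struct.unpack(f"<{n_values}d", block)
    return list(values[c::n_channels])


# Simulated file: 5 time steps x 4 channels holding the values 0.0 .. 19.0
n_channels = 4
data = struct.pack("<20d", *[float(v) for v in range(20)])

start, end = bounding_byte_range(n_channels, i=1, j=4)
block = data[start:end + 1]          # stands in for a single range request
channel_2 = extract_channel(block, n_channels, c=2)

# Only 1/n_channels of the downloaded bytes were actually requested
useful_fraction = (len(channel_2) * 8) / len(block)
print(channel_2, useful_fraction)
```

The useful fraction shrinks as the channel count grows, which is the unnecessary read described above.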

I just discovered that HTTP supports multipart range requests, where a single Range header lists several byte ranges, e.g., "Range: bytes=0-50, 100-150". I wonder if this would allow you to effectively slice the binary array along any dimension while making a single request, and whether that is faster than the methods above.
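For illustration, a small helper that turns a list of inclusive byte ranges into that header value. The function name is hypothetical, and this only builds the header; whether a given server honors it is a separate question.

```python
def format_range_header(ranges):
    """Build a Range header value from inclusive (start, end) byte ranges."""
    if not ranges:
        raise ValueError("at least one byte range is required")
    for start, end in ranges:
        if start < 0 or end < start:
            raise ValueError(f"invalid byte range: {start}-{end}")
    return "bytes=" + ", ".join(f"{start}-{end}" for start, end in ranges)


header_value = format_range_header([(0, 50), (100, 150)])
print({"Range": header_value})
```

Note that the header grows with the number of ranges, so a long strided slice may exceed the header size limits of some servers or proxies.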

Relatedly, Neuralynx, SpikeGadgets, and I think TDT store raw binary data in groups (aka packets or records) as they are acquired. Each group of bytes represents the data received at one clock cycle / timestamp and often includes the timestamp value, data from each channel of each stream (neural, analog, etc.), and other metadata that might change with each packet. Multipart range requests might also work well for streaming strided array data like this, which might otherwise be inefficient to stream.
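As a sketch of how the ranges for such packetized data could be computed, assuming fixed-size packets after a fixed-size file header. The layout numbers below are made up for illustration and do not describe any real vendor format. The second helper merges ranges separated by small gaps, trading some extra bytes for fewer ranges.

```python
def packet_field_ranges(header_size, packet_size, field_offset, field_size, first, count):
    """Inclusive byte ranges of one field in packets first..first+count-1."""
    ranges = []
    for p in range(first, first + count):
        start = header_size + p * packet_size + field_offset
        ranges.append((start, start + field_size - 1))
    return ranges


def coalesce(ranges, max_gap=0):
    """Merge ranges whose gap is at most max_gap bytes (0 merges only adjacent ranges)."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical layout: 16-byte file header, 40-byte packets,
# a 2-byte sample located 8 bytes into each packet
field_ranges = packet_field_ranges(16, 40, 8, 2, first=0, count=3)
print(field_ranges)
print(coalesce(field_ranges, max_gap=38))  # gaps are 38 bytes, so everything merges
```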

This is not a priority. Just an idea that came up when brainstorming potential uses of LINDI, particularly for uncompressed source data.

rly commented 4 months ago

Amazon S3 doesn't support retrieving multiple ranges of data per GET request. https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#API_GetObject_RequestSyntax

We are making requests over HTTP rather than through the S3 API, though. (I also wonder whether there is a performance difference between the two.)

I haven't experimented with this at all. Just wanted to record my thoughts.

rly commented 4 months ago

A quick test suggests that multipart HTTP GET requests to DANDI do not work. The request either hangs or I get an SSLError: EOF occurred in violation of protocol (_ssl.c:2427)... Further testing needed.

magland commented 4 months ago

I wasn't aware of multipart range requests.

I also ran into an issue when trying to make a multipart range request to the DANDI API.

I tried

curl -L -H "Range: bytes=0-100, 200-300" --output tmp.dat https://api.dandiarchive.org/api/assets/b707be3b-774e-4ff0-a18a-fea94cf56c6d/download/

and it seems to be trying to download the entire file

whereas single part seems to work

curl -L -H "Range: bytes=0-100" --output tmp.dat https://api.dandiarchive.org/api/assets/b707be3b-774e-4ff0-a18a-fea94cf56c6d/download/

Same issue if I try to fetch the S3 URL directly.
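One way to tell these outcomes apart programmatically is to look at the status code and headers of the response. HTTP lets a server ignore a Range header and answer 200 with the full body, while a multi-range answer comes back as 206 with a multipart/byteranges content type. A minimal sketch, with a hypothetical function name and headers passed as a plain dict:

```python
def classify_range_response(status, headers):
    """Describe how a server responded to a request carrying a Range header."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    content_type = h.get("content-type", "").lower()
    if status == 206 and content_type.startswith("multipart/byteranges"):
        return "multipart"      # several ranges returned in one body
    if status == 206 and "content-range" in h:
        return "single-range"   # exactly one range returned
    if status == 200:
        return "ignored"        # Range not honored, full file is being sent
    if status == 416:
        return "unsatisfiable"  # requested ranges lie outside the file
    return "other"


print(classify_range_response(200, {"Content-Type": "application/octet-stream"}))
print(classify_range_response(206, {"Content-Range": "bytes 0-100/5000"}))
```

A client could use a check like this to fall back to individual range requests when the result is "ignored", instead of downloading the whole file.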

rly commented 4 months ago

Yeah, I think this won't work then. Oh well.