NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License
2 stars 1 forks source link

single binary lindi using tar #89

Open magland opened 1 month ago

magland commented 1 month ago

A second attempt at a single file binary lindi format (the first attempt was #88). This one uses tar and a special index inspired by https://pypi.org/project/indexedtar/

See examples/write_lindi_binary.py

Will describe more later.

codecov-commenter commented 1 month ago

Codecov Report

Attention: Patch coverage is 43.55263% with 429 lines in your changes missing coverage. Please review.

Project coverage is 71.07%. Comparing base (ff05704) to head (1c495e5).

Files Patch % Lines
lindi/tar/lindi_tar.py 12.65% 207 Missing :warning:
lindi/LindiH5pyFile/LindiH5pyFile.py 57.96% 66 Missing :warning:
lindi/tar/create_tar_header.py 3.70% 52 Missing :warning:
lindi/tar/LindiTarStore.py 29.82% 40 Missing :warning:
...ndi/LindiH5pyFile/LindiReferenceFileSystemStore.py 62.85% 39 Missing :warning:
lindi/LindiH5ZarrStore/LindiH5ZarrStore.py 85.03% 19 Missing :warning:
...ndi/conversion/create_zarr_dataset_from_h5_data.py 81.25% 3 Missing :warning:
...indi/LindiH5pyFile/writers/LindiH5pyGroupWriter.py 0.00% 2 Missing :warning:
lindi/LindiH5ZarrStore/_util.py 75.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #89 +/- ## ========================================== - Coverage 79.47% 71.07% -8.40% ========================================== Files 30 33 +3 Lines 2265 2939 +674 ========================================== + Hits 1800 2089 +289 - Misses 465 850 +385 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

magland commented 1 month ago

The idea here is to support a lindi.tar format in addition to lindi.json, the advantage being that the .tar file can contain binary chunks -- see the intro comment of #88

The tar contains a top-level file called lindi.json, which is the usual JSON format, but it can refer to other files in the .tar with relative paths. So, if you extract the tar, then you can load the lindi.json without any problem.

But there are a couple challenges to overcome.

First, tar format does not allow deleting or resizing of files. This is a problem when we write to the lindi.tar, because the lindi.json needs to get updated.

The second issue is that .tar is not cloud-friendly in the index cannot get pulled down in one chunk. The files are stored sequentially with a header before each one. So if there are thousands of large files, you may need to make thousands of network requests.

So, to address these issues (and inspired by https://pypi.org/project/indexedtar/) here's what I did

There are two additional files in the .tar: .tar_entry.json and .tar_index.json. The first is a fixed size file that must always be the first in the archive. It contains information about the location in the file of .tar_index.json. This index has all the byte offsets of all the files in the archive (including lindi.json). So once we download that file, we're good to go.

Now let's say we're creating this lindi.tar locally. What happens when we need to update the lindi.json or the .tar_index.json? Well we can pad them with extra whitespace so that they can grow to a certain extent without changing any of the tar headers (we can just overwrite the content). Then if one of these files grows too large, we need to replace it. To do that, we just update the file name in the header to ./trash/xxxxx, effectively deleting it, and then append a new JSON to the end of the archive. And every time a new file (e.g., a blob) is appended to the archive, the index (content) needs to get update. If the index is ever replaced, the .tar_entry.json content needs to get updated - but that never changes size (and remember it's always the first file sequentially).

@rly @oruebel

magland commented 1 month ago

Also see this example/test, showing how you can treat the nwb.lindi.tar just like nwb (h5) -- no staging area needed.

https://github.com/NeurodataWithoutBorders/lindi/blob/lindi-tar/examples/example_tar_nwb.py

magland commented 1 month ago

And here's an example of ammending a remote h5 nwb file. The generated .lindi file has the new dataset chunks in it and also references the remote chunks.

https://github.com/NeurodataWithoutBorders/lindi/blob/lindi-tar/examples/example_ammend_remote_nwb.py

magland commented 1 month ago

Resulting from our discussion I decided to require the .tar extension, so that it's not just file.lindi with tar implied. You need to expicitly include the .tar extension. This leaves room for .lindi meaning something else in the future (perhaps a directory).

Important to note. If you extract a file.lindi.tar you will get a lindi.json and other files. You can then open that lindi.json using lindi python OR using fsspec, and it is valid zarr.