NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

single file binary lindi #88

Closed magland closed 1 month ago

magland commented 1 month ago

(built on #84 )

Motivation

As we have discussed, there are advantages to file.nwb.lindi.json being in JSON format: it can be parsed from many different languages and tools. But there are some important limitations.

So I was thinking it would be nice to have a nwb.lindi binary file (not JSON) that has the reference file system (as JSON) embedded in it in a way that is as easy as possible to parse out (of course it won't be as easy as a true .json file), while at the same time allowing binary blobs to be appended to the .lindi file, with the RFS able to refer to chunks within the file itself.

Wait, isn't this reinventing HDF5?

Well, there are some important drawbacks to HDF5.

Each of these is a deal breaker for me, especially 2 and 3.

So here's a simple example script that shows how this could work:

import lindi

def write_lindi_binary():
    # create_binary=True writes the new single-file binary .lindi format
    with lindi.LindiH5pyFile.from_lindi_file('test.lindi', mode='w', create_binary=True) as f:
        f.attrs['test'] = 42
        ds = f.create_dataset('data', shape=(1000, 1000), dtype='f4')
        ds[...] = 42

def test_read():
    # Reading works the same as for a JSON .lindi file
    f = lindi.LindiH5pyFile.from_lindi_file('test.lindi', mode='r')
    print(f.attrs['test'])
    print(f['data'][0, 0])
    f.close()

if __name__ == "__main__":
    write_lindi_binary()
    test_read()

What happens? A binary .lindi file is created and it doesn't depend on any staging area or other chunks.

Getting into the weeds, here's what the top of the test.lindi file looks like

{"format": "lindi1", "rfs_start": 1024, "rfs_size": 641, "rfs_padding": 1047935}

followed by a bunch of zero bytes.

When reading the file, we recognize it as "lindi1", the lindi binary format. We see that the reference file system is embedded in the file at location 1024 with a size of 641. So with a second request we can get the entire RFS, just as if we were reading a JSON file. Here's what that looks like:

{"refs": {".zgroup": {"zarr_format": 2}, ".zattrs": {"test": 42}, "data/.zarray": {"ch
unks": [250, 500], "compressor": {"blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1}, "dtype": "<f4", "fill_value": 0
.0, "filters": null, "order": "C", "shape": [1000, 1000], "zarr_format": 2}, "data/0.0": [".", 1049600, 2040], "data/0.1": [".", 1
051640, 2040], "data/1.0": [".", 1053680, 2040], "data/1.1": [".", 1055720, 2040], "data/2.0": [".", 1057760, 2040], "data/2.
1": [".", 1059800, 2040], "data/3.0": [".", 1061840, 2040], "data/3.1": [".", 1063880, 2040]}}

You can see there are references to binary chunks - and the "." for the URL means that it's referring to locations within the file itself.
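
For illustration, here's a minimal sketch (not part of this PR) of how a generic reader could pull the RFS and a chunk out of such a file without the lindi library, assuming the top-level header JSON is terminated by the zero-byte padding described above:

import json

def read_rfs(path):
    with open(path, 'rb') as f:
        # First request: grab the small top-level header. It is followed by
        # zero padding, so split on the first null byte and parse as JSON.
        header = json.loads(f.read(1024).split(b'\x00', 1)[0].decode('utf-8'))
        assert header['format'] == 'lindi1'
        # Second request: read the reference file system embedded at rfs_start.
        f.seek(header['rfs_start'])
        return json.loads(f.read(header['rfs_size']).decode('utf-8'))

def read_chunk(path, ref):
    # ref looks like [".", offset, size]; "." means the chunk lives in this same file.
    url, offset, size = ref
    assert url == '.'
    with open(path, 'rb') as f:
        f.seek(offset)
        return f.read(size)  # still compressed per data/.zarray (blosc here)

rfs = read_rfs('test.lindi')
raw = read_chunk('test.lindi', rfs['refs']['data/0.0'])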

So when it comes time to write a new chunk, it is appended to the end of the file, and the reference file system is updated to point to that new chunk. On each file flush (or when the file closes), the RFS is rewritten in the file and the top header is updated accordingly. But what if the RFS becomes too large and no longer fits in the pre-allocated padded space? Then a new space is allocated and appended at the end of the file, and the previous RFS is replaced by all zeros (to avoid confusion).

Datasets can be deleted, but it should be noted that there is no mechanism to actually free up that space in the file.
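
For concreteness, here is a rough sketch of the append/flush logic described above, using hypothetical helper names (not the actual implementation in this PR) and assuming a fixed 1024-byte header region at the top of the file:

import json

def append_chunk(f, rfs, key, chunk_bytes):
    # New chunks are appended at the end of the file and referenced
    # by [".", offset, size] entries in the RFS.
    f.seek(0, 2)
    offset = f.tell()
    f.write(chunk_bytes)
    rfs['refs'][key] = ['.', offset, len(chunk_bytes)]

def flush_rfs(f, header, rfs):
    rfs_bytes = json.dumps(rfs).encode('utf-8')
    available = header['rfs_size'] + header['rfs_padding']
    if len(rfs_bytes) <= available:
        # The updated RFS still fits in the pre-allocated padded region: rewrite in place.
        f.seek(header['rfs_start'])
        f.write(rfs_bytes.ljust(available, b'\x00'))
    else:
        # It no longer fits: zero out the old region (to avoid confusion) and
        # append a new, larger padded region at the end of the file.
        f.seek(header['rfs_start'])
        f.write(b'\x00' * available)
        f.seek(0, 2)
        header['rfs_start'] = f.tell()
        available = 2 * len(rfs_bytes)  # illustrative choice of new padded size
        f.write(rfs_bytes.ljust(available, b'\x00'))
    header['rfs_size'] = len(rfs_bytes)
    header['rfs_padding'] = available - len(rfs_bytes)
    # Rewrite the top header so readers can find the relocated/updated RFS.
    f.seek(0)
    f.write(json.dumps(header).encode('utf-8').ljust(1024, b'\x00'))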

I did explore other options before inventing this format, specifically tar, zip, and parquet. None of these met all the needed criteria.

Happy to get your feedback @rly

codecov-commenter commented 1 month ago

Codecov Report

Attention: Patch coverage is 67.25441% with 130 lines in your changes missing coverage. Please review.

Project coverage is 78.23%. Comparing base (731fcb5) to head (5501db9). Report is 9 commits behind head on main.

| Files | Patch % | Lines |
|-------|---------|-------|
| lindi/lindi1/_lindi1.py | 49.13% | 59 Missing :warning: |
| lindi/lindi1/Lindi1Store.py | 34.09% | 29 Missing :warning: |
| lindi/LindiH5pyFile/LindiH5pyFile.py | 66.66% | 20 Missing :warning: |
| lindi/LindiH5ZarrStore/LindiH5ZarrStore.py | 85.03% | 19 Missing :warning: |
| ...ndi/LindiH5pyFile/LindiReferenceFileSystemStore.py | 95.34% | 2 Missing :warning: |
| lindi/LindiH5ZarrStore/_util.py | 75.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #88      +/-   ##
==========================================
- Coverage   79.43%   78.23%   -1.21%
==========================================
  Files          30       33       +3
  Lines        2256     2600     +344
==========================================
+ Hits         1792     2034     +242
- Misses        464      566     +102
```

:umbrella: View full report in Codecov by Sentry.

oruebel commented 1 month ago
> Does not support custom codecs (AFAIK) whereas Zarr does.

HDF5 does support plugins (https://pypi.org/project/hdf5plugin/), but the plugins are implemented in C/C++.

> Does not allow referencing data chunks from other files

I think this is possible via HDF5 virtual datasets https://docs.hdfgroup.org/archive/support/HDF5/docNewFeatures/NewFeaturesVirtualDatasetDocs.html

> Happy to get your feedback

I'm wondering whether it would be easier to embed all this data in an HDF5 file as a container format, rather than using a fully custom binary. I.e., you'd have a dataset for the LINDI JSON and then any binary blocks could be stored as datasets. If the datasets are stored without chunking then they are still directly addressable via memory offset and readable via memmap. I.e., you wouldn't use all the fancy features of HDF5 (chunking, compression etc.) but you would have the advantage of still having a self-describing file, rather than a fully custom binary format.
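
For illustration (and not lindi's actual API), here is a rough sketch of that idea with h5py: store the LINDI JSON and the binary blobs as contiguous (unchunked, uncompressed) datasets, whose byte offsets within the file h5py can report, so a non-HDF5 reader could still seek or memmap to them directly:

import json
import numpy as np
import h5py

lindi_json = json.dumps({"refs": {".zgroup": {"zarr_format": 2}}})  # placeholder RFS
blob = np.arange(1000, dtype='f4').tobytes()

with h5py.File('container.h5', 'w') as f:
    # No chunking/compression -> contiguous layout, addressable by a plain byte offset
    f.create_dataset('lindi_json', data=np.frombuffer(lindi_json.encode('utf-8'), dtype='uint8'))
    f.create_dataset('blobs/blob_0', data=np.frombuffer(blob, dtype='uint8'))

with h5py.File('container.h5', 'r') as f:
    ds = f['blobs/blob_0']
    offset = ds.id.get_offset()  # byte offset of the contiguous dataset (None if chunked)
    size = ds.nbytes
    print(offset, size)

# A reader without the HDF5 library could then fetch the blob with a plain
# open('container.h5', 'rb'); seek(offset); read(size), or via np.memmap.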

magland commented 1 month ago

> I'm wondering whether it would be easier to embed all this data in an HDF5 file as a container format, rather than using a fully custom binary. I.e., you'd have a dataset for the LINDI JSON and then any binary blocks could be stored as datasets. If the datasets are stored without chunking then they are still directly addressable via memory offset and readable via memmap. I.e., you wouldn't use all the fancy features of HDF5 (chunking, compression etc.) but you would have the advantage of still having a self-describing file, rather than a fully custom binary format.

That's an interesting idea. The trick would be how to embed the RFS in the HDF5 file in such a way that it would be easy to extract without needing to use the HDF5 driver. Ironically, we only need to attach two integers to the HDF5 file somehow, giving the start and end byte of the RFS, and then we're off to the races. Let me think about how that might be done.

oruebel commented 1 month ago

> The trick would be how to embed the RFS in the HDF5 in such a way that it would be easy to extract out without needing to use the HDF5 driver.

According to the HDF5 spec: "The superblock may begin at certain predefined offsets within the HDF5 file, allowing a block of unspecified content for users to place additional information at the beginning (and end) of the HDF5 file without limiting the HDF5 Library’s ability to manage the objects within the file itself. This feature was designed to accommodate wrapping an HDF5 file in another file format or adding descriptive information to an HDF5 file without requiring the modification of the actual file’s information. The superblock is located by searching for the HDF5 format signature at byte offset 0, byte offset 512, and at successive locations in the file, each a multiple of two of the previous location; in other words, at these byte offsets: 0, 512, 1024, 2048, and so on." So you should be able to place additional information (like byte offsets) at the beginning of the file and still use HDF5.

https://docs.hdfgroup.org/hdf5/v1_14/_f_m_t3.html

magland commented 1 month ago

Cool. Now I'm reading through the rest of that document and I don't see where to put said unspecified data. Are you able to make sense of it?

oruebel commented 1 month ago

> Cool. Now I'm reading through the rest of that document and I don't see where to put said unspecified data. Are you able to make sense of it?

h5py does expose the user block directly via the API (https://docs.h5py.org/en/stable/high/file.html#user-block). ChatGPT gives nice instructions on how to do this with h5py.

The following text was generated by ChatGPT.

Creating and modifying the user block of an HDF5 file using h5py involves several steps. The user block is a reserved space at the beginning of an HDF5 file that can be used for storing user-defined data. This block is typically used for embedding non-HDF5 metadata, such as file headers for other formats.

Here are the steps to create and modify the user block in an HDF5 file using h5py:

Step 1: Creating an HDF5 File with a User Block

When creating a new HDF5 file, you can specify the size of the user block using the userblock_size parameter in the h5py.File constructor.

import h5py

# Specify the size of the user block (must be a power of 2 and at least 512 bytes)
userblock_size = 512

# Create a new HDF5 file with the specified user block size
with h5py.File('example.h5', 'w', userblock_size=userblock_size) as f:
    # Create some datasets or groups if needed
    f.create_dataset('dataset', data=[1, 2, 3])

print("HDF5 file with user block created.")

Step 2: Writing to the User Block

To write data to the user block, you need to open the file in binary mode and write directly to the beginning of the file.

# Open the file in binary mode to write to the user block
with open('example.h5', 'r+b') as f:
    # Write some data to the user block
    user_block_data = b'This is some user block data.'
    f.write(user_block_data)

print("Data written to the user block.")

Step 3: Reading from the User Block

To read data from the user block, you again open the file in binary mode and read the desired number of bytes from the beginning of the file.

# Open the file in binary mode to read from the user block
userblock_size = 512  # same size used when the file was created
with open('example.h5', 'rb') as f:
    # Read the data from the user block
    user_block_data = f.read(userblock_size)

print("Data read from the user block:", user_block_data.decode('utf-8'))

Step 4: Modifying the User Block

To modify the user block, you can overwrite the desired portion of the user block by seeking to the appropriate position and writing the new data.

# Open the file in binary mode to modify the user block
with open('example.h5', 'r+b') as f:
    # Seek to the beginning of the user block
    f.seek(0)
    # Write new data to the user block
    new_user_block_data = b'Updated user block data.'
    f.write(new_user_block_data)

print("User block data modified.")

Important Considerations

  1. User Block Size: The size of the user block must be a power of 2 (e.g., 512, 1024, 2048) and at least 512 bytes.
  2. Data Overlap: Be cautious not to overwrite the HDF5 metadata or datasets when writing to the user block.
  3. File Mode: Always open the file in binary mode ('rb' or 'r+b') when reading or writing raw bytes to/from the user block.

By following these steps, you can create and modify the user block of an HDF5 file using h5py.

magland commented 1 month ago

Thanks @oruebel

There's another disadvantage of HDF5: it's difficult to do parallel writes.

I'm working on a second possible solution #89

magland commented 1 month ago

Closing in favor of #89