clbarnes / jeiss-convert

Convert Jeiss .dat files to HDF5
MIT License

Header as an HDF5 compound datatype #2

Open · mkitti opened this issue 2 years ago

mkitti commented 2 years ago

HDF5 has the capability to store a dataset with a compound datatype, which is analogous to a C struct.

https://portal.hdfgroup.org/display/HDF5/Datatype+Basics#DatatypeBasics-compound
https://api.h5py.org/h5t.html#compound-types

It may also be possible to construct this from a NumPy record array. I suspect it may be easier to build it with the low-level API, starting from the CSV files that you created.
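For concreteness, here is a minimal sketch of what that could look like using h5py's high-level API and a NumPy structured dtype; the field names and values below are made up, and the real names, offsets and dtypes would come from the per-version spec files:

```python
import numpy as np
import h5py

# Hypothetical subset of header fields, for illustration only.
header_dtype = np.dtype([
    ("FileMagicNum", ">u4"),
    ("FileVersion", ">u2"),
    ("PixelSize", ">f8"),
])

# In practice the record would be parsed straight from the .dat header bytes,
# e.g. np.frombuffer(header_bytes, dtype=header_dtype, count=1).
record = np.array([(0, 8, 4.0)], dtype=header_dtype)  # dummy values

with h5py.File("example.h5", "w") as f:
    # h5py maps the structured dtype onto an HDF5 compound datatype.
    f.create_dataset("header", data=record)
```

The low-level h5t route would build the same compound type member by member instead.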

mkitti commented 2 years ago

h5py then provides the ability to read individual fields directly. https://docs.h5py.org/en/stable/high/dataset.html?highlight=compound#reading-writing-data
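Continuing the sketch above (hypothetical file and field names), an individual field can be pulled out of the compound dataset like this:

```python
import h5py

with h5py.File("example.h5", "r") as f:
    ds = f["header"]
    print(ds.dtype.names)           # all field names in the compound type
    version = ds["FileVersion"][0]  # read an individual field
```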

clbarnes commented 2 years ago

Presumably this would end up very close to Davis' approach here? https://github.com/janelia-cosem/fibsem-tools/blob/f4bedbfc4ff81ec1b83282908ba6702baf98c734/src/fibsem_tools/io/fibsem.py#L81

It's smart and probably a better representation of what's going on, but is this kind of access standard across common HDF5 implementations? The HDF5 spec is colossal so it wouldn't surprise me if many APIs only cover a subset of its functionality; in that case I'd prefer to target that common subset of "basic" features rather than go deep into the HDF5 spec to find something which is technically allowed but not available to many users.

mkitti commented 2 years ago

I was thinking of this as a way to encode the jeiss-convert tsv files as a datatype in HDF5 itself. In the worst-case scenario, one could always use H5Dread to read the raw bytes, giving uint8 as the memory type, which is the status quo.
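To illustrate that fallback (not necessarily how jeiss-convert would do it): if the compound dtype reproduces the header layout exactly (offsets and itemsize), a consumer that doesn't care about the fields can still recover the original byte stream, roughly equivalent in effect to an H5Dread with a uint8 memory type:

```python
import h5py

with h5py.File("example.h5", "r") as f:
    record = f["header"][()]  # structured array with the compound dtype

# If the dtype's offsets and itemsize match the on-disk header layout,
# this buffer is the original header bytes.
header_bytes = record.tobytes()
```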

Many packages support compound datatypes. Perhaps the most common use of a compound datatype is for complex numbers.

- Java: https://bitbucket.hdfgroup.org/pages/HDFFV/hdf5doc/master/browse/html/javadoc/index.html?hdf/hdf5lib/H5.html
- MATLAB: https://www.mathworks.com/help/matlab/import_export/import-hdf5-files.html
- Julia: https://juliaio.github.io/HDF5.jl/stable/#Supported-data-types
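As a quick check of the complex-number case mentioned above (assuming h5py's default mapping, which stores complex values as a compound type with real and imaginary fields):

```python
import numpy as np
import h5py

with h5py.File("complex_example.h5", "w") as f:
    f.create_dataset("z", data=np.array([1 + 2j, 3 + 4j]))

with h5py.File("complex_example.h5", "r") as f:
    # The on-disk datatype is compound, even though h5py reads it
    # back as numpy complex128.
    assert f["z"].id.get_type().get_class() == h5py.h5t.COMPOUND
```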

mkitti commented 2 years ago

JHDF5, which is currently used by the Java tools BigDataViewer and SciJava (FIJI, etc.), has a compound datatype reader here: https://svnsis.ethz.ch/doc/hdf5/hdf5-19.04/ch/systemsx/cisd/hdf5/IHDF5CompoundReader.html

mkitti commented 2 years ago

@clbarnes, let me know if you have time to chat for a few minutes. One concern about embracing HDF5 for this is that we're not sure it works for everyone at Cambridge. Albert in particular seemed to prefer text-based attributes via JSON or similar.

clbarnes commented 2 years ago

I actually have a fork which writes to zarr, which is exactly that: a JSON file for the metadata, plus an npy-esque binary dump (which can be chunked). Zarr is getting a lot of attention, but the spec is anticipated to change some time soon, in a way which will make it less convenient for this sort of thing.

I'm flexible for the rest of the week if we can figure out time differences! I'm in BST.

mkitti commented 2 years ago

Yes, I participated in the discussion on the Zarr shard specification that should be part of v3: https://github.com/zarr-developers/zarr-python/pull/876#issuecomment-985831774

It looks like an HDF5 file with an extra linear dataset could also be a Zarr shard.

Extracting that indexing from HDF5 should be quite fast if we use H5Dchunk_iter (currently in HDF5 1.13) or the h5ls command line utility: https://docs.hdfgroup.org/hdf5/develop/group___h5_d.html#gac482c2386aa3aea4c44730a627a7adb8
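A sketch of pulling that chunk index out with h5py's low-level wrappers (the file and dataset names are made up, and it assumes a chunked dataset; H5Dchunk_iter, exposed in newer h5py as DatasetID.chunk_iter, does the same in a single pass):

```python
import h5py

with h5py.File("converted.h5", "r") as f:
    dsid = f["data"].id  # a chunked dataset
    # Wraps H5Dget_num_chunks / H5Dget_chunk_info.
    for i in range(dsid.get_num_chunks()):
        info = dsid.get_chunk_info(i)
        print(info.chunk_offset, info.byte_offset, info.size)
```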

Another extreme is https://github.com/HDFGroup/hdf5-json

Nonetheless, once we have the data in one standard format, I do not mind investing in tooling to move between standard formats or using something like kerchunk. The best part is that tooling may already exist.

clbarnes commented 2 years ago

I have an implementation of this with a convenient Mapping wrapper, which round-trips correctly through bytes. What I'm trying to figure out now is where it fits with the rest of the program as it currently stands: if the header is written into the HDF5 as this compound dtype array, do we still want to encode the same metadata as attributes, which is the more HDF5-y way to do it? That duplication concerns me a bit. If not, then we've made the attributes a bit more awkward to access. Is having today's header encoded byte-for-byte in the HDF5 file a goal in its own right?
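(The actual implementation isn't shown in this thread; a Mapping wrapper of that kind might look something like the following sketch, with all names hypothetical:)

```python
from collections.abc import Mapping

import numpy as np


class HeaderView(Mapping):
    """Hypothetical read-only mapping over a parsed .dat header."""

    def __init__(self, record: np.ndarray):
        self._record = record

    @classmethod
    def from_bytes(cls, b: bytes, dtype: np.dtype) -> "HeaderView":
        return cls(np.frombuffer(b, dtype=dtype, count=1))

    def to_bytes(self) -> bytes:
        # Identical to the original buffer if the dtype's itemsize
        # covers the whole header.
        return self._record.tobytes()

    def __getitem__(self, key):
        return self._record[key][0]

    def __iter__(self):
        return iter(self._record.dtype.names)

    def __len__(self):
        return len(self._record.dtype.names)
```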

It also gets more complicated to add the zarr/n5 implementations, which don't support compound dtypes (to my knowledge). In those cases, you'd need to serialise the metadata as attributes anyway (which, again, is more convenient for downstream users). I'm not entirely convinced zarr/n5 support is a good way to go in any case: keeping everything contained in the same file and having a single supported workflow from proprietary to open format is of benefit, and given that these files will almost certainly require post-processing, downstream users can write to other formats at that stage if they want.

mkitti commented 2 years ago

> Is having today's header encoded byte-for-byte in the HDF5 file a goal in its own right?

This was a stated goal of the last round, to help ensure round-trip durability. Originally it was just going to be an opaque datatype or byte stream, but I realized that we may be able to do better with the compound datatype. We do not want to depend on someone bumping the version number, or on the accuracy of the reader's table of offsets and types, in order to preserve the header.

One option might be to save the 1 KB header as a separate file for reference. For Zarr this might just be an opaque block of bytes. N5 has an N5-HDF5 backend that may be able to take advantage of the compound datatype.

clbarnes commented 2 years ago

My current implementation does store the raw header as well as the exploded metadata values, without using the compound dtype. For HDF5, there is a u8 attribute "_header" (as well as "_footer"); for the N5 and Zarr implementations in the PR, these are hex-encoded strings (I'm open to using base64 as well). The tests round-trip from the exploded values, rather than relying on the byte dump.
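As a rough sketch of those two schemes (the file, group and attribute layout here is illustrative, not necessarily what the PR does):

```python
import numpy as np
import h5py
import zarr

header_bytes = b"\x00" * 1024  # placeholder for the real header bytes

# HDF5: raw header as a uint8 attribute (the same applies to "_footer").
with h5py.File("converted.h5", "a") as f:
    f.attrs["_header"] = np.frombuffer(header_bytes, dtype=np.uint8)

# Zarr/N5: attributes are JSON, so encode the bytes as a hex string.
root = zarr.open_group("converted.zarr", mode="a")
root.attrs["_header"] = header_bytes.hex()
assert bytes.fromhex(root.attrs["_header"]) == header_bytes
```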

The compound dtype is just calculated from the table of offsets and dtypes, so it isn't any more robust in that respect. I don't think there's a better way to do it which doesn't just duplicate the information and introduce a new source of error. The reader doesn't need to explicitly state the version, as it's read from the metadata, and (in my implementation anyway) will fail if the version's spec isn't known.