BlueBrain / HighFive

HighFive - Header-only C++ HDF5 interface
https://bluebrain.github.io/HighFive/
Boost Software License 1.0

Review H5Easy "extend/part" API. #1018

Open 1uc opened 3 months ago

1uc commented 3 months ago

In H5Easy there's API for reading and writing one element at a time: https://github.com/BlueBrain/HighFive/blob/5f3ded67b4a9928f4b9b5f691bc0a60aade32232/include/highfive/h5easy_bits/H5Easy_scalar.hpp#L66-L70

https://github.com/BlueBrain/HighFive/blob/5f3ded67b4a9928f4b9b5f691bc0a60aade32232/include/highfive/h5easy_bits/H5Easy_scalar.hpp#L120-L122

It does this by creating a dataset that is extendible in all directions, and automatically growing it whenever the index of the written element falls outside the current extent. (This negates our ability to spot off-by-one programming errors.)

The API for reading/writing one element at a time feels like it would tempt users into writing files that way in a loop, which is a rather serious performance issue on common HPC hardware (and not great on consumer hardware either).

To enable this API it must pick a default chunk size, currently 10^n elements for an n-dimensional dataset. That seems very small and risks creating files that can't be read efficiently. Picking it reasonably large might inflate the size of the file by a factor of 100 or more.

I think it might be fine to allow users to read and write single elements of an existing dataset, i.e. without the automatic growing, together with a warning in the documentation not to use it in a loop. In core we support various selection APIs that are reasonably compact: lists of arbitrary points, regular (and general) hyperslabs, and there's a proposal to allow Cartesian products of simple selections along each axis.