GraphBLAS / binsparse-specification

A cross-platform binary storage format for sparse data, particularly sparse matrices.
https://graphblas.org/binsparse-specification/
BSD 3-Clause "New" or "Revised" License

Problems with HDF5 as a cross-platform binary data container #9

Open BenBrock opened 1 year ago

BenBrock commented 1 year ago

I'm creating this to record a few issues I'm encountering using HDF5 in C++. It's possible some of these issues are fixed in the Python bindings or by NetCDF, but they seem to remain as issues in C/C++.

1) A significant amount of work is (likely) going to be required to make HDF5 deal with endianness properly in all cases. HDF5 has two kinds of types: standard types, which are fully defined with a certain bit width, endianness, etc., and native types, which are platform-defined. STD_I32LE, a two's complement 32-bit signed integer in little-endian format, is one such standard type, and NATIVE_INT is a native type (corresponding to a C/C++ int). When we store an array in an HDF5 dataset, we must pick a native type that corresponds to the in-memory representation and a standard type that corresponds to what we want stored on disk. When we read, we just give a native type corresponding to the read buffer, since the in-file format is already set in stone. For integer types this works well, since for variants of int we can easily test whether we have an int32_t, uint32_t, etc. However, for C++ types like std::size_t, std::wchar_t, and even char, which are typically equivalent to, but distinct from, some fixed-width integer type, we (the implementers) have to pick the mapping ourselves. I've picked some defaults that work well on Intel processors, and likely on most modern systems, but personally I would like HDF5 to handle the endianness issue for me here.
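
For concreteness, here is a minimal sketch of the write-side pattern described above, using the HDF5 C API (the dataset name and helper function are illustrative):

#include <hdf5.h>

// Write a buffer of native ints to a dataset stored as little-endian int32.
// HDF5 converts between the in-memory (native) type and the on-disk
// (standard) type during H5Dwrite.
void write_ints(hid_t file, const int* buf, hsize_t n) {
    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t dset  = H5Dcreate(file, "values", H5T_STD_I32LE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
    H5Dclose(dset);
    H5Sclose(space);
}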

2) No (or at least poor) support for storing UTF-8 text in datasets.

3) No support for bit arrays.

4) For user-defined types, it certainly seems like some work would be required on the user's part, as there is an MPI-like specification for user-defined data types (sketched below).
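
To illustrate point 4: registering a user-defined type in HDF5 is done field by field via H5Tcreate/H5Tinsert, conceptually similar to MPI_Type_create_struct. A minimal sketch (the struct and field names are made up for the example):

#include <hdf5.h>

struct record {
    double value;
    int    index;
};

// Build an HDF5 compound type describing `record`, one field at a time.
hid_t make_record_type() {
    hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(record));
    H5Tinsert(t, "value", HOFFSET(record, value), H5T_NATIVE_DOUBLE);
    H5Tinsert(t, "index", HOFFSET(record, index), H5T_NATIVE_INT);
    return t;
}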

gheber commented 1 year ago

Apologies for my ignorance, but can we get a definition or list of requirements for a "cross-platform binary data container"?

steven-varga commented 1 year ago

Did you have a chance to take a look at H5CPP? I'd also recommend the lightning talk and these slides.

In my understanding, with the exception of 3) "No support for bit arrays", all of these questions have been addressed. Bit arrays aren't particularly useful when working with BLAS/LAPACK; if I am mistaken, please let me know. As for "I would like HDF5 to handle the endianness issue for me here": in my understanding and personal experience, this is indeed what the HDF5 format does; it hides the platform-specific details, including endianness.

BenBrock commented 1 year ago

Hi @gheber and @steven-varga, thanks for taking an interest. I have not taken a look at H5CPP, but had been trying to use the HDF5 C++ bindings, which are perhaps in large part responsible for my woes.

What we need is to take an array and read/write it to/from disk in a cross-platform way. My primary pain point with HDF5 was having to pick an HDF5 type when copying to/from arrays. My feeling was that I'm likely to make a mistake and pick the wrong type for some platforms.

For example, if I'm copying an array into a dataset, I need to select an HDF5 type for the new dataset. As far as I can tell, there's no built-in mechanism in HDF5 to get an HDF5 type for a particular C++ type, so I wrote one of my own. However, this feels very error-prone: for example, it's unclear which HDF5 native type to pick for size_t, whose size is implementation-defined. Types like short and wchar_t have similar problems, in that I'm not sure how to select the correct HDF5 type in a cross-platform way.
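
For example, a hand-rolled mapping might look like the sketch below; the helper is my own, and the size_t branch is exactly the kind of guess I mean:

#include <hdf5.h>
#include <cstdint>
#include <type_traits>

// Sketch of a hand-rolled C++-type -> HDF5 native-type mapping. The
// fixed-width cases are easy; size_t is the judgment call, since its size
// is implementation-defined and it may or may not alias one of the types
// handled above it.
template <typename T>
hid_t native_type() {
    if constexpr (std::is_same_v<T, std::int32_t>)       return H5T_NATIVE_INT32;
    else if constexpr (std::is_same_v<T, std::uint32_t>) return H5T_NATIVE_UINT32;
    else if constexpr (std::is_same_v<T, std::int64_t>)  return H5T_NATIVE_INT64;
    else if constexpr (std::is_same_v<T, std::uint64_t>) return H5T_NATIVE_UINT64;
    else if constexpr (std::is_same_v<T, float>)         return H5T_NATIVE_FLOAT;
    else if constexpr (std::is_same_v<T, double>)        return H5T_NATIVE_DOUBLE;
    else if constexpr (std::is_same_v<T, std::size_t>)   // distinct type on some platforms
        return sizeof(std::size_t) == 8 ? H5T_NATIVE_UINT64 : H5T_NATIVE_UINT32;
    else
        static_assert(sizeof(T) == 0, "no HDF5 type mapping known for T");
}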

@steven-varga , maybe you can shed some light on how this is handled in H5CPP?

steven-varga commented 1 year ago

In the official version you have the option of:

T := ([unsigned] ( int8_t | int16_t | int32_t | int64_t )) | ( float | double  )
S := T | c/c++ struct | std::string
ref := std::vector<S> 
    | arma::Row<T> | arma::Col<T> | arma::Mat<T> | arma::Cube<T> 
    | Eigen::Matrix<T,Dynamic,Dynamic> | Eigen::Matrix<T,Dynamic,1> | Eigen::Matrix<T,1,Dynamic>
    | Eigen::Array<T,Dynamic,Dynamic>  | Eigen::Array<T,Dynamic,1>  | Eigen::Array<T,1,Dynamic>
    | blaze::DynamicVector<T,rowVector> |  blaze::DynamicVector<T,colVector>
    | blaze::DynamicVector<T,blaze::rowVector> |  blaze::DynamicVector<T,blaze::colVector>
    | blaze::DynamicMatrix<T,blaze::rowMajor>  |  blaze::DynamicMatrix<T,blaze::colMajor>
    | itpp::Mat<T> | itpp::Vec<T>
    | blitz::Array<T,1> | blitz::Array<T,2> | blitz::Array<T,3>
    | dlib::Matrix<T>   | dlib::Vector<T,1> 
    | ublas::matrix<T>  | ublas::vector<T>
ptr     := T* 
accept  := ref | ptr 
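
For example (a sketch based on the published H5CPP examples; the file and dataset names are arbitrary), the element type drives the HDF5 type selection through templates:

#include <h5cpp/all>
#include <vector>

int main() {
    std::vector<double> v(100, 1.0);
    // RAII file handle; the HDF5 datatype is deduced from the element type
    h5::fd_t fd = h5::create("example.h5", H5F_ACC_TRUNC);
    h5::write(fd, "dataset", v);
}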

Here is a link to examples; unfortunately the website has been down for a while because of an unrelated computer security event. When my schedule allows, I will restore/resume development.

Looking at your example: it is a good start; however, there is a lot more to getting it right, and in my experience the result should look structurally different from yours. If you are interested in the details: thanks to The HDF Group, I spoke about this at various events a few years ago, and you may be able to find the links on the HDF Group C++ mailing lists; unfortunately, that material is also on the currently downed website.

Please use the supported utf8_t instead of wchar_t, then manage your conversions at a higher level with methods supported by C++. Yes, in this case minimalism pays off, a lot...
You don't need to 'copy' POD; instead, do the I/O directly from the memory location. Here is an example from typed memory, one with std::vector<T>, and this comma-separated-value parser walks you through how to handle a stream of records.

Most of the functionality of the HDF5 C API is implemented/supported, it works with MPI, it is suitable for financial/trading systems, and many labs around the world have been using it successfully.

derobins commented 1 year ago

Sorry if I'm late to the party, but why do you think HDF5 doesn't handle the byte order for you? When you create a dataset, you do, in fact, have to specify the way you want to store the data, but you can then use H5T_NATIVE_INT, etc. when you call H5Dread() to munge the data into and out of whatever your buffer datatype is. When you do this, HDF5 will handle the type conversion for you, including BE/LE byte swapping.
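
The read side of that pattern, concretely (dataset name illustrative):

#include <hdf5.h>

// The dataset was created on disk as H5T_STD_I32LE; reading with
// H5T_NATIVE_INT makes HDF5 convert on the fly, byte-swapping on
// big-endian hosts.
void read_ints(hid_t file, int* buf) {
    hid_t dset = H5Dopen(file, "values", H5P_DEFAULT);
    H5Dread(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
    H5Dclose(dset);
}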

What HDF5 will not do for you, however, is guess what is going to be an efficient datatype for you. You have to decide if you want a BE or LE datatype, for example. Given the low number of BE systems these days, I'd probably go with LE so you don't waste time munging bytes every time you perform I/O on LE systems. Also, C and C++'s original type system is vague on purpose, so you'll have to figure out what is appropriate for storing, say, long integers, that will work across the systems you support. You can, of course, specify the native type as the dataset type when you create the dataset, but that would potentially make your HDF5 files differ across platforms.

HDF5 also isn't going to guess what's a great equivalent for system-dependent types like size_t, either. By design those are system dependent, so like the platform-dependent legacy C/C++ integer types, no container will be perfect for all systems.

This shouldn't be too onerous, though: most systems you'll find in the real world will be LE systems, so BE will be less of a concern unless you are on SPARC or POWER. Most of the legacy integer types are the same across all platforms aside from long, which differs between Windows (LLP64) and everything else (LP64), so for that you can use H5T_STD_(U|I)64LE, which will work for any system. If you are concerned with storing system-y stuff like size_t, picking H5T_STD_U64LE would be appropriate, since I don't know of any realistic plans for 128-bit address spaces.
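
A sketch of that last recommendation (the helper and dataset names are illustrative): widen size_t values to uint64_t in memory and store them as H5T_STD_U64LE on disk:

#include <hdf5.h>
#include <cstdint>
#include <vector>

// Store size_t values as fixed 64-bit little-endian unsigned ints on disk;
// widening to uint64_t in memory keeps the native side unambiguous as well.
void write_sizes(hid_t file, const std::size_t* sizes, hsize_t n) {
    std::vector<std::uint64_t> buf(sizes, sizes + n);
    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t dset  = H5Dcreate(file, "sizes", H5T_STD_U64LE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_UINT64, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf.data());
    H5Dclose(dset);
    H5Sclose(space);
}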