H5DataSet read and write H5std_string functions are likely to get called accidentally

calvertdw commented 1 year ago

The H5DataSet read and write functions have options for doing so with either a void* or a H5std_string.

We ran into an issue with a Java binding for HDF5, documented here.

The issue was that in trying to read and write bytes to an H5DataSet, we were accidentally calling the H5std_string versions without knowing it and the read function in that case will terminate the read on any 0 bytes.

It took a while to figure out why we couldn't read and write bytes when other types (ints, floats, etc) worked fine. Our team spent several days scratching our heads.

I was wondering if the same issue might apply to all users of HDF5, who might be wishing to read char* with 0s in it and getting unexpected results.

Once possible solution would be to rename the H5std_string versions to readString and writeString. What do you guys think?

Thanks!

mkitti commented 1 year ago

Just to clarify, you are using the JavaCpp org.bytedeco bindings which are automated from the C++ bindings rather than the hdf.hdf5lib bindings, right?

https://docs.hdfgroup.org/hdf5/develop/group___j_h5_d.html

bmribler commented 1 year ago

I'll check about the H5DataSet read and write

calvertdw commented 1 year ago

Yes that is correct @mkitti, I am referring to and JavaCPP binds to the C++ API.

bmribler commented 1 year ago

Just to clarify, you are using the JavaCpp org.bytedeco bindings which are automated from the C++ bindings rather than the hdf.hdf5lib bindings, right?

https://docs.hdfgroup.org/hdf5/develop/group___j_h5_d.html

Thank you!

mkitti commented 1 year ago

Yes that is correct @mkitti, I am referring to and JavaCPP binds to the C++ API.

Isn't this a bug for JavaCPP then?

calvertdw commented 1 year ago

@mkitti Well I originally reported this issue there and @saudet recommended I pursue the name change of the read/write string functions upstream. The 0 termination behavior is very different and I think would warrant the name change. Even better if the functions were documented as such.

https://github.com/bytedeco/javacpp-presets/issues/1311

mkitti commented 1 year ago

I agree the 0 termination issue is strange if you were using a Java API such as the one that HDF Group provides.

However, you are using an automatically generated Java API from a C/C++ API where null terminated strings are normal. Your requested fix is to change the C/C++ API.

Wouldn't it make sense to use the canonical Java API instead to resolve this?

mkitti commented 1 year ago

@calvertdw , I think you what you want is hdf.hdf5lib.H5.H5Dread_string: https://docs.hdfgroup.org/hdf5/develop/group___j_h5_d.html#gaca7b980e81fc5d46ae316c0b3a1fa31e https://mkitti.github.io/janelia/javacpp-hdf5/1.14.0-1.5.9-SNAPSHOT/javadoc/hdf/hdf5lib/H5.html#H5Dread_string(long,long,long,long,long,java.lang.String%5B%5D)

mkitti commented 1 year ago

Specifcally the JNI function NewStringUTF https://github.com/HDFGroup/hdf5/blob/248045e3e0e26a81ec634682c7267f8e23d0ffd8/java/src/jni/h5dImp.c#L1033 would use the modified UTF-8 encoding documented here: https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/io/DataInput.html#modified-utf-8 avoiding the internal null bytes.

calvertdw commented 1 year ago

Wouldn't it make sense to use the canonical Java API instead to resolve this?

The JNI that HDF5 provides isn't sufficient for our large team and codebase. The Bytedeco JavaCPP library provides many crucial benefits like portability and seamless use of many AI related C/C++ libraries. It is available on Maven Central with prebuilt binaries for virtually every platform included in the artifacts.

I would consider HDF5 to be a very "enterprise" level library, such that the biggest users are likely to be organizations which would value the features that Bytedeco provides.

With that said, is there some reason for not changing the names of the string functions to readString and writeString? It seems if you wanted to read/write char* with HDF5 which includes 0s, users are going to struggle to find why their data is terminated early on the read. HDF5 is for storing data, and an array of chars including 0 is just data. Currently, users would have to cast to void* to use to correct functions and HDF5 leaves every C++ user to fend for themselves on that one.

mkitti commented 1 year ago

The JNI that HDF5 provides isn't sufficient for our large team and codebase. The Bytedeco JavaCPP library provides many crucial benefits like portability and seamless use of many AI related C/C++ libraries. It is available on Maven Central with prebuilt binaries for virtually every platform included in the artifacts.

I would consider HDF5 to be a very "enterprise" level library, such that the biggest users are likely to be organizations which would value the features that Bytedeco provides.

As of https://github.com/bytedeco/javacpp-presets/pull/1327 you can use hdf.hdf5lib.* via the JavaCPP distribution as well. This will probably be released along with JavaCPP 1.5.9.

For now you can use the instructions at http://bytedeco.org/builds/ to pull the snapshots via maven or ask saudet when the release will occur: https://oss.sonatype.org/content/repositories/snapshots/org/bytedeco/hdf5/1.14.0-1.5.9-SNAPSHOT/

With that said, is there some reason for not changing the names of the string functions to readString and writeString? It seems if you wanted to read/write char* with HDF5 which includes 0s, users are going to struggle to find why their data is terminated early on the read. HDF5 is for storing data, and an array of chars including 0 is just data. Currently, users would have to cast to void* to use to correct functions and HDF5 leaves every C++ user to fend for themselves on that one.

It probably would have be an additional method rather than a name change in the near future as to not break existing applications.
The main focus of HDF Group is the C API (e.g. H5Dread which uses void *)
There are other actively developed C++ bindings such as @steven-varga's https://github.com/steven-varga/h5cpp or https://bluebrain.github.io/HighFive/
It's already been marked as low priority

There probably would be better traction in clearly stating at that reading a fixed length string into std::string should not occur via operator= (const char* s) but rather the two argument string (const char* s, size_t n) constructor with the second argument indicating the length explicitly.

https://github.com/HDFGroup/hdf5/blob/79bb60c3f6f67411e5d70b84743fc9f6b6143cbc/c%2B%2B/src/H5DataSet.cpp#L732

calvertdw commented 1 year ago

Thanks for the detailed reply.

The statements "It probably would have be [...] as to not break existing applications" and "The main focus of HDF Group is the C API" seem to conflict. If you aren't focusing on the C++ API anyway, why spend extra effort maintaining backwards compatibility?

an additional method rather than a name change

Well this wouldn't work, because the old string method having the same signature is the actual issue here.

Additionally, the fixed length string reader doesn't help in the situation when you don't know beforehand how long it is. I guess you could find that out using other methods, but that's laborious for no reason.

mkitti commented 1 year ago

To be clear, I am not HDF Group.

Well this wouldn't work, because the old string method having the same signature is the actual issue here.

I disagree. The problem is that the C++ API always makes the assumption that it is reading a null terminated C string even when the true length is known. Rather the C++ API should create a C++ std::string of the known length. In your case, JavaCPP can then translate the std::string of the correct length to a java.lang.String, properly. This is a bug in the C++ API and we should fix the C++ API. Then H5DataSet::read will just work correctly for everyone.

There are three ways to encode a string in HDF5 as detailed in following URL. https://docs.hdfgroup.org/hdf5/v1_14/_h5_t__u_g.html#subsubsec_datatype_other_strings

An array of characters of known length
A fixed length string of known length
A variable length null terminated string

Do you have a variable length string with one or more embedded nulls? If so, how do you know the actual length of the string?

mkitti commented 1 year ago

I filed https://github.com/HDFGroup/hdf5/issues/3034 in relation to this post. If we resolved that, then we should be able to read embedded null bytes into C++'s std::string. Then I presume JavaCPP should be able to read that into Java.

steven-varga commented 1 year ago

An interesting conversation: std::string is a template and indeed allows storing any data, HDF5 C API is a storage system with some constraints and it's fixed/variable length string datatype AFAIK can be represented within the std::basic_string C++ datatype, a superset of HDF5 H5_STRING datatype. The latter consists of a null terminated sequence of H5T_NATIVE_CHAR and the C definition of char is: A C-style string is an array of characters that is terminated by a null character ('\0').

Lib HDF5 defines strings as a sequence of ASCII or utf8 characters, neither of the standards allow null in the middle. Is there anything wrong with this interpretation?

<string> ::= H5T_STRING {
                 STRSIZE <strsize>;
                 STRPAD <strpad>;
                 CSET <cset>;
                 CTYPE <ctype>;
             }
<strsize> ::= <int_value>
<strpad> ::= H5T_STR_NULLTERM | H5T_STR_NULLPAD | H5T_STR_SPACEPAD
<cset> ::= H5T_CSET_ASCII | H5T_CSET_UTF8
<ctype> ::= H5T_C_S1 | H5T_FORTRAN_S1

mkitti commented 1 year ago

I believe it may be possible to have a fixed length H5T_FORTRAN_S1 that contains null bytes. I believe that should be able to be represented in C++ as std::string, a specialization of the std::basic_string template.

steven-varga commented 1 year ago

Some implementations allow non-conforming output, if I understood you right Fortran can insert a null in the middle of a sequence of bytes (which are not UTF8 or ASCII strings); but other conforming libraries will ignore the data after the /0 so this feature will result in hard to detect non-conforming behaviour -- which of course could be a use case for hiding some information.

Yes, you are correct to say that std::basic_string<T> is not limited to uft8 or ascii, but the inverse doesn't work ie: an arbitrary byte sequence is not a valid H5_STRING type -- given that my interpretation is correct (which is AFAIK matching with h5py ).

OTOH: Opaque is a good candidate to save binary content, and personally this is what I do. Binary to string encoding such as base64 and variants are alternatives, but will look unusual.

Probably this could make a good topic for @gheber Call the Doctor ?

steven-varga commented 1 year ago

@mkitti convinced me to reconsider, and from then on I can't un-see this: <ctype> ::= H5T_C_S1 | H5T_FORTRAN_S1 The second term H5T_FORTRAN_S1 indeed allows 0-255 values as characters, how they should behave printed on the terminal is not the concern here, but how the sequence is read in from storage: must be the entire length as describe by the Fortran spec.

Since C++ std::basic_string<T> supports this, when the H5T_FORTRAN_S1 set, the entire length should be read in, and stored within the string starting from: std::basic_string<T>::data()

HDFGroup / hdf5

H5DataSet read and write H5std_string functions are likely to get called accidentally #2501