HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

add support for fixed width UTF8 strings - #270 #278

Closed · jreadey closed 8 months ago

jreadey commented 8 months ago

Update for UTF8 fixed width strings

mattjala commented 8 months ago

This doesn't seem to work with UTF-8 strings that are sent as binary from the REST VOL. I'll try to set up a test for this case in Python.

mattjala commented 8 months ago

It seems that HSDS treats the string's length in characters as its datatype size. When a binary request comes in, the same string's length in bytes is then seen as too large for the datatype (if any of the UTF-8 characters is multi-byte). HDF5 treats the size field as a length in bytes, so HSDS should probably do the same.
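For illustration, a minimal Python sketch (not HSDS code) of how the two lengths diverge as soon as a multi-byte character appears:

```python
s = "résumé"           # 6 characters
b = s.encode("utf-8")  # each "é" occupies 2 bytes in UTF-8
print(len(s))          # 6 -- length in characters
print(len(b))          # 8 -- length in bytes, which is what HDF5's size field counts
```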

mattjala commented 8 months ago

Handling requests to write fixed-length UTF-8 strings in binary instead of JSON is problematic because of how numpy stores Unicode strings.

When a client makes a binary write request, HSDS attempts to read the binary buffer into a numpy array with np.fromstring(), using a numpy datatype constructed with createDataType(). For a fixed-length UTF-8 string datatype, the constructed numpy datatype is <UXX, where XX is the length of the string datatype. Numpy stores Unicode strings in UTF-32, where every character occupies four bytes, so it expects the buffer given to np.fromstring() to be (about) four times larger than its UTF-8 encoding, and the call fails.
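A minimal sketch of the mismatch (using np.frombuffer(), the non-deprecated equivalent of np.fromstring(); the size 5 here stands in for the XX produced by createDataType()):

```python
import numpy as np

buf = "héllo".encode("utf-8")  # 5 characters, 6 bytes of UTF-8 ("é" is 2 bytes)
dt = np.dtype("<U5")           # numpy stores 5 chars as UTF-32: itemsize is 5 * 4 = 20
print(dt.itemsize)             # 20
np.frombuffer(buf, dtype=dt)   # ValueError: buffer size must be a multiple of element size
```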

Encoding the given UTF-8 binary to UTF-32 doesn't preserve the size, so the fixed-length UTF-8 strings no longer have a uniform length in bytes. This prevents np.fromstring() from being used to parse the strings into elements of a single fixed-length datatype.
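For example, two strings that share the same fixed UTF-8 size re-encode to different UTF-32 sizes:

```python
for buf in (b"abcd", "éé".encode("utf-8")):  # both exactly 4 bytes of UTF-8
    utf32 = buf.decode("utf-8").encode("utf-32-le")
    print(len(utf32))  # 16 for "abcd" (4 chars), 8 for "éé" (2 chars)
```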

Creating a numpy Unicode string datatype whose size is one fourth the byte length of the client's UTF-8 bytestring (so that numpy's internal datatype size matches the bytestring's actual size) allows the np.fromstring() call to complete, but it produces a numpy array of malformed UTF-32 strings that raises an error whenever an element is accessed.
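A sketch of that failure mode (again with np.frombuffer() standing in for np.fromstring()):

```python
import numpy as np

buf = "éé".encode("utf-8")          # 4 UTF-8 bytes
dt = np.dtype("<U1")                # itemsize is 4, matching the buffer size exactly
arr = np.frombuffer(buf, dtype=dt)  # succeeds, but the bytes are not valid UTF-32
arr[0]  # raises: the reassembled code point is outside the Unicode range
```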

This doesn't come up when writing the strings as JSON, since moving the data into the correct shape is handled by jsonToArray in that case.

I'll create a PR with tests to illustrate this issue, though I'm not sure how to resolve it at the moment.

jreadey commented 8 months ago

I've added @mattjala's binary request tests and fixed some issues with UTF-8 encoding...

mattjala commented 8 months ago

Running these tests in a fresh environment made them pass. It seems one of my dependencies was outdated, which was changing the specifics of the encoding. Once the attribute binary test is in, this should be good to merge.

jreadey commented 8 months ago

@mattjala - take a look at the revised PR!