Closed jreadey closed 8 months ago
This doesn't seem to work with UTF-8 strings that are sent as binary from the REST VOL. I'll try to get a test set up for this case in python.
It seems that HSDS treats the length-in-characters of the string as its datatype size. Then when a binary request comes in, the length-in-bytes of the same string is seen as being too large for the datatype (if one of the UTF-8 characters is multi-byte). HDF5 considers the size field to be length-in-bytes, so HSDS should probably do the same.
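To make the character/byte mismatch concrete, here is a minimal sketch in plain Python. The sample value is hypothetical and is not taken from the failing request; any string with a multi-byte character behaves the same way.

```python
# Hypothetical sample value containing one multi-byte character ("é").
value = "héllo"

char_len = len(value)                  # length in characters (code points)
byte_len = len(value.encode("utf-8"))  # length in bytes once UTF-8 encoded

# A datatype sized by char_len is one byte too small for the encoded value.
print(char_len, byte_len)
```

If the datatype size is taken from the character count, the UTF-8 payload of this value no longer fits, which matches the "too large for the datatype" behaviour described above.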
Handling requests to write fixed-length UTF8 strings in binary instead of JSON is problematic with how numpy stores unicode strings.
When a client makes a binary write request, HSDS attempts to read the binary buffer into a numpy array with np.fromstring() using a numpy datatype that is constructed with createDataType(). In the case of a fixed-length UTF8 string datatype, the constructed numpy datatype is <UXX, where XX is the length of the string datatype. Numpy uses the UTF-32 encoding where each character is always four bytes, so it expects the string given to np.fromstring() to be (about) four times larger than its UTF8 encoding in bytes is, preventing the call from succeeding.
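The size mismatch can be reproduced outside HSDS. This is only a sketch: it uses np.frombuffer() as a stand-in for np.fromstring() (whose binary mode is deprecated in recent numpy releases), and it builds the <UXX dtype by hand rather than through createDataType(). The sample string is hypothetical.

```python
import numpy as np

n_chars = 5                      # hypothetical fixed string length
dt = np.dtype(f"<U{n_chars}")    # numpy unicode dtype, stored as UTF-32
itemsize = dt.itemsize           # 4 bytes per character

utf8_buf = "héllo".encode("utf-8")  # what a binary client would send

try:
    np.frombuffer(utf8_buf, dtype=dt)
    parsed = True
except ValueError:
    # The buffer length is not a multiple of the dtype's element size.
    parsed = False

print(itemsize, len(utf8_buf), parsed)
```

The dtype wants 20 bytes per element while the client supplied 6, so the buffer cannot be interpreted as an array of that dtype.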
Encoding the given UTF-8 binary to UTF-32 doesn't preserve the size, so the fixed-length UTF-8 strings will no longer have a uniform length in bytes. This prevents np.fromstring() from being used to parse the strings into elements of a single fixed-length datatype.
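A small illustration of why re-encoding breaks the fixed width, using two hypothetical values that occupy the same number of bytes in UTF-8:

```python
# Two hypothetical values that both fill a 6-byte fixed-length UTF-8 field.
values = ["héllo", "hello!"]

utf8_sizes = [len(v.encode("utf-8")) for v in values]
utf32_sizes = [len(v.encode("utf-32-le")) for v in values]  # no BOM

print(utf8_sizes)   # uniform widths in UTF-8
print(utf32_sizes)  # widths differ after re-encoding
```

Because UTF-32 size depends on the character count rather than the byte count, elements that were uniform in UTF-8 end up with different widths and cannot be packed into one fixed-length dtype.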
Creating a numpy unicode string datatype with a size that is one fourth the byte-length of the client's UTF-8 bytestring (so that numpy's internal datatype size matches the bytestring's actual size) allows the np.fromstring() call to complete, but results in a numpy array with malformed UTF-32 strings that throws an error whenever you attempt to access an element from it.
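The malformed result can be inspected without touching the string elements themselves. In this sketch (again using np.frombuffer() as a stand-in for np.fromstring(), with a hypothetical 8-byte payload), the same bytes are viewed once as <U2 and once as raw 32-bit integers:

```python
import numpy as np

MAX_CODE_POINT = 0x10FFFF  # largest valid Unicode code point

utf8_buf = "héllo!!".encode("utf-8")   # hypothetical payload, 8 bytes
quarter = len(utf8_buf) // 4           # dtype length = byte length / 4

# The call completes because the sizes now match (2 chars * 4 bytes = 8).
arr = np.frombuffer(utf8_buf, dtype=f"<U{quarter}")

# View the same bytes as little-endian uint32 to see what numpy will
# try to interpret as UTF-32 code points.
code_points = [int(cp) for cp in np.frombuffer(utf8_buf, dtype="<u4")]
out_of_range = [cp > MAX_CODE_POINT for cp in code_points]

print(arr.shape, [hex(cp) for cp in code_points], out_of_range)
```

Each group of four UTF-8 bytes gets read as a single code point far outside the Unicode range, which is consistent with the error seen on element access.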
This doesn't come up when writing the strings as JSON, since moving the data into the correct shape is handled by jsonToArray in that case.
I'll create a PR with tests to illustrate this issue, though I'm not sure how to resolve it at the moment.
I've added @mattjala's binary request tests and fixed some issues with UTF8 encoding...
Running these tests with a fresh environment caused them to pass. It seems that one of my dependencies was outdated, and that was changing the specifics of the encoding. Once the attribute binary test is in, this should be good to merge.
@mattjala - take a look at the revised PR!
Update for UTF8 fixed width strings