HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0
129 stars 53 forks source link

Vlen sequence of variable-length UTF-8 strings cannot be written #341

Closed mattjala closed 6 months ago

mattjala commented 6 months ago

Currently, testPutVLenUTF8 only checks that writing a sequence of UTF-8 strings returns 200. Adding a check to read back the value reveals that it's written incorrectly. Additionally, the write itself should be reported as a failure because _doHyperslabWrite fails, but doesn't raise an error to PUT_Chunk.

Addendum to testPutVLenUTF8 to elicit the error:

    # read values from dataset
    req = self.endpoint + "/datasets/" + dset_uuid + "/value"
    rsp = self.session.get(req, headers=headers)
    self.assertEqual(rsp.status_code, 200)
    rspJson = json.loads(rsp.text)
    self.assertTrue("value" in rspJson)
    value = rspJson["value"]
    self.assertEqual(value, data)

From the DN logs, the root cause of the failure is the frombuffer call in bytesToArray: ERROR> bytesToArray threw ValueError: e_buffer: b'one: \xe4\xb8\x80', dtype: object

mattjala commented 6 months ago

From further testing, this error doesn't have anything to do with UTF-8 encoding. It happens when attempting to write a sequence of variable-length strings of any type, because HSDS's current schema for sequences (4 byte length following by the rest of the data) has no way to delineate non-fixed length strings within that sequence.

jreadey commented 6 months ago

Doesn't testPutVLenCompoundBinary in vlen_test.py have vlen and fixed types in the same compound type?

mattjala commented 6 months ago

Doesn't testPutVLenCompoundBinary in vlen_test.py have vlen and fixed types in the same compound type?

Yes, but that test doesn't cover a variable-length sequence of variable length strings.

That said, one of the fields in that test is an H5T_ARRAY that contains two variable-length strings, which seems like it would have run into this same issue - how does it know where the first vlen string stops and the second vlen string starts? I'll take a look at that and see if whatever is done there could apply to this case.

jreadey commented 6 months ago

But currently vlen data isn't supported with H5T_ARRAY - see https://github.com/HDFGroup/hsds/blob/master/hsds/util/hdf5dtype.py, line 663

jreadey commented 6 months ago

It might be easiest to use msgpack going forward. I noodled around with it a bit today. See: https://gist.github.com/jreadey/acafbc048c13ed245beae22619c3ac5c

mattjala commented 6 months ago

But currently vlen data isn't supported with H5T_ARRAY - see https://github.com/HDFGroup/hsds/blob/master/hsds/util/hdf5dtype.py, line 663

H5T_ARRAY doesn't allow its base type to be a variable length sequence (type of class H5T_VLEN), but it is allowed to be a variable length string (string with length H5T_VARIABLE) - see https://github.com/HDFGroup/hsds/blob/ccd63c262575114fd69b1d330626b4f50a705dbe/tests/integ/vlen_test.py#L606

mattjala commented 6 months ago

This behavior is now properly tested as an expected failure (since vlen sequences of vlen data aren't supported) by testPutVlenVlenError