HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Support fixed-length strings with UTF-8 character set #270

Closed by mattjala 8 months ago

mattjala commented 8 months ago

HSDS currently does not support fixed-length strings with the UTF-8 character set (see hdf5dtype.py:617).

jreadey commented 8 months ago

Are they supported in the library?

ajelenak commented 8 months ago

Yes. A string's encoding and the number of bytes reserved for its storage are decoupled.
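A minimal sketch of that decoupling in plain Python. This is not HDF5 or HSDS code: `pack_fixed` and `unpack_fixed` are hypothetical helpers, and null padding is assumed as the way to fill unused bytes. The field size is chosen independently of the character set used to fill it.

```python
def pack_fixed(text: str, nbytes: int, encoding: str = "utf-8") -> bytes:
    """Encode text and null-pad it to exactly nbytes bytes (hypothetical helper)."""
    raw = text.encode(encoding)
    if len(raw) > nbytes:
        raise ValueError(f"{text!r} needs {len(raw)} bytes, field holds {nbytes}")
    # Assumption: unused trailing bytes are filled with nulls.
    return raw.ljust(nbytes, b"\x00")


def unpack_fixed(field: bytes, encoding: str = "utf-8") -> str:
    """Strip trailing null padding and decode what remains (hypothetical helper)."""
    return field.rstrip(b"\x00").decode(encoding)


# The same 8-byte field size, filled using two different character sets.
ascii_field = pack_fixed("hello", 8, "ascii")
utf8_field = pack_fixed("héllo", 8, "utf-8")
```

Both fields occupy 8 bytes on disk. Only the interpretation of those bytes differs, which is why the size and the encoding can be specified separately.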

mattjala commented 8 months ago

> Are they supported in the library?

Yep, see here for an example of fixed-length Unicode strings being used in datasets and attributes. The native VOL passes both of these tests.

ajelenak commented 8 months ago

The question may have more to do with how h5py treats HDF5 strings, where this combination is not really supported: any fixed-length string is treated as a bytes object, not a Unicode string.

jreadey commented 8 months ago

A fixed-width Unicode encoding would be UTF-32, but as @ajelenak says, it's not explicitly supported by the library (or HSDS).

mattjala commented 8 months ago

> A fixed-width Unicode encoding would be UTF-32, but as @ajelenak says, it's not explicitly supported by the library (or HSDS).

I think there's some confusion in terminology here. The request is not for a Unicode encoding in which every character occupies a fixed number of bytes (e.g. UTF-32). It is for string datatypes that have a fixed total length in bytes (fixed-length strings) and also use the UTF-8 character set, in which the number of bytes per character varies.
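To illustrate the distinction, here is a small plain-Python sketch. The helper names `encoded_sizes` and `fits_utf8_field` are made up for this example and are not part of HDF5, h5py, or HSDS. It compares character counts with byte counts, then checks whether a string fits a fixed 8-byte UTF-8 field.

```python
def encoded_sizes(text: str) -> dict:
    """Report character count and encoded size in UTF-8 and UTF-32."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-32-le" avoids the 4-byte byte-order mark that "utf-32" prepends.
        "utf32_bytes": len(text.encode("utf-32-le")),
    }


def fits_utf8_field(text: str, nbytes: int) -> bool:
    """True if text's UTF-8 encoding fits in a field of nbytes bytes."""
    return len(text.encode("utf-8")) <= nbytes


ascii_sizes = encoded_sizes("abcd")   # 1 byte per character in UTF-8
cjk_sizes = encoded_sizes("日本語")    # 3 bytes per character in UTF-8

# A fixed 8-byte field holds 8 ASCII characters but not 3 CJK characters.
eight_ascii_fits = fits_utf8_field("abcdefgh", 8)
three_cjk_fits = fits_utf8_field("日本語", 8)
```

In UTF-32 every character takes 4 bytes, so the byte length follows directly from the character count. In a fixed-length UTF-8 string only the total byte length is fixed, and the number of characters that fit depends on the content.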

I've updated the title of this issue to make that clearer. The library does support fixed-length UTF-8 strings (see the tests I linked above).

mattjala commented 8 months ago

Implemented in #278