HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

hsload fails decoding ASCII encoded attributes #135

Closed mahoromax closed 1 year ago

mahoromax commented 1 year ago

When using hsload with a file (created from Matlab btw.) errors occur as soon as an ASCII-encoded (cset=H5T_CSET_ASCII) attribute with a special character is processed.

I do realize that this is probably a fault with the h5 library that created this file, not specifying UTF8 or an ASCII Extension (latin-1) here, but just pressing this into the file and pretending it is ASCII.

When I try to upload a file like this with hsload, the operation stops with the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte

Following the stack traceback I can see that in attrs.py at line 107, decoding is called as utf-8 without any check of which character set is actually used.

In the long run, all character sets supported by HDF5 could be handled, or a converter could be used.

At the very least, I'd hope for some workaround that prevents the full command from failing.
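A minimal sketch of the kind of guarded decode being asked for (the function name is hypothetical; this is not the actual h5pyd code path, just an illustration of the fallback idea):

```python
def decode_attr(raw: bytes) -> str:
    """Try UTF-8 first; fall back to latin-1 so the load never fails.

    latin-1 maps every byte 0-255 to a code point, so the fallback
    always succeeds (though it may be the wrong guess for the text).
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

print(decode_attr(b"\xb0C"))        # the 0xb0 (degree sign) byte that broke hsload
print(decode_attr(b"caf\xc3\xa9"))  # valid UTF-8 passes through unchanged
```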

jreadey commented 1 year ago

Thanks for reporting this. Sounds like it should be fairly easy to add the check. Could you attach a sample HDF5 file which fails like this?

ajelenak commented 1 year ago

Hi @mahoromax,

Just to clarify:

I do realize that this is probably a fault with the h5 library that created this file, not specifying UTF8 or an ASCII Extension (latin-1) here, but just pressing this into the file and pretending it is ASCII.

It's not a fault of the HDF5 library but of the application that used it. Perhaps MATLAB has a special way of stating string encoding somewhere else in their HDF5 files.

Perhaps use of the chardet or cchardet package can help here in guessing the correct 8-bit encoding like latin-1 once UTF-8 turns out to be a wrong assumption.
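A sketch of that idea (assuming the chardet package is available; the snippet falls back to latin-1 if it is not installed):

```python
raw = b"Verdr\xe4ngerpumpen"  # latin-1 bytes; not valid UTF-8

try:
    import chardet
    # chardet.detect returns e.g. {'encoding': 'ISO-8859-1', 'confidence': ...}
    encoding = chardet.detect(raw)["encoding"] or "latin-1"
except ImportError:
    encoding = "latin-1"  # reasonable default for Western European text

print(raw.decode(encoding))
```

Note that detection is only a guess, and short byte strings give chardet little to work with, so a low-confidence result may still need a manual override.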

mahoromax commented 1 year ago

Here are two (cut-down) examples: the first (_transmission) fails because of a ° symbol and the second (130_100) because of an 'ä' character. I had to zip them so GitHub would accept them here. h5.zip

I think I used "h5 library" wrongly before; I meant the wrappers implemented in Matlab. I'd have to investigate which versions of Matlab were used; it could be different versions, though.

jreadey commented 1 year ago

@mahoromax - thanks for the sample files. Your sample is a bit problematic as it has an attribute value that can't be decoded as a utf-8 string. In h5pyd, attribute values get encoded as strings to be sent in REST requests. On the HSDS side, attribute values get stored as JSON data (which doesn't allow arbitrary byte data).

I'm working on a fix that will escape the problematic characters and still allow the client to get back the same bytes as were originally sent. Should have something to try out soon.

jreadey commented 1 year ago

I have a fix checked into master now. This update will cause hsload to use escape codes for any problematic attributes. On reading you should get the original byte value back. E.g.:


$ hsload --loglevel warning 130_10000_trimmed.h5 /home/john/
WARNING 2023-01-24 19:43:38,706 utillib.py:324 byte value for attribute project in /run_01 is not utf8 encodable - using surrogateescaping

$ hsls --showattrs -r /home/john/130_10000_trimmed.h5
/ Group
/run_01 Group
   attr: author                   b'mmetzger'
   attr: comment                  Empty(dtype=dtype('S1'))
   attr: description              Empty(dtype=dtype('S1'))
   attr: kkn_CLASS                b'MSMTRUN'
   attr: kkn_MSMTRUN_VERSION      b'1.0'
   attr: msmt_type                b'kennlinie'
   attr: oil_ID                   b'T22-ST-001'
   attr: oil_type                 b'Shell Tellus 22'
   attr: pmanager                 b'cschaenzle'
   attr: project                  b'AIF - Experimentelle Validierung eines typenunabh\xe4ngigen Wirkungsgradmodells von Verdr\xe4ngerpumpen'
   attr: pump_manufacturer        b'Brinkmann'
   attr: testrig_name             b'hydraulic_large'
   attr: timestamp_created        b'2018-12-19T14:33:10'
/run_01/parameters Group
/run_01/pipelines Group
/run_01/unit_under_test Group
/run_01/unit_under_test/geometry Group
/run_01/unit_under_test/geometry/diameter_ds Dataset {1, 1}
   attr: description              b'diameter of drive spindle'
   attr: kkn_CLASS                b'PARAMETER'
   attr: kkn_PARAMETER_VERSION    b'1.0'
   attr: origin                   b'data sheet'
   attr: units                    b'millimeter'
   attr: variable                 b'diameter'

Let me know if this works ok for you with the un-trimmed files.

mahoromax commented 1 year ago

I'll test this within the next days.

But I'm still a bit confused as to why these letters/symbols (ä/°) cannot be UTF-8 encoded; aren't they part of the Unicode character set?

What usually causes issues is documents saved/encoded in ISO 8859-1 being interpreted as UTF-8. So what should ideally happen is a mapping from the original encoding (though how can it know which encoding was used for the text?) to utf-8?

jreadey commented 1 year ago

The problem is that the data is not decodable as utf-8. HDF5 supports ascii and utf-8 encoding (though the library doesn't actually verify that the encoding is legit), whereas the data here is encoded as latin1.

Here's a python session that illustrates the problem:

>>> import h5py

>>> f = h5py.File("130_10000_trimmed.h5")

>>> grp = f["run_01"]

>>> data = grp.attrs["project"]

>>> data  # returns a bytes object
b'AIF - Experimentelle Validierung eines typenunabh\xe4ngigen Wirkungsgradmodells von Verdr\xe4ngerpumpen'

>>> data.decode("ascii")  # not valid ascii (chars > 128)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 49: ordinal not in range(128)

>>> data.decode("utf-8")  # not valid utf8
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 49: invalid continuation byte

>>> data.decode("latin1")  # is valid latin1
'AIF - Experimentelle Validierung eines typenunabhängigen Wirkungsgradmodells von Verdrängerpumpen'

>>> s = data.decode("latin1")  # can decode as latin1 and then encode as utf8

>>> s.encode("utf-8")
b'AIF - Experimentelle Validierung eines typenunabh\xc3\xa4ngigen Wirkungsgradmodells von Verdr\xc3\xa4ngerpumpen'

As @ajelenak suggests, there are some utilities to guess the encoding, but I thought it was less problematic to use Python surrogate escaping mechanism that works like this:

>>> data.decode("utf-8", errors="surrogateescape")
'AIF - Experimentelle Validierung eines typenunabh\udce4ngigen Wirkungsgradmodells von Verdr\udce4ngerpumpen'

When read back with h5pyd you'll get the same bytes data as with h5py. Hopefully the user will have the proper context to deal with it.
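The round trip can be checked directly (using the same attribute bytes as above):

```python
raw = b"typenunabh\xe4ngigen"  # latin-1 bytes, not valid UTF-8

# Undecodable bytes become lone surrogate code points instead of raising
text = raw.decode("utf-8", errors="surrogateescape")
print(repr(text))  # 'typenunabh\udce4ngigen'

# Encoding with the same handler restores the original bytes exactly
assert text.encode("utf-8", errors="surrogateescape") == raw
```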

mahoromax commented 1 year ago

We didn't really notice this as an issue, because the HDF5Viewer is handling the files without any signs of problems.

In the middle of a word this might not be a big issue, but for the degree symbol (°) or others I'm not thinking of right now, we might need to re-encode them eventually... Either re-encode all strings that Matlab encoded as ascii, or scan for attributes that cause issues and re-encode just those...

jreadey commented 1 year ago

I think it wouldn't be too hard to create a script that iterates through all attributes and converts any problematic strings to utf-8 (decode using latin-1 followed by an encode with utf-8).

jreadey commented 1 year ago

The above solution is implemented in h5pyd version 0.14.0 (using "surrogateescape"; see https://docs.python.org/3/howto/unicode.html).