MIT-LCP / wfdb-python

Native Python WFDB package
MIT License
738 stars 300 forks source link

When reading edf, decode all strings using 'iso8859-1'. #429

Closed Ivorforce closed 1 year ago

Ivorforce commented 1 year ago

The edf specifications notes all strings being 'ascii'.

By default, decode() uses the 'utf-8' encoding. In some cases, this can cause decoding issues:

File ~/Library/Caches/pypoetry/virtualenvs/workspace-lukas-kum-5zRav6QH-py3.9/lib/python3.9/site-packages/wfdb/io/convert/edf.py:262, in read_edf(record_name, pn_dir, header_only, verbose, rdedfann_flag)
    259 physical_dims = []
    260 for _ in range(n_sig):
    261     physical_dims.append(
--> 262         struct.unpack("<8s", edf_file.read(8))[0].decode().strip()
    263     )
    264 if verbose:
    265     print("Physical Dimensions: {}".format(physical_dims))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 0: invalid start byte

Arguably, the encoding ascii should be used. However, in my case, this didn't completely solve the problem:

File ~/Library/Caches/pypoetry/virtualenvs/workspace-lukas-kum-5zRav6QH-py3.9/lib/python3.9/site-packages/wfdb/io/convert/edf.py:262, in read_edf(record_name, pn_dir, header_only, verbose, rdedfann_flag)
    259 physical_dims = []
    260 for _ in range(n_sig):
    261     physical_dims.append(
--> 262         struct.unpack("<8s", edf_file.read(8))[0].decode("ascii").strip()
    263     )
    264 if verbose:
    265     print("Physical Dimensions: {}".format(physical_dims))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: ordinal not in range(128)

Using the admittedly peculiar unicode_escape encoding solved the problem for me. Using the verbose printing mode, the culprit can be seen (µ):

Physical Dimensions: ['', 'mV', 'mV', 'mV', 'mV', 'mV', 'mV.s', 'µV', 'mV', 'mV.s', 'µV', '', 's', '', '', 'V']

The file in question was produced by LabView. Unfortunately, I cannot share it. If need be, it might be possible to reproduce the error otherwise.

bemoody commented 1 year ago

Is decode("unicode_escape") somehow different from decode("iso8859-1")? It looks to me like it does the same thing.

I think if you find a random EDF file, it's more likely to be encoded in ISO-8859-1 (or Windows-1252) than UTF-8, but it'd be nice for the wfdb package to be "locale-neutral".

Could we add 'encoding' and maybe 'errors' arguments to read_edf, to make this explicit? Maybe with ISO-8859-1 or Windows-1252 as default.

Ivorforce commented 1 year ago

Yep, seems like iso8859-1 works too. I admit i blindly copied that encoding name from some Stackoverflow issue - eager to move forward - but reading up on it being python specific, it really doesn't make sense to use it.

I like your suggestion to make encoding a parameter. I'm updating the PR accordingly.

bemoody commented 1 year ago

It looks good to me. Thanks! I don't know why the test is failing :/

bemoody commented 1 year ago

Oh wait, it looks like you need to add the 'encoding' parameter to rdedfann() as well.

Ivorforce commented 1 year ago

Right, makes sense. Fixed the issue.

bemoody commented 1 year ago

Looks great!