costerwi / dwdatareader

Python module to interact with Dewesoft DWDataReaderLib shared library
MIT License

Default encoding is UTF-8? #64

Open InvncibiltyCloak opened 10 months ago

InvncibiltyCloak commented 10 months ago

First off, thanks for the great Dewesoft reader library. I recently used it on my data files, which are in DXD format and were created on a Windows x64, en-US machine.

The units contained some Unicode characters, such as the degree symbol and the ohm sign. When I imported the files with this library, the units showed the classic Å symbol, which is the giveaway of UTF-8 bytes being decoded as if they were in a Windows code page (it looks like you have ISO-8859-1 chosen).
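To illustrate the symptom, here is a minimal sketch in plain Python that does not touch the library at all. The helper name misdecode is made up for this example; it only reproduces what happens when UTF-8 bytes are decoded with ISO-8859-1.

```python
def misdecode(text: str, wrong_encoding: str = "iso-8859-1") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


if __name__ == "__main__":
    for unit in ("°C", "Ω"):
        garbled = misdecode(unit)
        # Re-encoding with the wrong codec and decoding as UTF-8 undoes the damage.
        repaired = garbled.encode("iso-8859-1").decode("utf-8")
        print(repr(unit), "->", repr(garbled), "->", repr(repaired))
```

Each non-ASCII character becomes two characters, the first of which is an accented capital letter. Because ISO-8859-1 maps every byte to a character, the damage is reversible as long as the garbled string has not been altered.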

A quick peek into the Python code showed that this is extremely easy to fix in this library: just set dwdatareader.encoding = 'utf-8' and it returns correctly decoded strings.

I just wanted to file an issue to bring up the fact that it appears that DewesoftX is encoding strings in UTF-8 and perhaps this library should change the default encoding to match?

Unfortunately, I am only a sample size of one and have not tested other locales or versions of Dewesoft, so I am not sure whether this default encoding applies everywhere. Thanks for your time!

costerwi commented 9 months ago

Thanks for your comments! I'm glad you found it easy to override the encoding.

I cannot find the encoding documented anywhere. The default was set to ISO-8859-1 a long time ago, probably due to an observation like yours. It may have evolved since then. The fact that your Windows machine seems to be recording in UTF-8 is a good reason to change the assumed default to UTF-8.

fleimgruber commented 1 week ago

Thanks @InvncibiltyCloak for bringing this up. Changing the default encoding to UTF-8 seems reasonable. One consideration though would be to give users the option to explicitly set encodings to maintain backwards compatibility with other encodings, e.g. ISO-8859-1, in older files and with older DEWE stacks?
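One way to think about backwards compatibility, as a sketch only: try UTF-8 first and fall back to ISO-8859-1 when the bytes are not valid UTF-8. The function decode_with_fallback below is hypothetical and not part of dwdatareader; it works on plain bytes objects.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "iso-8859-1")) -> str:
    """Return raw decoded with the first encoding that accepts it."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


if __name__ == "__main__":
    print(decode_with_fallback("°C".encode("utf-8")))       # UTF-8 file
    print(decode_with_fallback("°C".encode("iso-8859-1")))  # older file
```

ISO-8859-1 accepts every byte sequence, so it only makes sense as the last candidate. The approach is a heuristic: any stray invalid byte in otherwise valid UTF-8 data would silently push the whole string to the fallback.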

costerwi commented 1 week ago

I never had a good example to test the encoding, so it is intentionally very easy for the user to specify:

import dwdatareader as dw
dw.encoding = 'utf-8'

Unfortunately, the Dewesoft library sometimes appends junk characters to the end of strings, which cause UTF-8 decoding errors in Python and make the tests fail. If we change the default to UTF-8, then we need to either ask Dewesoft to fix their library or have Python ignore these decoding errors.
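As a rough sketch of the second option, Python's bytes.decode accepts an errors argument. The helper decode_lenient below is hypothetical, not the library's code, and the junk bytes are invented for illustration.

```python
def decode_lenient(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw, dropping any bytes that are invalid in the encoding."""
    # Keep only what precedes the first NUL, in case the buffer holds
    # leftover data after the C string terminator.
    text = raw.split(b"\x00", 1)[0]
    return text.decode(encoding, errors="ignore")


if __name__ == "__main__":
    sample = "Ω".encode("utf-8") + b"\xff\xfe"  # made-up trailing junk
    print(repr(decode_lenient(sample)))
```

Using errors="replace" instead of errors="ignore" would keep a visible U+FFFD marker wherever bytes were dropped, which may be preferable when debugging. Neither setting removes junk bytes that happen to be valid in the chosen encoding.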

fleimgruber commented 1 week ago

Ah, I should have been more specific. I saw this global option, but wondered whether the 10 or so usages of it should all use the same encoding, e.g. opening the file in

https://github.com/costerwi/dwdatareader/blob/e579a23739e08db9a42ed67fb66341fcb51722dd/dwdatareader/__init__.py#L388

vs decoding text values e.g. in

https://github.com/costerwi/dwdatareader/blob/e579a23739e08db9a42ed67fb66341fcb51722dd/dwdatareader/__init__.py#L88

But that was only a guess on my part, without any evidence of different encodings actually occurring.

> Unfortunately, the Dewesoft library sometimes appends junk characters to the end of strings, which cause UTF-8 decoding errors in Python and make the tests fail. If we change the default to UTF-8, then we need to either ask Dewesoft to fix their library or have Python ignore these decoding errors.

That sounds annoying. I would guess that the junk characters result from the C library interpreting parts of memory as string data when it should not, i.e. a string length mismatch at that level?
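To make that guess concrete, here is a small ctypes sketch that simulates it without the Dewesoft library. The buffer contents are invented: a NUL-terminated UTF-8 string followed by leftover bytes. Whether the real library behaves this way is an assumption.

```python
import ctypes

# Simulated output buffer: "°C" in UTF-8, a NUL terminator, then junk
# standing in for uninitialised memory.
payload = "°C".encode("utf-8") + b"\x00"
buf = ctypes.create_string_buffer(8)
buf.raw = payload + b"\xff" * (len(buf) - len(payload))

# .value stops at the first NUL; .raw returns the whole buffer.
terminated = buf.value
whole = buf.raw

if __name__ == "__main__":
    print(repr(terminated.decode("utf-8")))
    try:
        whole.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("decoding the full buffer fails:", exc)
```

If the junk comes from reading past the terminator, taking .value (or slicing at the first NUL) before decoding would avoid the errors without needing errors="ignore".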