Non-ASCII DST header is not working: 'utf-8' codec can't decode byte 0xe4 in position 3: invalid continuation byte - Githubissues

EmbroidePy / pyembroidery

pyembroidery library for reading and writing a variety of embroidery formats.

MIT License

181 stars 33 forks source link

Non-ASCII DST header is not working: 'utf-8' codec can't decode byte 0xe4 in position 3: invalid continuation byte #83

Closed andreymal closed 4 years ago

andreymal commented 4 years ago

See this file: https://andreymal.org/files/emb/дизайн 1.DST

It was created using russian version of Tajima DG/ML by Pulse 14.

pyembroidery can't read this file:

>>> pyembroidery.read_dst('дизайн 1.DST')
  File "pyembroidery/DstReader.py", line 58, in dst_read_header
    header_string = header.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 3: invalid continuation byte

The header has a non-ASCII string "дизайн 1" that is actually encoded using ANSI encoding (Windows-1251 for Russia):

>>> print(header.decode('windows-1251'))
'LA:дизайн 1        \rST:   3294\rCO:  0\r+X:  298\r …

Could you provide a way to set a custom header encoding for the read_dst function?

tatarize commented 4 years ago

I am unsure of the actual functionality of unicode/windows-code-page headers. Other formatting can take a series of bytes for a single character, however the header limit is firmly limited at 512 bytes. Likewise each subsection with the DST is often quite also limited. I am not sure which embroidery machines would accept this and which would throw a fit.

That said, the robustness principle clearly applies. Be liberal with what you accept and conservative with what you produce. There's no ambiguity as to what the intention of that file actually is and thus any failure to read it should be classified a defect.

And while I would not produce such files, the mere existence of alternative formatting should not preclude reading of the file itself. It would seem the expected behavior should be to strip non-ascii metadata values (since I could not guarantee their effectiveness in all machines) but read all the data that can be read from the rest of the file.

I'll correct this error in short order.

tatarize commented 4 years ago

While this might still have some issues with the unicode filename handling in Python 2. My direct tests in Python 3 show it as corrected. #84

tatarize commented 4 years ago

This is corrected in 1.4.11 and pushed to the various branches.

tatarize commented 4 years ago

@andreymal thank you for bringing this to my attention. I believe this make for two solid bugs you've caught. Thanks again.

andreymal commented 4 years ago

I am not sure which embroidery machines would accept this and which would throw a fit.

There is a photo from Tajima TFMX-C1501 embroidery machine. The encoding is obviously incorrect, but it seems it's working (click to enlarge)

tatarize commented 4 years ago

Depending on how its implemented the files should work or not work. That's what that cyrillic text encodes as if treated as ascii. This machine seems like it accepts it, but I cannot vouchsafe all machines unless only ascii is used. Stripping the text seems like the safest action, which is what was done. While it is entirely possible to read that text in alternative encodings it would require knowing what those encodings are before hand, which I don't. So it seems like this example reinforces the path I took here. Try to read it in ascii, if I can't, then ignore the header information (none of it is critical anyhow).

andreymal commented 4 years ago

Stripping the text seems like the safest action

It might be better to strip only text (LA), but read other ASCII-only headers (ST, CO etc). However, it’s really not critical, and personally I do not need them; this issue is resolved for me, thanks :)

tatarize commented 4 years ago

While that defect is very minor, it is a defect. I am losing information I could otherwise preserve. The defect does actually violate some of the philosophy of maximal data preservation.

I'll make a comment in the code but hold off on fixing it. And leave this open until I've gotten around to it.

Clearly I'd have to seek through the elements in binary, then try to convert each binary block into encoded text. With some various error catches there if that parsing fails. I'm not quite in the mood to instantly move on preserving data within a corrupt header that will then just be thrown away because literally it's always better to rebuild that information rather than trust it.

tatarize commented 4 years ago

Fixed the minor remaining issue. #88.