Closed andreymal closed 4 years ago
I am unsure of the actual functionality of unicode/windows-code-page headers. Other formatting can take a series of bytes for a single character, however the header limit is firmly limited at 512 bytes. Likewise each subsection with the DST is often quite also limited. I am not sure which embroidery machines would accept this and which would throw a fit.
That said, the robustness principle clearly applies. Be liberal with what you accept and conservative with what you produce. There's no ambiguity as to what the intention of that file actually is and thus any failure to read it should be classified a defect.
And while I would not produce such files, the mere existence of alternative formatting should not preclude reading of the file itself. It would seem the expected behavior should be to strip non-ascii metadata values (since I could not guarantee their effectiveness in all machines) but read all the data that can be read from the rest of the file.
I'll correct this error in short order.
While this might still have some issues with the unicode filename handling in Python 2. My direct tests in Python 3 show it as corrected. #84
This is corrected in 1.4.11 and pushed to the various branches.
@andreymal thank you for bringing this to my attention. I believe this make for two solid bugs you've caught. Thanks again.
Depending on how its implemented the files should work or not work. That's what that cyrillic text encodes as if treated as ascii. This machine seems like it accepts it, but I cannot vouchsafe all machines unless only ascii is used. Stripping the text seems like the safest action, which is what was done. While it is entirely possible to read that text in alternative encodings it would require knowing what those encodings are before hand, which I don't. So it seems like this example reinforces the path I took here. Try to read it in ascii, if I can't, then ignore the header information (none of it is critical anyhow).
Stripping the text seems like the safest action
It might be better to strip only text (LA), but read other ASCII-only headers (ST, CO etc). However, it’s really not critical, and personally I do not need them; this issue is resolved for me, thanks :)
While that defect is very minor, it is a defect. I am losing information I could otherwise preserve. The defect does actually violate some of the philosophy of maximal data preservation.
I'll make a comment in the code but hold off on fixing it. And leave this open until I've gotten around to it.
Clearly I'd have to seek through the elements in binary, then try to convert each binary block into encoded text. With some various error catches there if that parsing fails. I'm not quite in the mood to instantly move on preserving data within a corrupt header that will then just be thrown away because literally it's always better to rebuild that information rather than trust it.
Fixed the minor remaining issue. #88.
See this file: https://andreymal.org/files/emb/дизайн 1.DST
It was created using russian version of Tajima DG/ML by Pulse 14.
pyembroidery can't read this file:
The header has a non-ASCII string "дизайн 1" that is actually encoded using ANSI encoding (Windows-1251 for Russia):
Could you provide a way to set a custom header encoding for the
read_dst
function?