alexdej / puzpy

Python library for reading and writing Across Lite crossword puzzle .puz files.
MIT License
112 stars 32 forks

UTF-8 characters in puzzles not readable #30

Open amclauth opened 7 months ago

amclauth commented 7 months ago

First off, good work! I found an "edge" case with the ISO-8859-1 encoding reads. Some puzzles have multi-byte characters in the clues, which is probably even more common in other languages. https://ivan.mclauthlin.com/test/23-11-13-atl.puz is a small example. Clues 4 and 6 have emoji in them that look like they're UTF-8. I didn't dig into it, as I didn't see a super quick way to check. Thought I'd pass it along.
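To illustrate the kind of mismatch described above, here is a minimal sketch of what happens when UTF-8 bytes are read as ISO-8859-1. The clue text and the `fits_latin1` helper are made up for illustration; neither comes from the linked puzzle or from puz.py.

```python
# Hypothetical clue text with an emoji; not taken from the linked puzzle.
clue = "Party time \U0001F389"

# The emoji takes four bytes in UTF-8.
utf8_bytes = clue.encode("utf-8")

# Decoding as ISO-8859-1 never raises, because every byte maps to some
# character, but the four emoji bytes turn into four unrelated characters.
misread = utf8_bytes.decode("iso-8859-1")


def fits_latin1(text):
    """Return True if text can be stored as ISO-8859-1 without loss.

    Illustrative helper only; not part of puz.py.
    """
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(repr(misread))
print(fits_latin1(clue))
```

Going the other way, an emoji cannot be written as ISO-8859-1 at all, so a writer has to replace it, drop it, or use a different encoding.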

alexdej commented 2 weeks ago

thank you for this. can you upload the original file? this looks edited. I am not seeing anything that looks like emoji in the binary file. [image]

amclauth commented 2 weeks ago

Well shoot. Sorry to waste your time. You're right that it's not in the .puz file at all. I'm using xword-dl to get the .puz files. Either the emoji aren't in the source, or xword-dl does some encoding itself and the problem is there, not with this project. Thanks for checking. Wish I'd done some more digging first.

alexdej commented 2 weeks ago

xword-dl uses puz.py internally so it's possible there's an issue there. it would be helpful to get the raw bytes if possible.
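For anyone wanting to inspect the raw bytes, a rough sketch along these lines could help. It scans a byte string for sequences that decode as multi-byte UTF-8. The function name and the sample bytes are hypothetical, and binary header fields in a real .puz file may produce false positives, so treat hits as candidates only.

```python
import re

# Candidate UTF-8 multi-byte sequences: a lead byte followed by the
# matching number of continuation bytes (0x80-0xBF).
UTF8_MULTIBYTE = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"
    rb"|[\xE0-\xEF][\x80-\xBF]{2}"
    rb"|[\xF0-\xF4][\x80-\xBF]{3}"
)


def find_utf8_sequences(data):
    """Return (offset, text) for each candidate that decodes as UTF-8.

    Hypothetical helper for illustration; not part of puz.py or xword-dl.
    """
    hits = []
    for match in UTF8_MULTIBYTE.finditer(data):
        try:
            hits.append((match.start(), match.group().decode("utf-8")))
        except UnicodeDecodeError:
            # Skip byte patterns that look right but are not valid UTF-8,
            # such as overlong encodings.
            continue
    return hits


# Made-up sample: one clue with a UTF-8 emoji, one with a Latin-1 "é".
sample = b"4. Party \xf0\x9f\x8e\x89\x00caf\xe9\x00"
print(find_utf8_sequences(sample))
```

To check an actual file, read it in binary mode (`open(path, "rb").read()`) and pass the result to the function. An empty list would suggest the emoji never made it into the file.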

alexdej commented 2 weeks ago

[Screenshot 2024-06-13 200529]

found the puzzle on The Atlantic. it looks like xword-dl is using puz.py to save the puzzle after scraping the clues etc. from the web site. not clear if the bug is in xword-dl or puz.py, but the emojis are definitely not making it to the .puz file. additionally, the resulting .puz file crashes Across Lite 2.0.