Closed jkboyce closed 2 years ago
Update -- Some findings on .puz file formats:
Given the above, I think it best to stick with v1.3 of the .puz file format, and deal with non-Latin-1 characters via conversion, as in this pull request.
It's an interesting problem.
Here are all of the non-7-bit-clean characters that have appeared in a clue on the website. (The clues in .puz files and the clues on the website are mostly the same, but not always: some differences are clearly the result of human error rather than machine processing, which is itself interesting.) glyph_counts.txt includes the counts, current through 2021-09-23, including the variety puzzles and missing only a handful of special cases where NYT published more than one main crossword on a day.
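For anyone who wants to reproduce a tally like this, here is a minimal sketch of counting non-7-bit-clean characters. It is not the script that produced glyph_counts.txt; the function name `count_non_ascii` and the sample clues are made up for illustration.

```python
from collections import Counter


def count_non_ascii(clues):
    """Count every character above U+007F across an iterable of clue strings."""
    counts = Counter()
    for clue in clues:
        counts.update(ch for ch in clue if ord(ch) > 0x7F)
    return counts


# Invented sample clues, written with escapes so the source stays ASCII.
sample_clues = [
    "Caf\u00e9 order",
    "Yes \u2014 or no",
    "\u201cQuoted\u201d \u2014 clue",
]

for glyph, n in count_non_ascii(sample_clues).most_common():
    print(f"U+{ord(glyph):04X} {glyph!r} {n}")
```

Characters in the U+0080 to U+00FF range (such as the accented e above) are counted too, even though they fit in Latin-1, since the tally is about anything that is not 7-bit clean.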
@jkboyce If you want, feel free to extend the list of replacement glyphs using this as a guide. I'll take a stab at it if you don't, but I suspect I won't be able to get to it for a while. This list is certainly good enough for now.
@jkboyce thank you!
Today's (Sep. 23, 2021) puzzle has a Unicode character (an em dash, U+2014) in some of the clues, which causes puz.py to raise an exception because the character is outside of Latin-1.
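The underlying Python behavior can be seen without puz.py at all. This sketch only shows that the em dash has no Latin-1 encoding; the helper name `first_non_latin1` is hypothetical, and the exact code path inside puz.py is not shown here.

```python
def first_non_latin1(text):
    """Return the first character that Latin-1 cannot encode, or None."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError as exc:
        # exc.object is the input string; start/end bound the offending span.
        return exc.object[exc.start:exc.end]
    return None


clue = "Yes \u2014 or no"  # invented clue containing an em dash
bad = first_non_latin1(clue)
if bad is not None:
    print(f"cannot encode U+{ord(bad[0]):04X} as Latin-1")
```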
There are two ways to handle this. The first is this pull request, which substitutes sensible Latin-1 alternatives for non-Latin-1 Unicode characters.
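As a rough illustration of the substitution approach (not the code in this pull request), a replacement table plus a fallback might look like the following. The table contents, the name `to_latin1_safe`, and the `?` fallback are all assumptions made for this sketch.

```python
# Hypothetical replacement table; the pull request's actual list may differ.
REPLACEMENTS = {
    "\u2014": "--",   # em dash
    "\u2013": "-",    # en dash
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # horizontal ellipsis
}


def to_latin1_safe(text, fallback="?"):
    """Return text that is guaranteed to encode as Latin-1."""
    out = []
    for ch in text:
        if ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
        elif ord(ch) < 256:
            # Python's latin-1 codec covers code points 0 through 255.
            out.append(ch)
        else:
            out.append(fallback)
    return "".join(out)


safe = to_latin1_safe("Yes \u2014 or no")
safe.encode("latin-1")  # no longer raises
print(safe)
```

Characters that are already in Latin-1 pass through unchanged, so accented letters survive; only characters with no table entry and no Latin-1 encoding fall back to the placeholder.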
The second method would be to output puz files in the v2 format, which supports UTF-8 encoding. The version of puzpy you're using supports this. I haven't investigated how widely supported the v2 format is; I'll do some further research. In the meantime, this pull request addresses the issue.