Q726kbXuN / nytxw_puz

Turn NY Times crosswords into Across Lite files
The Unlicense

Handle non-Latin-1 characters properly #6

Closed: jkboyce closed this 2 years ago

jkboyce commented 2 years ago

Today's (Sep. 23, 2021) puzzle has a non-Latin-1 character (em dash, Unicode '\u2014') in some of the clues, which causes puz.py to throw an exception because the character cannot be encoded as Latin-1.
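For anyone who wants to reproduce the failure outside of puz.py, here is a minimal sketch. It assumes the exception comes from a plain Latin-1 encode step; the helper name `first_non_latin1` is made up for illustration and is not part of puz.py.

```python
def first_non_latin1(text):
    """Return the index of the first character that Latin-1 cannot encode, or None."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first offending character
        return exc.start
    return None


clue = "Quip \u2014 part 1"          # contains an em dash (U+2014)
print(first_non_latin1(clue))        # 5
print(first_non_latin1("caf\u00e9")) # None, since U+00E9 is inside Latin-1
```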

There are two ways to handle this. Method 1 is this pull request, which substitutes sensible Latin-1 alternatives for non-Latin-1 Unicode characters.
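A rough sketch of what Method 1 looks like in general terms. This is not the code in the pull request: the `REPLACEMENTS` table, the `to_latin1` name, and the `?` fallback are all assumptions chosen for illustration.

```python
import unicodedata

# Hypothetical replacement table; the pull request's actual table may differ.
REPLACEMENTS = {
    "\u2014": "--",   # em dash
    "\u2013": "-",    # en dash
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # horizontal ellipsis
}


def to_latin1(text, fallback="?"):
    """Return text with every non-Latin-1 character replaced by a Latin-1 stand-in."""
    out = []
    for ch in text:
        if ord(ch) < 256:
            # Already representable in Latin-1
            out.append(ch)
        elif ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
        else:
            # Try a compatibility decomposition and drop combining marks,
            # e.g. U+0101 (a with macron) becomes plain "a".
            decomposed = unicodedata.normalize("NFKD", ch)
            kept = "".join(
                c for c in decomposed
                if ord(c) < 256 and not unicodedata.combining(c)
            )
            out.append(kept or fallback)
    return "".join(out)


print(to_latin1("Quip \u2014 part 1"))  # Quip -- part 1
```

The result can always be encoded with `.encode("latin-1")`, at the cost of losing any character that has no entry in the table and no usable decomposition.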

The second method would be to output puz files in the v2 format, which supports UTF-8 encoding. The version of puzpy you're using supports this. I haven't investigated how widely supported the v2 format is; I'll do some further research. In the meantime this pull request addresses the issue.

jkboyce commented 2 years ago

Update -- Some findings on .puz file formats:

Given the above, I think it best to stick with v1.3 of the .puz file format, and deal with non-Latin-1 characters via conversion, as in this pull request.

Q726kbXuN commented 2 years ago

It's an interesting problem.

Here are all of the non-7-bit-clean characters that have appeared in a clue on the web. (The clues in .puz files and the clues on the website are mostly the same, but not always: some differences are clearly the result of human error rather than machine processing, which is itself an interesting thing.) glyph_counts.txt includes the counts, current through 2021-09-23, including the variety puzzles and missing only a handful of special cases where NYT published more than one main crossword on a day.
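For reference, a tally like this can be produced with a few lines of Python. This is only a sketch: the function name `count_non_ascii` is hypothetical, the sample clues are invented, and it says nothing about how glyph_counts.txt was actually generated or formatted.

```python
from collections import Counter


def count_non_ascii(clues):
    """Count every character above 7-bit ASCII across an iterable of clue strings."""
    counts = Counter()
    for clue in clues:
        counts.update(ch for ch in clue if ord(ch) > 127)
    return counts


# Invented sample clues, for illustration only
sample = ["Quip \u2014 part 1", "Caf\u00e9 order \u2014 with milk"]
for glyph, n in count_non_ascii(sample).most_common():
    print(f"U+{ord(glyph):04X} {n}")
```

Filtering the result for `ord(glyph) > 255` narrows it to the characters that actually need a Latin-1 replacement.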

@jkboyce If you want, feel free to extend the list of replacement glyphs using this as a guide. I'll take a stab if you don't, but I suspect I won't be able to get around to it for a while. This list is certainly good enough for now.

edsantiago commented 2 years ago

@jkboyce thank you!