Make .encode(ENCODING) more robust

karlic commented 3 years ago

Using Python 3.7.9, with import html and from puz import (Puzzle, DefaultClueNumbering, BLACKSQUARE).

I'm converting files from the web into PUZ format with puzpy. The files contain HTML entities, which I'm unescaping with html.unescape.

So, for example:

&mdash; --> \u2014
&rsquo; --> \u2019

This causes: UnicodeEncodeError: 'latin-1' codec can't encode character '\u2014' in position 28: ordinal not in range(256)
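
The failure can be reproduced with the standard library alone. This is only a sketch: the value of ENCODING is inferred from the error message rather than copied from puz.py, and the sample clue text is made up for illustration.

```python
import html

# Assumption: puz.py encodes with latin-1, as the error message suggests.
ENCODING = 'latin-1'

# Hypothetical clue text containing the two entities mentioned above.
clue = html.unescape('Yes &mdash; it&rsquo;s here')

error = None
try:
    # Default error policy is 'strict', so characters outside latin-1 raise.
    clue.encode(ENCODING)
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

The em dash and right single quote both sit above code point 255, so neither has a latin-1 representation.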

In puz.py, replacing every instance of .encode(ENCODING) with .encode(ENCODING, 'replace') solves the problem.
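
For reference, here is roughly what the 'replace' policy does to such text, next to 'ignore' for comparison. The sample string is illustrative, and the sketch uses plain str.encode rather than anything from puz.py.

```python
# Same illustrative text as a literal: \u2014 is an em dash, \u2019 a right single quote.
text = 'Yes \u2014 it\u2019s here'

# 'replace' substitutes '?' for each character latin-1 cannot represent.
replaced = text.encode('latin-1', 'replace')

# 'ignore' silently drops those characters instead.
ignored = text.encode('latin-1', 'ignore')

print(replaced)
print(ignored)
```

Note that 'replace' is lossy: the saved puzzle will show a question mark where each unsupported character was.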

alexdej commented 3 years ago

Thanks for the report! Do you have a sample file you could link or attach that reproduces the issue?

alexdej commented 3 years ago

This is fixed in master now: you can override the encoding policy for the module like so:

import puz
puz.ENCODING_ERRORS = 'replace'
...
