freelawproject / reporters-db

A database of court reporters, tests and other experiments
BSD 2-Clause "Simplified" License

Importing laws.json converts § to Â§ (utf-8 encoding) #72

Closed munyanjacob closed 2 years ago

munyanjacob commented 2 years ago

My fork fails tests.py because every § symbol in laws.json is converted to Â§ when the file is imported. Updating __init__.py to open laws.json using encoding="utf-8" fixes it and allows it to pass the tests.
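The symptom can be reproduced without any files. This is only a minimal sketch of the general mechanism: the UTF-8 bytes for § are decoded as cp1252 (the encoding identified later in this thread), not the actual reporters-db loading code.

```python
# Reproduce the mojibake in memory: UTF-8 bytes decoded with the wrong codec.
SECTION = "\u00a7"  # the section sign

utf8_bytes = SECTION.encode("utf-8")    # two bytes: 0xC2 0xA7
misread = utf8_bytes.decode("cp1252")   # each byte becomes its own character
correct = utf8_bytes.decode("utf-8")    # round-trips to the original character

# ascii() keeps the output printable on any console encoding.
print(ascii(utf8_bytes), ascii(misread), ascii(correct))
```

The misread value is two characters long (U+00C2 followed by U+00A7), which is why the section sign shows up with a stray Â in front of it.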

I am on Windows, so maybe that is why it only seems to be an issue for me. I have not tried reproducing this on other machines, but it is reproducible on mine. I first discovered it while using eyecite.

I'm new to open source - would this have been a good thing to create a pull request for? Please give any advice or feedback! Thanks


jcushman commented 2 years ago

A pull request sounds good to me!

FYI, it sounds like you can avoid this kind of issue on Windows, if you're running Python 3.7+, by setting the -X utf8 command-line option or the PYTHONUTF8 environment variable. That's probably the best default for most users, since it makes Windows match the behavior of Mac and Linux. But no harm in patching reporters-db as well.

mlissner commented 2 years ago

Something is funky here, no? I thought Python was always utf-8 these days unless overridden? It'd surprise me if Windows is overriding this by default. (I'm not an expert here.)

munyanjacob commented 2 years ago

> Something is funky here, no? I thought Python was always utf-8 these days unless overridden?

You're right - the document jcushman sent says Python uses utf-8, and my default encoding is utf-8, but it turns out there is an exception: Python's open() uses the locale's preferred encoding when no encoding argument is given. Although all of my locale variables are set to use utf-8, my locale encoding is cp1252:

image
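For anyone who wants to run the same check, a rough sketch follows. The helper name describe_encodings is made up for illustration, and the values printed depend entirely on the machine and Python version.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings this interpreter reports (hypothetical helper)."""
    return {
        # Used by str.encode() / bytes.decode() when no codec is named.
        "default": sys.getdefaultencoding(),
        # Used for file names and paths.
        "filesystem": sys.getfilesystemencoding(),
        # Used by open() in text mode when encoding= is omitted.
        "locale_preferred": locale.getpreferredencoding(False),
        # 1 when UTF-8 mode (-X utf8 or PYTHONUTF8=1) is active.
        "utf8_mode": sys.flags.utf8_mode,
    }


info = describe_encodings()
for name, value in info.items():
    print(f"{name}: {value}")
```

On a Windows setup like the one described here, the "default" entry would be utf-8 while "locale_preferred" could still be cp1252.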

Adding -X utf8 or using export PYTHONUTF8=1 seems to work because each one overrides the locale setting:

image
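The override can also be checked from a script. This sketch assumes Python 3.7+ and starts child interpreters with each option; the helper preferred_encoding and the PROBE snippet are illustrative names, not part of reporters-db.

```python
import os
import subprocess
import sys

# One-liner run in the child interpreter: report the encoding open() would default to.
PROBE = "import locale; print(locale.getpreferredencoding(False))"


def preferred_encoding(extra_args=(), extra_env=None):
    """Start a fresh interpreter and return its locale preferred encoding."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a state without the override
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print("default:     ", preferred_encoding())
print("-X utf8:     ", preferred_encoding(extra_args=["-X", "utf8"]))
print("PYTHONUTF8=1:", preferred_encoding(extra_env={"PYTHONUTF8": "1"}))
```

The first line varies by platform; the other two should report UTF-8 (the exact capitalization differs between Python versions).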

I will make a pull request adding encoding="utf-8" when loading laws.json in case other users are cursed by weird Windows locale encoders. Thanks both of you!
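The kind of change described above might look roughly like this. It is a sketch only: load_json_utf8, the temporary file name, and the sample citation are assumptions made for illustration, not the code in reporters-db's __init__.py or the contents of laws.json.

```python
import json
import tempfile
from pathlib import Path


def load_json_utf8(path):
    """Load a JSON file, decoding it as UTF-8 regardless of the OS locale."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Write a small stand-in file (hypothetical data) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "laws_sample.json"
    sample.write_text(
        json.dumps({"cite": "42 U.S.C. \u00a7 1983"}, ensure_ascii=False),
        encoding="utf-8",
    )
    data = load_json_utf8(sample)

# ascii() keeps the output printable on any console encoding.
print(ascii(data["cite"]))
```

Passing encoding explicitly means the result no longer depends on the locale, so the same file reads identically on Windows, Mac, and Linux.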

mlissner commented 2 years ago

Thank you again. I think I just muddied the waters, so sorry about that. Happy to have any other fixes like this that you see.