biglocalnews / warn-scraper

Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites
https://warn-scraper.readthedocs.io
Apache License 2.0
28 stars 10 forks source link

UTF-8 implicity doing bad stuff on Windows #584

Open stucka opened 9 months ago

stucka commented 9 months ago

201 was caused by two different libraries making assumptions about how to save text files.

This appears to have been developed on Macs and Linux, upon which I think Python might be assuming everything gets saved as utf-8 by default. But on Windows, I think Python may be defaulting to its own character set, which IIRC is cp-1252. Character encoding is not explicit on any of these text file operations.

So on Windows many text operations are going to fail because UTF-8 characters are getting pulled from web pages and Excel files.

Per Pep 20, explicit is better than implicit.

It may not affect anything to permanently add encoding="utf-8" to these file operations, but I'd want to test that better first.

It's a decent bet similar flaws exist in warn-transformer, as well.

in cache

cache.write includes this: with open(out, "w", newline="") as fh:

cache.read includes this: with open(path, newline="") as infile:

cache.reaad_csv includes this: with open(path) as fh:

and in utils ...

utils.write_dict_rows_to_csv includes this: with open(output_path, mode, newline="") as f:

utils.write_dict_rows_to_csv includes this: with open(output_path, mode, newline="") as f:

stucka commented 9 months ago

To clarify: "It may not affect anything to permanently add encoding="utf-8" to these file operations, but I'd want to test that better first." -- I'd meant on Linux and Macs. Hopefully that would prevent other problems within Windows.

Opening a bug in warn-transformer along similar lines.

stucka commented 9 months ago

Looks like the encoding default is indeed UTF-8 on Linux and Mac, but cp1252 on Windows. Fixing the libraries would be easy.

https://peps.python.org/pep-0686/