ImperialCollegeLondon / pycsvy

Python reader/writer for CSV files with YAML header information
BSD 3-Clause "New" or "Revised" License

`pycsvy` won't work with Unicode data on Windows #86

Open alexdewar opened 2 months ago

alexdewar commented 2 months ago

Currently, pycsvy's various reader functions don't pass an explicit text encoding to the underlying calls to open(), which means that the platform's default is used.

On Linux and macOS, the default encoding is UTF-8, but for unfortunate legacy reasons the default encoding on Windows is typically cp1252. Even worse, this encoding covers only a small subset of Unicode, so if you try writing data containing characters outside that subset, you'll get an error :disappointed:. This actually crops up more often than you'd think, in data sets containing names with accents, for example (looking at you @dalonsoa...), and I've been bitten by it a few times recently.
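To make the failure mode concrete, here is a minimal sketch that does not touch pycsvy at all. It only uses `str.encode`/`bytes.decode`; the helper name `can_encode` and the sample strings are made up for illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


city = "Łódź"  # "Ł" and "ź" have no cp1252 mapping

print(can_encode(city, "utf-8"))   # True
print(can_encode(city, "cp1252"))  # False

# Some accented characters do exist in cp1252, but UTF-8 bytes that are
# decoded as cp1252 come back garbled rather than raising an error.
garbled = "José".encode("utf-8").decode("cp1252")  # 'JosÃ©'
```

So, depending on the characters involved, a mismatch shows up either as a `UnicodeEncodeError`/`UnicodeDecodeError` or as silently corrupted text.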

It looks like UTF-8 will be made the default, but not until Python 3.15: https://peps.python.org/pep-0686/. In the meantime, I think it probably makes sense to just make UTF-8 the default for pycsvy, regardless of the platform.
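As a rough sketch of what that change amounts to, the only difference is passing `encoding="utf-8"` to `open()`. The `roundtrip` helper and the file name below are hypothetical and are not part of pycsvy.

```python
import tempfile
from pathlib import Path


def roundtrip(text: str) -> str:
    """Write `text` to a temporary file and read it back, always as UTF-8."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.csvy"  # hypothetical file name
        # An explicit encoding makes the result independent of the platform default
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        with open(path, encoding="utf-8", newline="") as f:
            return f.read()


data = "name\nŁódź\n"
assert roundtrip(data) == data
```

Without the `encoding` argument, the same code would use the locale's encoding and could fail on Windows for this input.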

Another consideration is that users can specify the encoding in the YAML header. We have a couple of options here:

  1. If the user specifies an encoding other than UTF-8, we raise an error/warning and use UTF-8 anyway
  2. We use the user's specified encoding to parse the file (not sure whether we should also read the header with this encoding -- which will require reading it twice -- or just the CSV portion of the file)

dc2917 commented 1 month ago

I can take a look at this.

I'm not sure it's a great idea to just enforce UTF-8. Perhaps we can pass that as the default (on every OS), and then, if a custom encoding is set in the YAML header, attempt to use that.
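Something like the following is what I have in mind, as a very rough sketch rather than a proposal for the actual implementation. It assumes the header sits between two `---\n` lines, that the encoding is given by a top-level `encoding:` key, and that the header itself is readable as UTF-8. The function name `split_and_decode` is made up, and a real version would use the YAML parser instead of a regex.

```python
import re


def split_and_decode(raw: bytes, default: str = "utf-8") -> tuple[str, str]:
    """Return (header_text, csv_text) for a CSVY document held in memory."""
    # Assumption: the file starts with "---\n" and the header ends with "---\n"
    parts = raw.split(b"---\n", 2)
    if len(parts) != 3 or parts[0] != b"":
        raise ValueError("No YAML header found")
    header_bytes, body_bytes = parts[1], parts[2]

    # The header is always decoded with the default encoding
    header = header_bytes.decode(default)

    # Assumption: a top-level "encoding: <name>" line, if present
    match = re.search(
        r"^encoding:\s*['\"]?([\w.-]+)['\"]?\s*$", header, flags=re.MULTILINE
    )
    encoding = match.group(1) if match else default
    return header, body_bytes.decode(encoding)


raw = b"---\nencoding: cp1252\n---\nname\nJos\xe9\n"
header, body = split_and_decode(raw)
assert body == "name\nJosé\n"
```

Working on the raw bytes avoids opening the file twice, but it only helps if the header can be decoded with the default encoding in the first place.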

dalonsoa commented 1 month ago

Go ahead with it. To be honest, I am not really sure of the implications.

I agree it will be useful to allow the user to specify the encoding. If we do so, it should be a top-level parameter to the read/write functions, not included within the header, because if we try to read the header and the file has the wrong encoding, we won't be able to read it! So we need to know it upfront.
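Roughly, the shape would be something like the sketch below. The names `write_rows` and `read_rows` are placeholders and not the current pycsvy functions, and the YAML header handling is left out to keep it short.

```python
import csv
import tempfile
from pathlib import Path


def write_rows(path, rows, encoding="utf-8"):
    """Write rows as CSV; the caller chooses the encoding, UTF-8 by default."""
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)


def read_rows(path, encoding="utf-8"):
    """Read rows back; the encoding must be known before opening the file."""
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))


rows = [["name", "city"], ["José", "Łódź"]]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.csv"  # hypothetical file name
    write_rows(path, rows)  # UTF-8 unless the caller says otherwise
    assert read_rows(path) == rows
```

With this shape, the encoding is fixed before anything in the file is parsed, so the header and the CSV portion are always read consistently.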