Open alexdewar opened 2 months ago
I can take a look at this.
I'm not sure if it's a great idea to just enforce utf-8? Perhaps we can pass that as a default (for all OS), and then if a custom encoding is set in the yaml header, we attempt to use that.
Go ahead with it. To be honest, I am not really sure of the implications.
I agree it will be useful to allow the user to specify the encoding. If we do so, it should be a top-level parameter to the read/write functions, not included within the header: if we try to read the header and the file has the wrong encoding, we won't be able to read it! So we need to know it upfront.
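For illustration, here is a minimal sketch of what passing the encoding as a top-level argument could look like. `read_header_lines` and `demo` are hypothetical names made up for this example, not part of `pycsvy`'s actual API, and the sketch assumes the header is delimited by `---` lines. It only aims to show why the encoding has to be known before the header is touched.

```python
import tempfile
from pathlib import Path


def read_header_lines(path, encoding="utf-8"):
    """Hypothetical helper: return the raw lines between the '---' markers.

    The encoding is a top-level argument (defaulting to UTF-8 on every
    platform) because it must be known before the header can be decoded.
    """
    header = []
    inside = False
    with open(path, encoding=encoding) as f:
        for line in f:
            if line.strip() == "---":
                if inside:
                    break  # closing marker: the header is complete
                inside = True  # opening marker: start collecting
            elif inside:
                header.append(line.rstrip("\n"))
    return header


def demo():
    """Write a UTF-8 file, then read its header with two encodings."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.csvy"
        path.write_text("---\nauthor: Díaz\n---\nname\nZoë\n", encoding="utf-8")
        good = read_header_lines(path)  # explicit UTF-8 default
        wrong = read_header_lines(path, encoding="cp1252")  # wrong guess
    return good, wrong
```

With the wrong encoding the header still decodes here, but the accented character comes back garbled, so anything stored in the header (including an encoding key) could not be trusted at that point.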
Currently, `pycsvy`'s various reader functions don't explicitly specify a text encoding in the underlying calls to `open()`, which means that the platform's default will be used.

On Linux and macOS, the default encoding is UTF-8, but for unfortunate legacy reasons the default encoding on Windows is typically `cp1252`. Even worse, this encoding only covers a small subset of Unicode, so if you try writing data containing characters outside that subset to file, you'll get an error :disappointed:. This actually crops up more often than you'd think, in data sets containing names with accents, for example (looking at you @dalonsoa...) and I've been bitten by it a few times recently.

It looks like UTF-8 will be made the default, but not until Python 3.15: https://peps.python.org/pep-0686/. In the meantime I think it probably makes sense to just make UTF-8 the default for `pycsvy`, regardless of the platform.

Another consideration is that users can specify the encoding in the YAML header. We have a couple of options here: