BUG: fix encoding issues on windows for some formats

geopandas / pyogrio

Vectorized vector I/O using OGR

https://pyogrio.readthedocs.io

MIT License

258 stars 22 forks source link

BUG: fix encoding issues on windows for some formats #361

Closed theroggy closed 3 months ago

theroggy commented 4 months ago

I noticed in some geofileops tests that using pyogrio to write/read dataframes to ".csv" files gives encoding issues.

This PR fixes those.

Odd detail: I only saw this behaviour on the github CI windows systems: locally I couldn't reproduce. Apparently:

when running tests on my local windows locale.getprefferedencoding() returns "UTF-8", even though when I run locale.getprefferedencoding() in a seperate script it returns the typical one for windows: "cp1252".
in the tests on the github windows machines, locale.getprefferedencoding() returns "cp1252".

theroggy commented 4 months ago

XLSX gave UTF-8 decoding errors when reading a file written after the change. GDAL does say that XLSX needs UTF-8 (via the OLCStringsAsUTF8 capability), but this only worked for existing files, not for new files/layers being created.

After reporting this via https://github.com/OSGeo/gdal/pull/9295 and some more debugging and searching, this has already been fixed in GDAL in https://github.com/OSGeo/gdal/pull/9301. So, for GDAL >= 3.8.5 this will be fixed.

theroggy commented 4 months ago

Thanks for working on this! Still a bit unsure of the correct behavior for shapefiles when encoding is not provided by the user, but otherwise this looks reasonable.

I'm not sure about the best way forward either. I now implemented it the same as the behaviour in fiona and the default of gdal, but e.g. just always using UTF-8 sound reasonable to me as well.

Not sure why fiona and gdal default to the 1990's encoding, but possibly some applications e.g. don't use the ".cpg" properly when reading leading to more risk of compatibility issues?

jorisvandenbossche commented 4 months ago

From an ESRI page (https://support.esri.com/en-us/knowledge-base/read-and-write-shapefile-and-dbase-files-encoded-in-var-000013192):

Shapefiles can now be stored in UTF-8. However, shapefiles encoded in UTF-8 are only recognized in ArcMap, ArcCatalog and ArcGIS Pro.

Now, I don't know what this "only" means (i.e. what other ESRI products are not included in that list).

Anyway, given we were already using UTF-8 before and didn't yet get any complaints about that, I would personally keep that. That seems like a better default nowadays (while the default in fiona and GDAL probably stems from many years ago)

jorisvandenbossche commented 4 months ago

However, I would then maybe do that for all platforms, including Windows?

theroggy commented 4 months ago

However, I would then maybe do that for all platforms, including Windows?

I agree. If we would go for "UTF-8", I would also vote to do it for all platforms.

EDIT: interesting detail: on the same ESRI page, in the "Summary", they state this:

The default code page in a shapefile (.DBF) is set to UTF-8 (UNICODE). This is the default for current internationalization practices.

brendan-ward commented 4 months ago

All right - let's go with UTF-8 as the default for shapefiles on all platforms and revisit (by setting to ISO-8859-1) if we get errors reported by users.

theroggy commented 4 months ago

UTF-8 is now the default for Shapefile on all platforms...