theroggy closed this 3 months ago
XLSX gave UTF-8 decoding errors when reading a file written after the change. GDAL does report that XLSX needs UTF-8 (via the OLCStringsAsUTF8 capability), but this capability was only advertised for existing files, not for new files/layers being created.
After reporting this via https://github.com/OSGeo/gdal/pull/9295 and some more debugging and searching, it turned out this had already been fixed in GDAL in https://github.com/OSGeo/gdal/pull/9301. So this will be resolved for GDAL >= 3.8.5.
Thanks for working on this! Still a bit unsure of the correct behavior for shapefiles when `encoding` is not provided by the user, but otherwise this looks reasonable.
I'm not sure about the best way forward either. I implemented it the same as the behaviour in Fiona and the GDAL default, but e.g. just always using UTF-8 sounds reasonable to me as well.
Not sure why Fiona and GDAL default to a 1990s-era encoding, but possibly some applications don't honour the ".cpg" sidecar properly when reading, leading to more risk of compatibility issues?
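For context, the `.cpg` sidecar is just a tiny text file next to the `.shp`/`.dbf` that names the code page of the `.dbf`. A minimal sketch of how a reader could honour it (a hypothetical helper for illustration, not pyogrio's actual implementation):

```python
from pathlib import Path


def read_cpg_encoding(shp_path, default="ISO-8859-1"):
    """Return the encoding declared in the shapefile's .cpg sidecar,
    falling back to a default when no sidecar is present (which is
    exactly the case where readers start guessing)."""
    cpg = Path(shp_path).with_suffix(".cpg")
    if cpg.exists():
        # The sidecar typically contains a single token like "UTF-8" or "CP1252".
        return cpg.read_text(encoding="ascii").strip()
    return default
```

Applications that skip this lookup and hard-code a legacy code page are the compatibility risk mentioned above.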
From an ESRI page (https://support.esri.com/en-us/knowledge-base/read-and-write-shapefile-and-dbase-files-encoded-in-var-000013192):
> Shapefiles can now be stored in UTF-8. However, shapefiles encoded in UTF-8 are only recognized in ArcMap, ArcCatalog and ArcGIS Pro.
Now, I don't know what this "only" means (i.e. what other ESRI products are not included in that list).
Anyway, given that we were already using UTF-8 and haven't received any complaints about it, I would personally keep it. It seems like the better default nowadays (the default in Fiona and GDAL probably stems from many years ago).
However, I would then maybe do that for all platforms, including Windows?
> However, I would then maybe do that for all platforms, including Windows?

I agree. If we go for "UTF-8", I would also vote to do it on all platforms.
EDIT: interesting detail: on the same ESRI page, in the "Summary", they state this:
> The default code page in a shapefile (.DBF) is set to UTF-8 (UNICODE). This is the default for current internationalization practices.
All right - let's go with UTF-8 as the default for shapefiles on all platforms, and revisit (by setting it to ISO-8859-1) if we get error reports from users.
UTF-8 is now the default for Shapefile on all platforms...
I noticed in some geofileops tests that using pyogrio to write/read dataframes to ".csv" files gives encoding issues.
This PR fixes those.
Odd detail: I only saw this behaviour on the GitHub CI Windows systems; locally I couldn't reproduce it. Apparently `locale.getpreferredencoding()` returns "UTF-8" there, even though when I run `locale.getpreferredencoding()` in a separate script on Windows it returns the typical "cp1252".
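The underlying issue is that `open()` without an explicit `encoding` falls back to the platform-dependent `locale.getpreferredencoding()`, which can differ between a local machine and CI. A small stdlib illustration (not the actual PR code) of why passing the encoding explicitly sidesteps this:

```python
import locale
import tempfile
from pathlib import Path

# open() without an explicit encoding uses this platform-dependent value
# ("cp1252" on a typical Windows setup, "UTF-8" on most Linux/macOS systems).
print(locale.getpreferredencoding(False))

# Passing encoding="utf-8" explicitly makes the round trip deterministic
# on every platform and under every CI configuration.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.csv"
    path.write_text("name\nCafé Zürich\n", encoding="utf-8")
    assert path.read_text(encoding="utf-8") == "name\nCafé Zürich\n"
```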