Closed brendan-ward closed 2 months ago
This sets `SHAPE_ENCODING=""` (GDAL config option) to disable auto-decoding before opening the shapefile. Setting the dataset open option `ENCODING=""` also works, but it is specific to Shapefiles, and we don't know that the path resolves to a Shapefile until after opening it - which is too late to set the option.
The only disadvantage I see of using the config option vs the dataset open option is that it can lead to inter-thread side effects: if someone in one thread uses the `encoding` parameter, all other threads in the same process that are reading/writing data at the same time can be influenced by it and get encoding issues.
This can be avoided by first opening the file to check the driver when `encoding` is specified, so that the dataset open option can then be used... but I do agree it sounds like a far-fetched scenario, even for a multithreading race condition.
A final alternative is to pass the specified `encoding` to GDAL as a dataset open option, but this also needs a driver check to be sure it is a shapefile. I'm not sure whether GDAL does some special/other things to support encodings compared to what we do in pyogrio... but if so, it might be "safer" to just delegate this to GDAL.
Fiona used to use `CPLSetThreadLocalConfigOption` to set this option specifically within a thread; we can do that here too.
My concern with opening the file twice is that it might incur a lot of overhead for remote sources (e.g., shapefile in a zip on S3), but I don't have remote examples easy at hand to measure / prove that increase in overhead, it's just a guess. In contrast, setting the config option (so long as it doesn't negatively clobber other threads) should be low overhead.
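As a toy illustration of why the thread-local variant avoids cross-thread clobbering, here is a stdlib-only sketch; the dict-based config store is a stand-in for illustration, not GDAL's actual implementation:

```python
import threading

# Toy model of GDAL's config stores (a stand-in, not GDAL's implementation):
# a process-global dict vs. a thread-local namespace, which is what
# CPLSetThreadLocalConfigOption provides.
_global_config = {}
_local = threading.local()

def set_thread_local_option(key, value):
    if not hasattr(_local, "options"):
        _local.options = {}
    _local.options[key] = value

def get_option(key):
    # A thread-local value takes precedence; fall back to the global store.
    return getattr(_local, "options", {}).get(key, _global_config.get(key))

# This thread disables shapefile auto-decoding only for itself.
set_thread_local_option("SHAPE_ENCODING", "")

# A concurrent worker thread is unaffected by the thread-local setting.
seen_in_worker = []
worker = threading.Thread(
    target=lambda: seen_in_worker.append(get_option("SHAPE_ENCODING"))
)
worker.start()
worker.join()

print(repr(get_option("SHAPE_ENCODING")))  # '' in this thread
print(seen_in_worker)                      # [None]: other threads unaffected
```

With a process-global store instead, the worker would observe the `""` set by the first thread, which is exactly the side effect discussed above.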
Thank you so much for your help. When I noticed garbled text in the feedback, I realized that I forgot to mention that the shp file also displayed garbled text when using the `sql` parameter in `read_dataframe()`, regardless of the specified `encoding` parameter. I believe your fix has also resolved this issue. I am currently on vacation today, but I will verify it tomorrow.
Unfortunately I'm finding that using `sql` with a non-UTF-8 encoded file returns incorrect column names / values when `encoding` is not also specified, even if GDAL detects the correct native encoding and should be decoding it for us. We might be missing a decoding step somewhere; investigating now.
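For illustration, a stdlib-only sketch of the kind of garbling a missing decode step produces; the bytes and encoding mirror the cp936 case from #380, and this is not pyogrio code:

```python
# "中文" as stored in a cp936-encoded .dbf:
raw = "中文".encode("cp936")

# With the proper decode step, the text round-trips:
assert raw.decode("cp936") == "中文"

# If the decode step is skipped and the bytes are treated as UTF-8 (the
# default assumption), the result is garbage:
misread = raw.decode("utf-8", errors="replace")
print(misread)  # replacement characters, not "中文"
```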
I'm also finding that it is hard to use non-UTF-8 encodings with `use_arrow=True` because we don't decode the individual values; the raw values go directly into an Arrow table and immediately break certain things because they can't be decoded nicely. I was able to work around this for shapefiles by letting GDAL do the decoding for us, but I'm not sure how to do this for other drivers. Writing non-UTF values for some of those drivers is likely invalid, so it might not be an issue for those, but it seems theoretically possible for some of them (e.g., GeoJSON).
We could disable the `encoding` option altogether for `ogr_open_arrow` - though it does work for shapefiles because GDAL decodes those to UTF-8 for us (per e738f7b).
@jorisvandenbossche I'm unfamiliar with handling of non-UTF-8 column names / text values in pyarrow - should we even be trying to allow non-UTF-8 content through the Arrow API? I'm not seeing how we would intercept and decode them on our end like we do for the non-Arrow API.
So it turns out that - at least for shapefiles - the `OGRLayer` instance returned when using SQL always fails the test for `UTF-8` capabilities - even if there is a valid `.cpg` file present.
The workaround was to always test capabilities for shapefiles on the only layer that is present in the shapefile (i.e., the base layer underneath the SQL query layer).
I'm not sure to what degree this might be present for other drivers; I think most of those already return true when testing for `UTF-8` capabilities, so we didn't run into the same issue.
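That workaround can be sketched with stub objects; these classes and names are illustrative stand-ins, not the real OGR API:

```python
class StubLayer:
    """Stand-in for an OGR layer (illustrative, not the real API)."""
    def __init__(self, utf8_capable):
        self._utf8 = utf8_capable

    def test_capability_utf8(self):
        return self._utf8

class StubDataset:
    """Stand-in for an OGR dataset with a single base layer."""
    def __init__(self, driver, base_layer):
        self.driver = driver
        self._base = base_layer

    def get_layer(self, index):
        return self._base

def strings_are_utf8(dataset, layer):
    # SQL result layers on shapefiles misreport OLCStringsAsUTF8 as False
    # even with a valid .cpg, so probe the shapefile's only (base) layer.
    if dataset.driver == "ESRI Shapefile":
        layer = dataset.get_layer(0)
    return layer.test_capability_utf8()

base = StubLayer(utf8_capable=True)         # .cpg present; GDAL decodes for us
sql_result = StubLayer(utf8_capable=False)  # SQL result layer misreports
ds = StubDataset("ESRI Shapefile", base)
print(strings_are_utf8(ds, sql_result))  # True
```

For any other driver the capability is taken from the layer that was actually returned.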
> Fiona used to use `CPLSetThreadLocalConfigOption` to set this option specifically within a thread; we can do that here too.
Yep, that would be great.
> My concern with opening the file twice is that it might incur a lot of overhead for remote sources (e.g., shapefile in a zip on S3), but I don't have remote examples easy at hand to measure / prove that increase in overhead, it's just a guess. In contrast, setting the config option (so long as it doesn't negatively clobber other threads) should be low overhead.
Hmm... I hadn't considered this; I also don't have any experience with using remote files.
Not really relevant anymore because of the thread-safe solution above, but if the overhead is in the driver detection, it can be avoided by already specifying in `GDALOpenEx` that it is an "ESRI Shapefile", so the driver detection is avoided/minimized. If this were a significant overhead, it might be a good idea in general to offer the option to specify the driver(s) to take into consideration when opening the file.
> So it turns out that - at least for shapefiles - the `OGRLayer` instance returned when using SQL always fails the test for `UTF-8` capabilities - even if there is a valid `.cpg` file present.
Sounds like a bug in GDAL? Does the problem occur for both SQL dialects ('OGRSQL' and 'SQLITE')?
> The workaround was to always test capabilities for shapefiles on the only layer that is present in the shapefile (i.e., the base layer underneath the SQL query layer). I'm not sure to what degree this might be present for other drivers; I think most of those already return true when testing for `UTF-8` capabilities, so we didn't run into the same issue.
> Sounds like a bug in GDAL?
Not sure; it could also be a bug in how we are trying to use it on our end in this case. The GDAL Python bindings also return `False` for the capability but still return the correctly decoded value:
```python
from osgeo import ogr

drv = ogr.GetDriverByName("ESRI Shapefile")
ds = drv.Open("/tmp/test.shp", 0)

# Base layer: capability is reported correctly
lyr = ds.GetLayerByIndex(0)
print(f"Supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# True
print(lyr.schema[0].name)
# 中文

# SQL result layer: capability is reported as False, yet the field name
# is still decoded correctly
lyr = ds.ExecuteSQL("select * from test where \"中文\" = '中文' ", None, "")
print(f"Supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# False
print(lyr.schema[0].name)
# 中文
```
The Python bindings are a bit hard to trace through, I'm still trying to find where it determines the schema / field name in this case to see if it is doing something different than we are here.
> @jorisvandenbossche I'm unfamiliar with handling of non-UTF-8 column names / text values in pyarrow - should we even be trying to allow non-UTF-8 content through the Arrow API? I'm not seeing how we would intercept and decode them on our end like we do for the non-Arrow API.
By definition Arrow strings are UTF-8; using anything else would be considered invalid Arrow data. The only option is to store it as a column with a variable-size binary data type (and keep track of the actual encoding as a user). But that only works for column values; I don't think there is a workaround for column names.
Given that you already solved this for Shapefiles, and that for other file formats this seems a dubious use case anyway(?), it's probably fine not to support this for reading with Arrow?
I assume the only option to get this working is an upstream change in GDAL, for example an `encoding` option in `OGR_L_GetArrowStream` to allow the user to override this and let GDAL decode the read values to UTF-8 before putting them in the Arrow data.
But since for other formats there is no such thing as `SHAPE_ENCODING`, GDAL will still assume it is reading those files as UTF-8 and will return UTF-8 values to us, right? How does that not give errors within GDAL?
Or does GDAL just read the data and pass it on, assuming it is UTF-8 but not actually doing anything with it? So as long as we then decode those bytes with the user-provided encoding instead of UTF-8, this just happens to "work"? (But then e.g. looking at that file with `ogrinfo` would still give you garbage, because at that point GDAL will assume UTF-8?)
> Or GDAL just reads the data and passes it on, assuming it is UTF-8 but not actually doing anything with it? So as long as we then decode those bytes with the user-provided encoding instead of UTF-8, this just happens to "work"
This is my assumption, based on the testing here. GDAL and other readers of those other drivers (GPKG, FGB, etc) will show miscoded field names / values. So it isn't terribly useful - seems like it enables breaking other tools in the ecosystem - but we don't (yet) specifically prevent it. I don't know if other tools are able to validly write non-UTF-8 encodings to some of these, but it appears to be possible in practice by converting shapefiles to some of them without correctly accounting for alternative encodings (e.g., GDAL #7458).
It would take a bit of research to determine the list, but it would in theory be possible to define a list of drivers that we allow to be non-UTF-8 on write (e.g., shapefile, maybe CSV, XLSX?), and raise an error on others.
We can leave it a bit more open-ended for reading, in part to allow users to specifically force a recoding to cleanup data. But we have to restrict that to the non-Arrow API.
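Such a write-time check could look roughly like this; the allow-list and function name are illustrative guesses, not pyogrio's actual behavior:

```python
# Hypothetical allow-list of drivers that may be written with a non-UTF-8
# encoding (illustrative only; determining the real list needs research).
NON_UTF8_WRITE_DRIVERS = {"ESRI Shapefile", "CSV", "XLSX"}

def check_write_encoding(driver, encoding):
    """Raise if a non-UTF-8 encoding is requested for a UTF-8-only driver."""
    if encoding is None or encoding.replace("-", "").lower() == "utf8":
        return  # UTF-8 (or the default) is always fine
    if driver not in NON_UTF8_WRITE_DRIVERS:
        raise ValueError(
            f"driver {driver!r} requires UTF-8 data; got encoding {encoding!r}"
        )

check_write_encoding("ESRI Shapefile", "cp936")  # allowed
try:
    check_write_encoding("GPKG", "cp936")
except ValueError as exc:
    print(exc)
```

Reading would stay unrestricted (outside the Arrow API), so users can still force a recoding to clean up miscoded data.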
My current general understanding is as follows. There are 2 general types of format:

- For "ESRI Shapefile", this option is called ENCODING, and you can specify ENCODING="" to avoid any recoding. It is probably a pity this name is the same as the `encoding` parameter in pyogrio. If this wasn't the case, we could have just said: please specify the `SHP_ENCODING` open option.

To conclude, I think the "encoding" parameter is strictly speaking necessary for the first type of files (not sure if there are any others than CSV)... for the other formats everything could be controlled via open options, but due to the naming clash with the one for shapefile...
BTW if there is a clash with one of our own keywords, you can always explicitly pass the GDAL layer/dataset open option through `layer_options=dict(encoding="..")`
> BTW if there is a clash with one of our own keywords, you can always explicitly pass the GDAL layer/dataset open option through `layer_options=dict(encoding="..")`
Indeed, but only for `write_dataframe` at the moment, not (yet) for `read_dataframe`.
> Indeed, but only for `write_dataframe` at the moment, not (yet) for `read_dataframe`.
Ah, that's something we should probably add then.
For this PR, shall we merge it? (it just needs the merge conflicts fixed)
Am I right that if a user passes `encoding=` for a file format having `OLCStringsAsUTF8 == True` that is not "ESRI Shapefile", it will be ignored?
If so, maybe it is useful to show a warning in that case?
> if a user passes encoding= for a file format having OLCStringsAsUTF8 == True that is not "ESRI Shapefile", it will be ignored
For read (non-Arrow API) of drivers other than shapefile with the `encoding` parameter, we will attempt to use that encoding to decode the field names and string values regardless of `OLCStringsAsUTF8` support. This breaks the Arrow API because pyarrow expects everything in UTF-8 (per above).
For write of drivers other than shapefile (where we force UTF-8), we use the `encoding` parameter passed by the user to encode the field names and string values. This then renders them potentially unreadable in other tools that expect UTF-8.
I'm not sure we should equate support for `OLCStringsAsUTF8` with an indication that the driver expects everything in UTF-8, rather than only an indication that GDAL knows how to convert between a different encoding and UTF-8 (isn't that what you were getting at above?).
If we know that a given driver must be in UTF-8, we could raise an exception when attempting to write an alternative encoding.
Are you saying we should raise a warning during read if `OLCStringsAsUTF8 == True` for a driver that is not "ESRI Shapefile"? I.e., the user knows the data source is in a different encoding regardless of GDAL's capabilities to convert to UTF-8. Since `encoding` is opt-in, it seems like that is a deliberate choice and the warning would just nag them about it, right? Maybe I don't follow what you are getting at here.
> if a user passes encoding= for a file format having OLCStringsAsUTF8 == True that is not "ESRI Shapefile", it will be ignored
>
> For read (non-Arrow API) of drivers other than shapefile with the `encoding` parameter, we will attempt to use that encoding to decode the field names and string values regardless of `OLCStringsAsUTF8` support. This breaks the Arrow API because pyarrow expects everything in UTF-8 (per above). For write of drivers other than shapefile (where we force UTF-8), we use the `encoding` parameter passed by the user to encode the field names and string values. This then renders them potentially unreadable in other tools that expect UTF-8. I'm not sure we should equate support for `OLCStringsAsUTF8` with an indication that the driver expects everything in UTF-8, rather than only an indication that GDAL knows how to convert between a different encoding and UTF-8 (isn't that what you were getting at above?).
Yes, sorry, not sure why I wrote that; it being ignored is indeed not at all the case... It is used, but I think the cases where it leads to something useful are very limited. Not 100% sure, but I personally do think that `OLCStringsAsUTF8` is an indication that the driver almost always expects to receive UTF-8 data. The only possible exceptions I see are cases where the user specifies some specific GDAL options instructing GDAL not to do any encoding changes (e.g. https://gdal.org/drivers/vector/dxf.html#character-encodings).
If data is provided in another encoding and the GDAL driver writes "UTF-8" by default anyway, it won't recode and won't crash on it, as I think is the case in your tests. But if a recoding is needed and data is not provided to GDAL in UTF-8, I guess this can lead to errors.
> If we know that a given driver must be in UTF-8, we could raise an Exception when attempting to write an alternative encoding.
I think this is the case... but as there are so many drivers, I'm not 100% sure, which is why I was thinking of a warning.
> Are you saying we should raise a warning during read if OLCStringsAsUTF8 == True for a driver that is not "ESRI Shapefile"? I.e., the user knows the data source is in a different encoding regardless of GDAL's capabilities to convert to UTF-8. Since `encoding` is opt-in, it seems like that is a deliberate choice and the warning would just nag them about it, right? Maybe I don't follow what you are getting at here.
True, I suppose, about the nagging. I'm afraid that few users will really know what they are doing when specifying `encoding=` for an `OLCStringsAsUTF8 == True` file, but those who do would probably find the warning annoying.
Possibly/probably it is best to just "let it go" and solve any issues if they surface...
I've tried to address some of the discussion in the latest changes:

- For the Arrow API (open / read arrow), it will raise an exception if an alternative encoding is provided and the driver is not shapefile.
- We now specifically raise an exception for shapefiles if both the `encoding` parameter and "ENCODING"="<whatever>" open / layer creation options are combined, because this leads to order-of-operations issues given that we are specifically setting `SHAPE_ENCODING` or `ENCODING` on the user's behalf. Since `SHAPE_ENCODING` may be set in advance as a configuration option by the user (possibly re-used across many data sources), we don't raise an exception there; we just silently override it (so there isn't a case where they can set `SHAPE_ENCODING=""` to disable recoding by GDAL and then pass in an alternative `encoding` parameter). This seems like advanced territory, so I didn't think it necessary to raise a warning until we start seeing issues.
- This does not disable writing to an alternative encoding if provided by the user (hopefully they know the right thing to do for the target driver); I think we can leave this as is until we see issues.
- Added a specific test to verify that `locale.getpreferredencoding()` is used for CSV files when not providing the `encoding` parameter (only runs on Windows though).
- Added more tests for the Arrow write API, slimmed down the comments and repeated code around encoding in `_io.pyx`, and fixed an issue that would have bitten us immediately with the forthcoming GDAL 3.9: FlatGeobuf no longer allows writing non-UTF-8 values. We're already likely in unsupported territory with the tests that prove we can write non-UTF-8 values to drivers that likely expect UTF-8, but I left the other drivers in place for now because it proves the encoding works. Alternatively, we could simply block writing non-UTF-8 values for non-shapefile drivers where `OLCStringsAsUTF8` is `True`.
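The conflict rule above could be sketched as follows; the function name and signature are illustrative, not pyogrio's internals:

```python
def resolve_shapefile_encoding(encoding, open_options, config_options):
    """Hypothetical sketch of the precedence rules (not pyogrio's code).

    - encoding parameter + explicit "ENCODING" open/creation option: error.
    - encoding parameter + pre-set SHAPE_ENCODING config option: silently
      override the config option and disable GDAL's auto-decoding.
    """
    if encoding is not None and "ENCODING" in open_options:
        raise ValueError(
            'cannot provide both the encoding parameter and the "ENCODING" option'
        )
    if encoding is not None:
        # We decode on the user's behalf, so GDAL must hand us raw bytes.
        config_options["SHAPE_ENCODING"] = ""
    return config_options

cfg = resolve_shapefile_encoding("cp936", {}, {"SHAPE_ENCODING": "cp1252"})
print(repr(cfg["SHAPE_ENCODING"]))  # '': user's config option was overridden
```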
I didn't notice it before (and haven't yet looked in detail at what might be causing it), but merging this caused a bunch of failures when testing the wheels for ubuntu and older macos: https://github.com/geopandas/pyogrio/actions/runs/8848749467/job/24299655017
As described in #380, GDAL attempts to auto-detect the native encoding of a shapefile and will automatically use that to decode to UTF-8 before returning strings to us. This runs into conflicts when the user provides `encoding`, because we then apply the user's `encoding` to UTF-8 text, which produces incorrect text.
This sets `SHAPE_ENCODING=""` (GDAL config option) to disable auto-decoding before opening the shapefile. Setting the dataset open option `ENCODING=""` also works, but it is specific to Shapefiles, and we don't know that the path resolves to a Shapefile until after opening it - which is too late to set the option.
This now correctly handles `encoding="cp936"` for the shapefiles described in #380. I'm still trying to create some test cases here, but running into issues saving to the correct native encoding to prove that.
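A stdlib-only sketch of the double-decoding conflict described above; the data mirrors the cp936 case from #380 and the code is purely illustrative:

```python
# Bytes as stored in a cp936-encoded .dbf:
stored = "中文".encode("cp936")

# GDAL auto-detects the encoding (via the .cpg file) and decodes to UTF-8:
auto_decoded = stored.decode("cp936")

# If the user's encoding is then applied on top of the already-decoded text,
# the result is incorrect (this is the conflict SHAPE_ENCODING="" avoids):
garbled = auto_decoded.encode("utf-8").decode("cp936", errors="replace")
print(garbled == "中文")  # False

# With auto-decoding disabled, a single decode with the user's encoding
# yields the correct text:
print(stored.decode("cp936"))  # 中文
```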