geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

FIX: correctly use GDAL auto-decoding of shapefiles when encoding is set #384

Closed: brendan-ward closed this 2 months ago

brendan-ward commented 3 months ago

As described in #380, GDAL attempts to auto-detect the native encoding of a shapefile and will automatically use that to decode to UTF-8 before returning strings to us. This conflicts with a user-provided encoding, because we then apply the user's encoding to text that is already UTF-8, which produces incorrect text.

This sets SHAPE_ENCODING="" (GDAL config option) to disable auto-decoding before opening the shapefile. Setting the dataset open option ENCODING="" also works, but is specific to Shapefiles, and we don't know the path resolves to a Shapefile until after opening it - which is too late to set the option.
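
For illustration, the effect of this config option looks roughly like the following with the GDAL Python bindings (this is not the actual pyogrio code; the path and encoding are placeholders):

from osgeo import gdal, ogr

# Clearing SHAPE_ENCODING ("" = do not recode) disables GDAL's auto-decoding,
# so attribute bytes come back untouched and can be decoded manually.
gdal.SetConfigOption("SHAPE_ENCODING", "")
ds = ogr.Open("/tmp/example.shp")  # hypothetical shapefile
# ... read raw field values and decode them with the user-supplied encoding
# (e.g. "cp936") on the Python side ...
gdal.SetConfigOption("SHAPE_ENCODING", None)  # restore the default behavior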

This now correctly handles encoding="cp936" for the shapefiles described in #380. I'm still trying to create some test cases here, but am running into issues saving test files in the correct native encoding to prove it.

theroggy commented 3 months ago

This sets SHAPE_ENCODING="" (GDAL config option) to disable auto-decoding before opening the shapefile. Setting the dataset open option ENCODING="" also works, but is specific to Shapefiles, and we don't know the path resolves to a Shapefile until after opening it - which is too late to set the option.

The only disadvantage I see of using the config option vs the dataset open option is that it can lead to inter-thread side effects: if someone uses the encoding parameter in one thread, all other threads in the same process that are reading/writing data at the same time can be influenced by it and get encoding issues. This can be avoided by first opening the file to check the driver when encoding is specified, so the dataset open option can be used instead... but I do agree it sounds like a far-fetched scenario, even for a multithreading race condition.

A final alternative is to pass the encoding specified by the user to GDAL as a dataset open option, but this also needs a driver check to be sure it is a shapefile. I'm not sure whether GDAL does some special/other things to support encodings compared to what we do in pyogrio... but if it does, it might be "safer" to just delegate this to GDAL.

brendan-ward commented 3 months ago

Fiona used to use CPLSetThreadLocalConfigOption to set this option specifically within a thread; we can do that here too.
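
A rough sketch of that, assuming the GDAL Python bindings expose the thread-local variant (the underlying C call is CPLSetThreadLocalConfigOption); this is illustrative rather than pyogrio's actual Cython code:

from osgeo import gdal

# A thread-local override cannot leak into other threads that are reading or
# writing data at the same time.
gdal.SetThreadLocalConfigOption("SHAPE_ENCODING", "")
try:
    ...  # open the shapefile and decode strings with the user's encoding
finally:
    gdal.SetThreadLocalConfigOption("SHAPE_ENCODING", None)  # restore default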

My concern with opening the file twice is that it might incur a lot of overhead for remote sources (e.g., shapefile in a zip on S3), but I don't have remote examples easy at hand to measure / prove that increase in overhead, it's just a guess. In contrast, setting the config option (so long as it doesn't negatively clobber other threads) should be low overhead.

stonereese commented 3 months ago

Fiona used to use CPLSetThreadLocalConfigOption to set this option specifically within a thread; we can do that here too.

My concern with opening the file twice is that it might incur a lot of overhead for remote sources (e.g., shapefile in a zip on S3), but I don't have remote examples easy at hand to measure / prove that increase in overhead, it's just a guess. In contrast, setting the config option (so long as it doesn't negatively clobber other threads) should be low overhead.

Thank you so much for your help. When I noticed garbled text in the feedback, I realized that I forgot to mention that the shp file also displayed garbled text when using the sql parameter in read_dataframe(), regardless of the specified encoding parameter. I believe your fix has also resolved this issue. I am currently on vacation today, but I will verify it tomorrow.

brendan-ward commented 2 months ago

Unfortunately I'm finding that using sql with a non-UTF-8 encoded file returns incorrect column names / values when encoding is not also specified, even though GDAL detects the correct native encoding and should be decoding it for us. We might be missing a decoding step somewhere; investigating now.

I'm also finding that it is hard to use non-UTF-8 encodings with use_arrow=True because we don't decode the individual values; the raw values go directly into an Arrow table and immediately break things downstream because they are not valid UTF-8. I was able to work around this for shapefiles by letting GDAL do the decoding for us, but I'm not sure how to do this for other drivers. Writing non-UTF-8 values is likely invalid for some of those drivers, so it might not be an issue there, but it seems theoretically possible for some of them (e.g., GeoJSON).

We could disable the encoding option altogether for ogr_open_arrow - though it does work for shapefiles, because GDAL decodes those to UTF-8 for us (per e738f7b).

@jorisvandenbossche I'm unfamiliar with handling of non-UTF-8 column names / text values in pyarrow - should we even be trying to allow non-UTF-8 content through the Arrow API? I'm not seeing how we would intercept and decode them on our end like we do for the non-Arrow API.

brendan-ward commented 2 months ago

So it turns out that - at least for shapefiles - the OGRLayer instance returned when using an SQL query always fails the test for UTF-8 capabilities, even if there is a valid .cpg file present.

The workaround was to always test capabilities for shapefiles on the only layer that is present in the shapefile (i.e., the base layer underneath the SQL query layer).
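
Roughly, with the GDAL Python bindings (illustrative only; the path and layer name are placeholders):

from osgeo import ogr

ds = ogr.Open("/tmp/example.shp")
sql_lyr = ds.ExecuteSQL("SELECT * FROM example", None, "")

# The SQL result layer reports False here even when a valid .cpg is present,
# so test the shapefile's single base layer instead.
supports_utf8 = bool(ds.GetLayerByIndex(0).TestCapability(ogr.OLCStringsAsUTF8))

ds.ReleaseResultSet(sql_lyr)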

I'm not sure to what degree this might affect other drivers; I think most of them already return true when testing for UTF-8 capabilities, so we didn't run into the same issue.

theroggy commented 2 months ago

Fiona used to use CPLSetThreadLocalConfigOption to set this option specifically within a thread; we can do that here too.

Yep, that would be great.

My concern with opening the file twice is that it might incur a lot of overhead for remote sources (e.g., shapefile in a zip on S3), but I don't have remote examples easy at hand to measure / prove that increase in overhead, it's just a guess. In contrast, setting the config option (so long as it doesn't negatively clobber other threads) should be low overhead.

Hmm... I hadn't considered this; I also don't have any experience with using remote files.

Not really relevant anymore because of the thread-safe solution above, but if the overhead is in the driver detection, it can be avoided by already telling GDALOpenEx that the file is an "ESRI Shapefile", so the driver detection is skipped/minimized. If this turns out to be significant overhead, it might be a good idea in general to offer an option to specify which driver(s) to consider when opening a file.
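
For reference, in the Python bindings that corresponds to passing allowed_drivers to gdal.OpenEx (hypothetical path):

from osgeo import gdal

# Restricting the candidate drivers skips probing every registered driver.
ds = gdal.OpenEx(
    "/tmp/example.shp",
    gdal.OF_VECTOR,
    allowed_drivers=["ESRI Shapefile"],
)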

theroggy commented 2 months ago

So it turns out that - at least for shapefiles - the OGRLayer instance returned when using an SQL query always fails the test for UTF-8 capabilities, even if there is a valid .cpg file present.

Sounds like a bug in GDAL? The problem occurs for both sql dialects ('OGRSQL' and 'SQLITE')?

The workaround was to always test capabilities for shapefiles on the only layer that is present in the shapefile (i.e., the base layer underneath the SQL query layer).

I'm not sure to what degree this might affect other drivers; I think most of them already return true when testing for UTF-8 capabilities, so we didn't run into the same issue.

brendan-ward commented 2 months ago

Sounds like a bug in GDAL?

Not sure, it could also be a bug in how we are trying to use it on our end in this case. The GDAL Python bindings also return False for the capability but still return the correctly decoded value:

from osgeo import ogr

drv = ogr.GetDriverByName("ESRI Shapefile")
ds = drv.Open("/tmp/test.shp", 0)

# Base layer: reports UTF-8 support and returns the decoded field name.
lyr = ds.GetLayerByIndex(0)
print(f"Supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# True
print(lyr.schema[0].name)
# 中文

# Layer returned by ExecuteSQL: the capability test fails, yet the field name
# still comes back correctly decoded.
lyr = ds.ExecuteSQL("select * from test where \"中文\" = '中文' ", None, "")
print(f"Supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# False
print(lyr.schema[0].name)
# 中文

The Python bindings are a bit hard to trace through; I'm still trying to find where the schema / field name is determined in this case, to see if it is doing something different than we are doing here.

jorisvandenbossche commented 2 months ago

@jorisvandenbossche I'm unfamiliar with handling of non-UTF-8 column names / text values in pyarrow - should we even be trying to allow non-UTF-8 content through the Arrow API? I'm not seeing how we would intercept and decode them on our end like we do for the non-Arrow API.

By definition, Arrow strings are UTF-8; using anything else would be considered invalid Arrow data. The only solution is to store the values in a column with a variable-size binary data type (and keep track of the actual encoding as a user). But that only works for column values; I don't think there is a workaround for column names.
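
A small illustration of that limitation (the encoding here is just an example):

import pyarrow as pa

raw = "中文".encode("cp936")  # bytes in a non-UTF-8 encoding

# Putting these bytes into an Arrow string column would be invalid Arrow data;
# a variable-size binary column keeps the raw bytes and leaves decoding to the user.
col = pa.array([raw], type=pa.binary())
print(col[0].as_py().decode("cp936"))  # 中文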

Given that you already solved this for Shapefiles, and that for other file formats this seems a dubious use case anyway(?), it's probably fine not to support this for reading with Arrow? I assume the only way to get this working is an upstream change in GDAL, for example an encoding option in OGR_L_GetArrowStream that lets the user override the detection and have GDAL decode the read values to UTF-8 before putting them in the Arrow data.

jorisvandenbossche commented 2 months ago

But since for other formats there is no such thing as SHAPE_ENCODING, GDAL will still assume it is reading those files as UTF-8 and is returning UTF-8 values to us, right? How does that not give errors within GDAL?

Or does GDAL just read the data and pass it on, assuming it is UTF-8 but not actually doing anything with it? So as long as we then decode those bytes with the user-provided encoding instead of UTF-8, this just happens to "work"? (But then, e.g., looking at that file with ogrinfo would still give you garbage, because at that point GDAL will assume UTF-8?)

brendan-ward commented 2 months ago

Or does GDAL just read the data and pass it on, assuming it is UTF-8 but not actually doing anything with it? So as long as we then decode those bytes with the user-provided encoding instead of UTF-8, this just happens to "work"

This is my assumption, based on the testing here. GDAL and other readers of those other drivers (GPKG, FGB, etc.) will show miscoded field names / values. So it isn't terribly useful - it mostly enables breaking other tools in the ecosystem - but we don't (yet) specifically prevent it. I don't know whether other tools can validly write non-UTF-8 encodings to some of these formats, but it appears to be possible in practice by converting shapefiles to them without correctly accounting for alternative encodings (e.g., GDAL #7458).

It would take a bit of research to determine the list, but it would in theory be possible to define a list of drivers that we allow to be non-UTF-8 on write (e.g., shapefile, maybe CSV, XLSX?), and raise an error on others.

We can leave it a bit more open-ended for reading, in part to allow users to specifically force a recoding to clean up data. But we have to restrict that to the non-Arrow API.
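
For example (hypothetical file name; this assumes the data was previously written with a mismatched encoding):

import pyogrio

# Reinterpret the stored bytes using the known source encoding; use_arrow must
# stay False because the Arrow path requires valid UTF-8.
df = pyogrio.read_dataframe("/tmp/miscoded.gpkg", encoding="cp936", use_arrow=False)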

theroggy commented 2 months ago

My current general understanding is as follows. There are 2 general types of format:

  1. "rude" formats that don't have a fixed encoding and don't have metadata with them to indicate the encoding.
    • OLCStringsAsUTF8 is False
    • e.g. "CSV"
    • for writing: GDAL doesn't do any conversions, it just writes the string/column data as it is.
    • for reading: GDAL doesn't do any conversions, it just passes the string/column data on as it is.
  2. formats that support one or more encodings and have metadata about it, or the encoding can be deduced by GDAL.
    • OLCStringsAsUTF8 is True
    • e.g. "ESRI Shapefile", most xml or json formats,...
    • for writing: GDAL expects to receive UTF-8 and will write it in the default encoding of the format. As far as I know/found, "ESRI Shapefile" is the only one that supports specifying the encoding to use for writing (via the layer creation option "ENCODING"), in which case GDAL will do the conversion to the encoding specified.
    • for reading: GDAL will by default (try to) convert the data to UTF-8 based on the metadata found in the file. Depending on the file format, there are dataset open options to specify the encoding to be used for reading, and then GDAL will do the conversion to UTF-8.
      • For "ESRI Shapefile", this options is called ENCODING, and you can specify ENCODING="" to avoid any recoding. Probably it is a pity this name is the same as the encoding parameter in pyogrio. If this wasn't the case, we could have just said: please specify the SHP_ENCODING open option.
      • For "DXF", this option is called DXF_ENCODING, and you can specify DXF_ENCODING="UTF-8" to avoid any recoding.
      • For most XML-based files (e.g. "GPX", "KML",...), the encoding should be correct in the xml file; you cannot overrule it in software.
      • For json-based formats, I suppose GDAL autodetects the encoding based on the content.

To conclude, I think the "encoding" parameter is, strictly speaking, only necessary for the first type of format (not sure if there are any others than CSV)... for the other formats everything could be controlled via open options, were it not for the naming clash with the shapefile option...
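
To make the two cases concrete (hypothetical files; this assumes read_dataframe forwards unknown keyword arguments to GDAL as dataset open options):

import pyogrio

# Type 1 ("rude") format: CSV carries no encoding metadata, so pyogrio's own
# encoding parameter is what decodes the raw bytes.
df_csv = pyogrio.read_dataframe("/tmp/example.csv", encoding="cp936")

# Type 2 format: the shapefile driver can do the recoding to UTF-8 itself when
# told the source encoding via its ENCODING dataset open option.
df_shp = pyogrio.read_dataframe("/tmp/example.shp", ENCODING="CP936")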

jorisvandenbossche commented 2 months ago
  • For "ESRI Shapefile", this options is called ENCODING, and you can specify ENCODING="" to avoid any recoding. Probably it is a pity this name is the same as the encoding parameter in pyogrio. If this wasn't the case, we could have just said: please specify the SHP_ENCODING open option.

BTW if there is a clash with one of our own keywords, you can always explicitly pass the GDAL layer/dataset open option through layer_options=dict(encoding="..")
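
For example (hypothetical output path, with a small throwaway GeoDataFrame):

import geopandas as gpd
import pyogrio
from shapely.geometry import Point

gdf = gpd.GeoDataFrame({"名称": ["中文"]}, geometry=[Point(0, 0)], crs="EPSG:4326")

# The shapefile driver's ENCODING layer creation option is passed through
# layer_options, side-stepping the clash with pyogrio's own encoding keyword.
pyogrio.write_dataframe(gdf, "/tmp/example.shp", layer_options={"ENCODING": "CP936"})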

theroggy commented 2 months ago

BTW if there is a clash with one of our own keywords, you can always explicitly pass the GDAL layer/dataset open option through layer_options=dict(encoding="..")

Indeed, but only for write_dataframe at the moment, not (yet) for read_dataframe.

jorisvandenbossche commented 2 months ago

Indeed, but only for write_dataframe at the moment, not (yet) for read_dataframe.

Ah, that's something we should probably add then.


For this PR, shall we merge it? (it just needs the merge conflicts fixed)

theroggy commented 2 months ago

Am I right that if a user passes encoding= for a file format having OLCStringsAsUTF8 == True that is not "ESRI Shapefile", it will be ignored? If so, maybe it is useful to show a warning in that case?

brendan-ward commented 2 months ago

if a user passes encoding= for a file format having OLCStringsAsUTF8 == True that is not "ESRI Shapefile", it will be ignored

For reads (non-Arrow API) of drivers other than shapefile with the encoding parameter, we will attempt to use that encoding to decode the field names and string values, regardless of OLCStringsAsUTF8 support. This breaks the Arrow API because pyarrow expects everything in UTF-8 (per above).

For writes to drivers other than shapefile (where we force UTF-8), we use the encoding parameter passed by the user to encode the field names and string values. This can render them unreadable in other tools that expect UTF-8.

I'm not sure we should treat support for OLCStringsAsUTF8 as an indication that the driver expects everything in UTF-8, rather than only an indication that GDAL knows how to convert between a different encoding and UTF-8 (isn't that what you were getting at above?).

If we know that a given driver must be in UTF-8, we could raise an Exception when attempting to write an alternative encoding.

Are you saying we should raise a warning during read if OLCStringsAsUTF8 == True and the driver is not "ESRI Shapefile"? i.e., the user knows the data source is in a different encoding, regardless of GDAL's ability to convert to UTF-8. Since `encoding` is opt-in, it seems like that is a deliberate choice and the warning would just nag them about it, right? Maybe I don't follow what you are getting at here.

theroggy commented 2 months ago

if a user passes encoding= for a file format having OLCStringsAsUTF8 == True that is not "ESRI Shapefile", it will be ignored

For reads (non-Arrow API) of drivers other than shapefile with the encoding parameter, we will attempt to use that encoding to decode the field names and string values, regardless of OLCStringsAsUTF8 support. This breaks the Arrow API because pyarrow expects everything in UTF-8 (per above).

For writes to drivers other than shapefile (where we force UTF-8), we use the encoding parameter passed by the user to encode the field names and string values. This can render them unreadable in other tools that expect UTF-8. I'm not sure we should treat support for OLCStringsAsUTF8 as an indication that the driver expects everything in UTF-8, rather than only an indication that GDAL knows how to convert between a different encoding and UTF-8 (isn't that what you were getting at above?).

Yes, sorry, not sure why I wrote that; it being ignored is indeed not at all the case... It is used, but I think the cases where it leads to something useful are very limited. I'm not 100% sure, but I personally think OLCStringsAsUTF8 is indeed an indication that the driver almost always expects to receive UTF-8 data. The only possible exceptions I see are cases where the user specifies specific GDAL options instructing GDAL not to do any encoding changes (e.g. https://gdal.org/drivers/vector/dxf.html#character-encodings). If data is provided in another encoding and the GDAL driver writes "UTF-8" by default anyway, it won't recode and won't crash on it, as I think is the case in your tests. But if a recoding is needed and the data is not provided to GDAL in UTF-8, I guess this can lead to errors.

If we know that a given driver must be in UTF-8, we could raise an Exception when attempting to write an alternative encoding.

I think this is the case... but as there are so many drivers, I'm not 100% sure, which is why I was thinking about a warning.

Are you saying we should raise a warning during read if OLCStringsAsUTF8 == True and the driver is not "ESRI Shapefile"? i.e., the user knows the data source is in a different encoding, regardless of GDAL's ability to convert to UTF-8. Since `encoding` is opt-in, it seems like that is a deliberate choice and the warning would just nag them about it, right? Maybe I don't follow what you are getting at here.

True, I suppose, about the nagging. I'm afraid that few users will really know what they are doing when specifying `encoding=` for an `OLCStringsAsUTF8 == True` file, but for those who do know, the warning would probably be annoying.

Possibly/probably it is best to just "let it go" and solve any issues if they surface...

brendan-ward commented 2 months ago

I've tried to address some of the discussion in the latest changes:

For the Arrow API (open / read arrow), it will now raise an exception if an alternative encoding is provided and the driver is not shapefile.

We now specifically raise an exception for shapefiles if both the encoding parameter and an "ENCODING"="<whatever>" open / layer creation option are combined, because we set SHAPE_ENCODING or ENCODING on the user's behalf and combining the two leads to order-of-operations issues. Since SHAPE_ENCODING may be set in advance as a configuration option by the user (possibly re-used across many data sources), we don't raise an exception for that; we just silently override it (so there isn't a case where they can set SHAPE_ENCODING="" to disable recoding by GDAL and then pass in an alternative encoding parameter). This seems like advanced territory, so I didn't think it necessary to raise a warning until we start seeing issues.

This does not disable writing to an alternative encoding if provided by the user (hopefully they know the right thing to do for the target driver); I think we can leave this as is until we see issues.

brendan-ward commented 2 months ago

Added a specific test to verify that locale.getpreferredencoding() is used for CSV files when the encoding parameter is not provided (the test only runs on Windows, though).
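
For reference, the fallback that test exercises is simply:

import locale

# On Windows, with no encoding argument, CSV text is expected to use the
# system's preferred encoding rather than UTF-8.
print(locale.getpreferredencoding(False))  # e.g. "cp1252" on many Windows setups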

brendan-ward commented 2 months ago

Added more tests for the Arrow write API, slimmed down the comments and repeated code around encoding in _io.pyx, and fixed an issue that would have bitten us immediately with the forthcoming GDAL 3.9: FlatGeobuf no longer allows writing non-UTF-8 values. We're already likely in unsupported territory with the tests that prove we can write non-UTF-8 values to drivers that likely expect UTF-8, but I left the other drivers in place for now because it proves the encoding works. Alternatively, we could simply block writing non-UTF-8 values for non-shapefile drivers where OLCStringsAsUTF8 is True.

jorisvandenbossche commented 2 months ago

Didn't notice it before (and haven't yet looked in detail at what might be causing it), but merging this caused a bunch of failures when testing the wheels for Ubuntu and older macOS: https://github.com/geopandas/pyogrio/actions/runs/8848749467/job/24299655017