geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

GBK-encoded SHP file read exception #380

Closed stonereese closed 2 months ago

stonereese commented 3 months ago

Windows 10 professional 22H2 19045.4170

pyogrio == 0.7.2 GDAL == 3.8.4 fiona == 1.9.5

When using pyogrio.read_dataframe() to read a shapefile whose .dbf file is GBK-encoded, specifying the parameter encoding='gbk' or encoding='cp936' leads to two exceptional situations (fiona with encoding='cp936' shows no similar issues):

  1. An encoding error will be reported: UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 17: illegal multibyte sequence. Specifying encoding as utf8 allows for normal reading of Chinese characters in field values;
  2. No error is reported, but garbled characters appear in field values.
brendan-ward commented 3 months ago

Thanks for the report. Are you able to share a small sample zip file (all files related to your shp) that reproduces this issue?

Is there a corresponding .cpg file present alongside your shp, and if so, what are the contents of that file?

stonereese commented 3 months ago

> Thanks for the report. Are you able to share a small sample zip file (all files related to your shp) that reproduces this issue?
>
> Is there a corresponding .cpg file present alongside your shp, and if so, what are the contents of that file?

The data mentioned above have no .cpg files. Most SHP data without a .cpg file read without garbled characters; these files are exceptions. Unzip the shapefiles in the attachment (theoretically they are all GBK-encoded). When reading with read_dataframe(): for 01.shp, leaving encoding unset or setting it to 'utf8' produces no garbled characters, while setting it to 'cp936' raises an encoding error. For 02.shp, 'cp936' produces no garbled characters, while 'utf8' does. For 03.shp, both 'cp936' and 'utf8' produce garbled characters.

stonereese commented 3 months ago

> Thanks for the report. Are you able to share a small sample zip file (all files related to your shp) that reproduces this issue?
>
> Is there a corresponding .cpg file present alongside your shp, and if so, what are the contents of that file?

Hello, have you downloaded the zip attachment above? If you have, I will delete it.

brendan-ward commented 3 months ago

I have downloaded them and am able to reproduce your findings; I am currently trying to get to the root of what is going on here.

If possible, please leave the sample files available for a bit longer, so that other maintainers here can use them for testing to either isolate the issue or review a fix when / if identified.

brendan-ward commented 3 months ago

Actually, I'm finding that 03.shp reads fine with cp936 encoding, though I thought it produced garbled characters too when I tried it last night.

I'm finding that Fiona produces the same default behavior as Pyogrio, so I'm wondering if system preferred encoding is different between our systems:

    import locale
    locale.getpreferredencoding()

Results of reading first value of XZQMC / xzqmc attributes:

pyogrio detected encoding: UTF-8
read_dataframe: 01.shp (default): 001街道尚营社区
fiona read: 01.shp (default): 001街道尚营社区
read_dataframe: 01.shp (UTF-8): 001街道尚营社区
read_dataframe: 01.shp (cp936): failed with exception
========================================
pyogrio detected encoding: ISO-8859-1
read_dataframe: 02.shp (default): Ô¬ÀÏׯ´å
fiona read: 02.shp (default): Ô¬ÀÏׯ´å
read_dataframe: 02.shp (UTF-8): failed with exception
read_dataframe: 02.shp (cp936): 袁老庄村
========================================
pyogrio detected encoding: UTF-8
read_dataframe: 03.shp (default): µËÖÝÊÐ
fiona read: 03.shp (default): µËÖÝÊÐ
read_dataframe: 03.shp (UTF-8): µËÖÝÊÐ
read_dataframe: 03.shp (cp936): 碌脣脰脻脢脨

If encoding is not provided, we check with GDAL to see if the dataset supports UTF-8, and otherwise fall back to ISO-8859-1 for shapefiles.
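As a simplified pure-Python sketch of that default (the real check lives in pyogrio's compiled bindings; the function name and the non-shapefile branch are assumptions):

```python
# Hypothetical sketch of pyogrio's default-encoding rule for reads:
# UTF-8 when GDAL reports the layer returns UTF-8 strings, otherwise
# fall back to ISO-8859-1 for shapefiles.
def default_read_encoding(strings_as_utf8: bool, driver: str) -> str:
    if strings_as_utf8:
        return "UTF-8"
    if driver == "ESRI Shapefile":
        return "ISO-8859-1"
    return "UTF-8"  # assumption: behavior for other drivers isn't stated here

print(default_read_encoding(False, "ESRI Shapefile"))  # ISO-8859-1
```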

Here is what GDAL detects at a lower level:

> ogrinfo -json 01.shp 01

...
        "SHAPEFILE":{
          "ENCODING_FROM_LDID":"CP936",
          "LDID_VALUE":"77",
          "SOURCE_ENCODING":"CP936"
        }

> ogrinfo -json 02.shp 02

...
        "SHAPEFILE":{
          "SOURCE_ENCODING":""
        }
...

> ogrinfo -json 03.shp 03

...
        "SHAPEFILE":{
          "ENCODING_FROM_LDID":"ISO-8859-1",
          "LDID_VALUE":"87",
          "SOURCE_ENCODING":"ISO-8859-1"
        }
...

What this means is that the 3 files differ in terms of how GDAL is detecting their encoding from the .dbf files in the absence of the definitive .cpg files.

I'm still trying to trace this through, but it looks like GDAL is automatically decoding from the detected encoding to UTF-8 before we attempt to detect the encoding of the file. This would explain why GDAL is reporting that 01.shp and 03.shp report to us as UTF-8, whereas for 02.shp GDAL does not detect an encoding and thus allows us to specify one directly.

theroggy commented 3 months ago

> I'm still trying to trace this through, but it looks like GDAL is automatically decoding from the detected encoding to UTF-8 before we attempt to detect the encoding of the file. This would explain why GDAL is reporting that 01.shp and 03.shp report to us as UTF-8, whereas for 02.shp GDAL does not detect an encoding and thus allows us to specify one directly.

This is to be expected for shapefiles, as shapefile is an "OLCStringsAsUTF8" format, so we don't do any detecting; this is handled fully by GDAL.

FYI:

    if OGR_L_TestCapability(ogr_layer, OLCStringsAsUTF8):
        # OGR_L_TestCapability returns True for OLCStringsAsUTF8 if GDAL hides encoding
        # complexities for this layer/driver type. In this case all string attribute
        # values have to be supplied in UTF-8 and values will be returned in UTF-8.
        # The encoding used to read/write under the hood depends on the driver used.
        # For layers/drivers where False is returned, the string values are written and
        # read without recoding. Hence, it is up to you to supply the data in the
        # appropriate encoding. More info:
        # https://gdal.org/development/rfc/rfc23_ogr_unicode.html#oftstring-oftstringlist-fields
        return "UTF-8"

theroggy commented 3 months ago

I had a quick look, and it seems that the encoding parameter doesn't do a lot in ogr_read... it should be passed as a dataset open option in ogr_open so GDAL can take it into account, but it isn't... so something seems to be missing.

brendan-ward commented 3 months ago

Ok, I think I understand better what is going on here.

For 01.shp and 03.shp, GDAL auto-detected the native encoding based on information in the .dbf file, since there is no corresponding .cpg file that explicitly states it. For 03.shp, it determines that the native encoding is ISO-8859-1, which is incorrect.

Where GDAL auto-detects the encoding, it automatically decodes the native encoding to UTF-8 and then reports to us that the data are in UTF-8, so by default we do no further decoding. For 01.shp, this mechanism works correctly because the native encoding is detected as cp936.

For 03.shp, GDAL decodes from ISO-8859-1 to UTF-8, which produces an incorrect UTF-8 intermediate that, for the set of characters present, can still be decoded using cp936 without hitting an encoding error, but the resulting text is presumably totally incorrect.

For 02.shp, GDAL is unable to auto-detect the encoding, so it returns the text to us in the native encoding, which we then decode using ISO-8859-1 by default, as is the standard for shapefiles when not otherwise stated, but wrong in this case. Because GDAL does not decode the original text before we read it, this lets the user set the correct encoding explicitly.

There are a couple of ways to sidestep the above issues:

  1. Create a *.cpg file for each shapefile that explicitly states the encoding as cp936 (the only file contents are cp936). GDAL gives preference to .cpg files for encoding and then automatically decodes from the native encoding to UTF-8, which we can return by default. In this case, do not pass encoding="cp936".
  2. Set the SHAPE_ENCODING config option:

from pyogrio import set_gdal_config_options

set_gdal_config_options({"SHAPE_ENCODING": "cp936"})

Note: this then applies to all read operations; I need to check for a dataset / layer read option.
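The first workaround (writing a sidecar .cpg) can be scripted; a minimal sketch, using a hypothetical temp path so it is self-contained:

```python
import tempfile
from pathlib import Path

# Minimal sketch of the .cpg workaround: write a sidecar file next to the
# .shp whose only contents are the encoding name. The shapefile path here
# is hypothetical; with real data, write the .cpg next to your .shp.
with tempfile.TemporaryDirectory() as tmp:
    shp = Path(tmp) / "01.shp"                 # hypothetical shapefile
    cpg = shp.with_suffix(".cpg")
    cpg.write_text("CP936", encoding="ascii")  # the only file contents
    contents = cpg.read_text(encoding="ascii")

print(contents)  # CP936
```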

We may also need to set options when encoding is passed to pyogrio, to prevent GDAL first decoding to UTF-8 before we then try to decode using the user-specified encoding. From the above we don't appear to be doing the right thing.

brendan-ward commented 3 months ago

@theroggy thanks for looking at this too; I'm not terribly familiar with alternative encodings.

I'm starting to wonder if we should not be trying to decode via user-passed encoding option if GDAL reports to us that the encoding is UTF-8, either because that is the native encoding or - in this case - it automatically converted for us.

Like you say, for reading shapefiles, we need to be opting out of GDAL's auto detection when the user passes an encoding, so that we're always decoding from that specified encoding.

Some related bits in Fiona for further investigation: Fiona #516, Fiona #512

brendan-ward commented 3 months ago

It looks like Fiona has more recently removed anything that was directly setting SHAPE_ENCODING; GitHub searches are proving unhelpful there and I'm not finding commits / issues that reference why it was removed.

It looks like we can use the open option ENCODING="" to explicitly disable auto decoding to UTF-8 by GDAL, but the problem is that it is an open option and we don't know the driver of a data source until opening it, and other drivers do not necessarily support ENCODING as an open option and will raise a warning. Given that there are multiple ways a shapefile could be represented, we can't do a simple check for .shp suffix.

stonereese commented 2 months ago

> It looks like Fiona has removed anything that was directly setting SHAPE_ENCODING more recently; github searches are proving unhelpful there and I'm not finding commits / issues that reference why it was removed.
>
> It looks like we can use the open option ENCODING="" to explicitly disable auto decoding to UTF-8 by GDAL, but the problem is that it is an open option and we don't know the driver of a data source until opening it, and other drivers do not necessarily support ENCODING as an open option and will raise a warning. Given that there are multiple ways a shapefile could be represented, we can't do a simple check for .shp suffix.

Thank you so much for your help. When I noticed garbled text in the feedback, I realized I forgot to mention that the shapefile also displayed garbled Chinese text when using the sql parameter in read_dataframe(), regardless of the specified encoding parameter. I believe your fix has also resolved this issue. I am on vacation today, but I will verify it tomorrow.

brendan-ward commented 2 months ago

resolved by #380