Closed: stonereese closed this issue 2 months ago
Thanks for the report. Are you able to share a small sample zip file (all files related to your shp) that reproduces this issue? Is there a corresponding .cpg file present alongside your shp, and if so, what are the contents of that file?
The data mentioned above do not have .cpg files. Most of my SHP data without a .cpg file do not produce garbled characters; these few SHP files are the exceptions. Unzip the shp data in the attachment (in theory they are all encoded in GBK). When reading with read_dataframe():

- For 01.shp, leaving the encoding parameter unset or setting it to 'utf8' produces no garbled characters; setting encoding='cp936' raises an encoding error.
- For 02.shp, setting encoding='cp936' produces no garbled characters, while 'utf8' does.
- For 03.shp, both 'cp936' and 'utf8' produce garbled characters.
Hello, have you downloaded the zip attachment above? If you have, I will delete it.
I have downloaded it, and I am able to reproduce your findings; I am currently trying to get to the root of what is going on here.
If possible, please leave the sample files available for a bit longer, so that other maintainers here can use them for testing to either isolate the issue or review a fix when / if identified.
Actually, I'm finding that 03.shp reads fine with cp936 encoding, though I thought it produced garbled characters too when I tried it last night.
I'm finding that Fiona produces the same default behavior as Pyogrio, so I'm wondering if system preferred encoding is different between our systems:
```python
import locale
locale.getpreferredencoding()
```
Results of reading the first value of the XZQMC / xzqmc attributes:
```
pyogrio detected encoding: UTF-8
read_dataframe: 01.shp (default): 001街道尚营社区
fiona read:     01.shp (default): 001街道尚营社区
read_dataframe: 01.shp (UTF-8):   001街道尚营社区
read_dataframe: 01.shp (cp936):   failed with exception
========================================
pyogrio detected encoding: ISO-8859-1
read_dataframe: 02.shp (default): Ô¬ÀÏׯ´å
fiona read:     02.shp (default): Ô¬ÀÏׯ´å
read_dataframe: 02.shp (UTF-8):   failed with exception
read_dataframe: 02.shp (cp936):   袁老庄村
========================================
pyogrio detected encoding: UTF-8
read_dataframe: 03.shp (default): µËÖÝÊÐ
fiona read:     03.shp (default): µËÖÝÊÐ
read_dataframe: 03.shp (UTF-8):   µËÖÝÊÐ
read_dataframe: 03.shp (cp936):   碌脣脰脻脢脨
```
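The garbled 03.shp values above can be reproduced with plain codec round-trips. This is a sketch, assuming the original attribute text is 邓州市 stored as GBK bytes in the .dbf; it is not pyogrio code, just the byte-level arithmetic behind the output:

```python
# Original text, as stored in the .dbf in GBK (cp936) bytes.
original = "邓州市"
gbk_bytes = original.encode("gbk")

# GDAL mis-detects the encoding as ISO-8859-1 and "decodes" to UTF-8,
# producing mojibake instead of the real text.
mojibake = gbk_bytes.decode("iso-8859-1")
print(mojibake)  # µËÖÝÊÐ — matches the default / UTF-8 results above

# Asking for cp936 on top of that re-interprets the UTF-8 bytes of the
# mojibake as GBK, garbling the text a second time.
double_mojibake = mojibake.encode("utf-8").decode("gbk")
print(double_mojibake)  # 碌脣脰脻脢脨 — matches the cp936 result above
```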
If encoding is not provided, we check with GDAL to see if the dataset supports UTF-8, and otherwise fall back to ISO-8859-1 for shapefiles.
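That default can be sketched as a tiny predicate (a hypothetical standalone function for illustration; the real check lives in pyogrio's Cython layer):

```python
def default_shapefile_encoding(gdal_returns_utf8: bool) -> str:
    # If GDAL reports the layer returns strings as UTF-8 (OLCStringsAsUTF8),
    # trust it; otherwise fall back to ISO-8859-1, the shapefile convention.
    return "UTF-8" if gdal_returns_utf8 else "ISO-8859-1"

print(default_shapefile_encoding(True))   # UTF-8
print(default_shapefile_encoding(False))  # ISO-8859-1
```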
Here is what GDAL detects at a lower level:
```
> ogrinfo -json 01.shp 01
...
"SHAPEFILE":{
  "ENCODING_FROM_LDID":"CP936",
  "LDID_VALUE":"77",
  "SOURCE_ENCODING":"CP936"
}
...
> ogrinfo -json 02.shp 02
...
"SHAPEFILE":{
  "SOURCE_ENCODING":""
}
...
> ogrinfo -json 03.shp 03
...
"SHAPEFILE":{
  "ENCODING_FROM_LDID":"ISO-8859-1",
  "LDID_VALUE":"87",
  "SOURCE_ENCODING":"ISO-8859-1"
}
...
```
What this means is that the 3 files differ in terms of how GDAL is detecting their encoding from the .dbf files in the absence of the definitive .cpg files.
I'm still trying to trace this through, but it looks like GDAL is automatically decoding from the detected encoding to UTF-8 before we attempt to detect the encoding of the file. This would explain why 01.shp and 03.shp report to us as UTF-8, whereas for 02.shp GDAL does not detect an encoding and thus allows us to specify one directly.
This is to be expected for shapefiles, as shapefile is an "OLCStringsAsUTF8" format. So we don't do any detecting... this is handled fully by GDAL.
FYI:
```python
if OGR_L_TestCapability(ogr_layer, OLCStringsAsUTF8):
    # OGR_L_TestCapability returns True for OLCStringsAsUTF8 if GDAL hides encoding
    # complexities for this layer/driver type. In this case all string attribute
    # values have to be supplied in UTF-8 and values will be returned in UTF-8.
    # The encoding used to read/write under the hood depends on the driver used.
    # For layers/drivers where False is returned, the string values are written and
    # read without recoding. Hence, it is up to you to supply the data in the
    # appropriate encoding. More info:
    # https://gdal.org/development/rfc/rfc23_ogr_unicode.html#oftstring-oftstringlist-fields
    return "UTF-8"
```
I had a quick look, and it seems that the encoding parameter doesn't do a lot in ogr_read... it should be passed as a dataset open option in ogr_open so GDAL can take it into account, but it isn't... so it seems there is something missing.
Ok, I think I understand better what is going on here.
For 01.shp and 03.shp, GDAL auto-detected the native encoding based on information in the .dbf file, since there is no corresponding .cpg file that explicitly states the native encoding. For 03.shp, it determines the native encoding is ISO-8859-1, which is not correct.

Where GDAL auto-detects the encoding, it automatically decodes the native encoding to UTF-8 and then reports to us that the data are in UTF-8, so by default we do not do further decoding. For 01.shp, this mechanism works correctly because the native encoding is detected as cp936.

For 03.shp, GDAL decodes from ISO-8859-1 to UTF-8, which produces an incorrect intermediate in UTF-8, which for the set of characters present appears to allow us to then decode to cp936 without hitting an encoding error, but presumably the resulting text is totally incorrect.

For 02.shp, GDAL is unable to auto-detect the encoding, so it returns us the text in the native encoding, which we then by default decode as ISO-8859-1, as is the standard for shapefiles when not otherwise stated, but wrong in this case. But because the original text is not decoded by GDAL before we read it, this allows us to specifically set the correct encoding when passed by the user.
There are a couple of ways to sidestep the above issues:

1) create a *.cpg file for each shapefile that explicitly states the encoding as cp936 (the only file contents are cp936). GDAL gives preference to .cpg files for encoding, and then automatically decodes from the native encoding to UTF-8 that we can then return by default. In this case, do not pass in encoding="cp936".
2) set the SHAPE_ENCODING config option:

```python
from pyogrio import set_gdal_config_options

set_gdal_config_options({"SHAPE_ENCODING": "cp936"})
```
Note: this then applies to all read operations; I need to check for a dataset / layer read option.
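Option 1 above is a one-liner. A sketch, where the sidecar must share the shapefile's base name (01.cpg here is an assumed path):

```python
from pathlib import Path

# Write a .cpg sidecar declaring the .dbf's native encoding.
# GDAL prefers this over LDID auto-detection, so attribute values
# are decoded from cp936 to UTF-8 automatically on read.
Path("01.cpg").write_text("cp936")
```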
We may also need to set options when encoding is passed to pyogrio, to prevent GDAL first decoding to UTF-8 before we then try to decode using the user-specified encoding. From the above we don't appear to be doing the right thing.
@theroggy thanks for looking at this too; I'm not terribly familiar with alternative encodings.
I'm starting to wonder if we should not be trying to decode via the user-passed encoding option if GDAL reports to us that the encoding is UTF-8, either because that is the native encoding or - in this case - it automatically converted for us.

Like you say, for reading shapefiles, we need to be opting out of GDAL's auto detection when the user passes an encoding, so that we're always decoding from that specified encoding.
Some related bits in Fiona for further investigation: Fiona #516, Fiona #512
It looks like Fiona has removed anything that was directly setting SHAPE_ENCODING more recently; GitHub searches are proving unhelpful there and I'm not finding commits / issues that reference why it was removed.

It looks like we can use the open option ENCODING="" to explicitly disable auto decoding to UTF-8 by GDAL, but the problem is that it is an open option, and we don't know the driver of a data source until opening it; other drivers do not necessarily support ENCODING as an open option and will raise a warning. Given that there are multiple ways a shapefile could be represented, we can't do a simple check for the .shp suffix.
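One way around the unknown-driver problem is to restrict the option to the shapefile driver once it is known. A sketch with a hypothetical helper (this is not pyogrio API):

```python
from typing import Optional

def encoding_open_options(driver: str, encoding: Optional[str]) -> dict:
    # Hypothetical helper: only the "ESRI Shapefile" driver accepts the
    # ENCODING open option; passing it to other drivers triggers a GDAL
    # warning, so return no options for them.
    if encoding is not None and driver == "ESRI Shapefile":
        # ENCODING="" disables GDAL's auto-decode to UTF-8 entirely;
        # a concrete value like "cp936" makes GDAL decode from that codepage.
        return {"ENCODING": encoding}
    return {}
```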
Thank you so much for your help. When I noticed garbled text in the feedback, I realized that I forgot to mention that the shp file also displayed garbled Chinese text when using the sql parameter in read_dataframe(), regardless of the specified encoding parameter. I believe your fix has also resolved this issue. I am currently on vacation today, but I will verify it tomorrow.
resolved by #380
Windows 10 Professional 22H2 19045.4170
pyogrio == 0.7.2, GDAL == 3.8.4, fiona == 1.9.5
When using pyogrio.read_dataframe() to read a shp file, if the encoding of the .dbf file is GBK, I specify the parameter encoding='gbk' or encoding='cp936'. I have encountered two exceptional situations (when specifying encoding='cp936' in fiona, there are no similar issues):