Open vlebert opened 2 months ago
Hi! Thanks for opening this issue! Unfortunately, DuckDB does not support encodings other than UTF-8, and even though st_read uses GDAL under the hood, I think the issue is that we don't bundle the (optional) iconv library that gives GDAL the capability to re-encode text.
We are trying to reduce the number of dependencies in the spatial extension, so it is unlikely this use case will ever be supported.
That's bad news, as there are many shapefiles with exotic encodings in the wild :)
Could we at least have garbage text fields instead of a fatal error?
In many cases, the accented characters are located in columns that are not even used in the dataflow.
Currently
In other tools, encoding is often an issue, but it does not cause a critical error.
So spatial has its own experimental shapefile reader, st_readshp, where you should be able to pass an extra encoding := 'blob' optional argument, which will read any string fields as DuckDB BLOBs, which you can then decode() into VARCHAR if they are valid UTF-8.
Using the experimental st_readshp (without the extra argument), I could load the dataset. All text fields are loaded as BLOBs.
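Once the text fields arrive as raw bytes, they can be re-decoded on the client side. The following is a minimal sketch using only the Python standard library, assuming that BLOB values reach Python as bytes objects and that the source encoding is CP1252 as reported in this issue. The helper name decode_field and the sample rows are hypothetical and only illustrate the idea.

```python
from typing import Optional


def decode_field(raw: Optional[bytes], fallback: str = "cp1252") -> Optional[str]:
    """Decode one raw text field, trying UTF-8 first and then a fallback codec."""
    if raw is None:
        return None
    try:
        # Fields that are already valid UTF-8 (including plain ASCII) pass through.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume the legacy encoding; bytes that the fallback codec
        # cannot map are replaced with U+FFFD instead of raising an error.
        return raw.decode(fallback, errors="replace")


# Hypothetical rows standing in for (id, name) tuples with a BLOB name column.
rows = [(1, b"Rue de l'\xc9glise"), (2, b"Avenue Foch"), (3, None)]
decoded = [(row_id, decode_field(name)) for row_id, name in rows]

for row in decoded:
    print(row)
```

Trying UTF-8 first is a heuristic: a CP1252 byte sequence that happens to be valid UTF-8 would be misread, so passing the known encoding directly is safer when it is available (for example from the .cpg file).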
I think the issue is that we don't bundle the (optional) iconv library that gives GDAL the capability to re-encode text.
https://github.com/OSGeo/gdal/pull/10799 should improve that situation
What happens?
When trying to import a shapefile encoded with CP1252, I get the following error:
InvalidInputException: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
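For context, here is a minimal sketch of why CP1252 text fails a UTF-8 validity check. It uses only the Python standard library and does not touch DuckDB; the sample string and the helper is_valid_utf8 are illustrative assumptions, not part of the original report.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string can be decoded as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "Valérian"

# CP1252 stores "é" as the single byte 0xE9.
cp1252_bytes = sample.encode("cp1252")
# UTF-8 stores "é" as the two bytes 0xC3 0xA9.
utf8_bytes = sample.encode("utf-8")

# In UTF-8, 0xE9 announces a three-byte sequence, but the following byte
# ("r") is not a continuation byte, so the CP1252 bytes are rejected.
print(cp1252_bytes, is_valid_utf8(cp1252_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```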
I tried various options in st_read, but without success.
A current workaround is to first convert the shapefile to GeoParquet with ogr2ogr and then import the GeoParquet file.
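The exact command used for this workaround is not shown in the thread. As a rough sketch, the conversion could be scripted from Python as follows, assuming a GDAL build that includes the Parquet driver. The helper build_ogr2ogr_command and the output path are hypothetical; the -oo ENCODING open option of the shapefile driver is passed explicitly here, although GDAL normally picks the encoding up from the .cpg file on its own.

```python
from typing import List


def build_ogr2ogr_command(src: str, dst: str, encoding: str = "CP1252") -> List[str]:
    """Build an ogr2ogr call that converts a shapefile to (Geo)Parquet."""
    return [
        "ogr2ogr",
        "-f", "Parquet",                # output format
        "-oo", f"ENCODING={encoding}",  # source text encoding (shapefile open option)
        dst,                            # ogr2ogr takes the destination first
        src,                            # then the source dataset
    ]


# The output file name is an assumption chosen for illustration.
cmd = build_ogr2ogr_command("source/t_adresse.shp", "source/t_adresse.parquet")
print(" ".join(cmd))

# To run the conversion for real (requires ogr2ogr on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Because GDAL re-encodes the attribute text to UTF-8 during the conversion, the resulting file no longer contains the byte sequences that trigger the error above.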
To Reproduce
Note: the shapefile does have a .cpg file providing the encoding.
Even forcing the encoding fails:
However, SELECT * FROM ST_READ('source/t_adresse.shp') does not give an error (in Python).
Current workaround:
OS:
MacOS
DuckDB Version:
1.1
DuckDB Client:
Python
Hardware:
No response
Full Name:
Valérian LEBERT
Affiliation:
Digi-Studio
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
No - I cannot easily share my data sets due to their large size
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?