duckdb / duckdb_spatial

MIT License

Encoding issue with geospatial data (shapefile) #394

Open vlebert opened 2 months ago

vlebert commented 2 months ago

What happens?

When trying to import a shapefile encoded with CP1252, I get the following error:

InvalidInputException: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update

I tried various options with st_read, but without success.

A current workaround is to first convert the shapefile to GeoParquet with ogr2ogr, and then import the GeoParquet file.

To Reproduce

CREATE TABLE t_adresse AS SELECT * FROM st_read('source/t_adresse.shp');

Note: the shapefile does have a .cpg file specifying the encoding.

Even forcing the encoding fails:

CREATE TABLE t_adresse AS SELECT * FROM st_read('source/t_adresse.shp', open_options=['ENCODING=WINDOWS-1252']);
CREATE TABLE t_adresse AS SELECT * FROM st_read('source/t_adresse.shp', open_options=['ENCODING=CP-1252']);
CREATE TABLE t_adresse AS SELECT * FROM st_read('source/t_adresse.shp', open_options=['ENCODING=CP1252']);

However, SELECT * FROM ST_READ('source/t_adresse.shp') does not give an error (in Python).

Current workaround:

ogr2ogr -f parquet t_adresse.parquet source/t_adresse.shp

CREATE TABLE t_adresse AS SELECT * from 't_adresse.parquet'

OS:

MacOS

DuckDB Version:

1.1

DuckDB Client:

Python

Hardware:

No response

Full Name:

Valérian LEBERT

Affiliation:

Digi-Studio

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - I cannot easily share my data sets due to their large size

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

Maxxen commented 2 months ago

Hi! Thanks for opening this issue! Unfortunately, DuckDB does not support encodings other than UTF-8, and even though st_read uses GDAL under the hood, I think the issue is that we don't bundle the (optional) iconv library that gives GDAL the capability to re-encode text.

We are trying to reduce the number of dependencies in the spatial extension, so it is unlikely this use case will ever be supported.
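To illustrate the underlying problem, here is a minimal stdlib-only Python sketch (not part of DuckDB or GDAL; the field value is hypothetical). CP1252 bytes for accented characters are not valid UTF-8, which is why DuckDB's segment statistics check rejects them; re-encoding, as iconv would do for GDAL, fixes the bytes:

```python
# Hypothetical CP1252-encoded field value, e.g. from a .dbf attribute
raw = "adresse complète".encode("cp1252")  # b'adresse compl\xe8te'

# DuckDB expects UTF-8; the CP1252 byte 0xE8 ("è") is not a valid
# UTF-8 sequence, so decoding it as UTF-8 raises an error
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err)

# What iconv (or GDAL built with iconv support) effectively does:
text = raw.decode("cp1252")   # interpret the bytes as CP1252
utf8 = text.encode("utf-8")   # re-encode as valid UTF-8 bytes
print(utf8.decode("utf-8"))   # round-trips cleanly now
```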

vlebert commented 2 months ago

That's bad news, as there are many shapefiles with exotic encodings out in the wild :)

Could we at least get garbage text fields instead of a fatal error?

In many cases, the accented characters may be located in columns that are not even used in the dataflow.

In other tools, encoding is often an issue, but it does not cause a critical error.

Maxxen commented 2 months ago

So spatial has its own experimental shapefile reader, st_readshp, where you should be able to pass an extra encoding := 'blob' optional argument, which will read any string fields as DuckDB BLOBs that you can then decode() into VARCHAR if they are valid UTF-8.

vlebert commented 2 months ago

Using the experimental st_readshp (without the extra argument), I could load the dataset. All text fields are loaded as BLOB.
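When the string fields come back as BLOBs, they can be re-encoded client-side. A sketch, assuming the BLOBs arrive in Python as bytes objects holding the original CP1252 bytes (the value below is hypothetical):

```python
# Hypothetical BLOB value for a CP1252-encoded string field;
# 0xE9 is "é" in CP1252 but not a valid standalone UTF-8 byte
blob = b"Val\xe9rian"

# Interpret the raw bytes as CP1252; the resulting Python str can be
# passed back to DuckDB, which stores str values as UTF-8
value = blob.decode("cp1252")
print(value)
```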

rouault commented 2 months ago

i think the issue is that we dont bundle the (optional) iconv library that gives GDAL the capability to re-encode text.

https://github.com/OSGeo/gdal/pull/10799 should improve that situation.