Closed by jorisvandenbossche 9 months ago
I suppose it would be useful to document this?
Do you think we should also add a CI runner that has this environment variable set? It could be allowed to fail in the short term, while we're still finding places where use_arrow produces different behavior or is not yet supported for all options...
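For reference, here is a minimal sketch of how an environment-variable default could interact with an explicit use_arrow keyword, which is the behavior such a CI runner would exercise. The variable name PYOGRIO_USE_ARROW, the "0"/"1" convention, and both helper functions are assumptions for illustration, not pyogrio's actual implementation.

```python
import os

# Assumption: the variable name and the "0"/"1" convention are illustrative;
# check pyogrio's own configuration code for the exact behavior.
ENV_VAR = "PYOGRIO_USE_ARROW"


def arrow_enabled(environ=os.environ):
    """Return True when the environment opts in to the Arrow read path."""
    value = environ.get(ENV_VAR, "0").strip()
    return value not in ("", "0")


def resolve_use_arrow(use_arrow=None, environ=os.environ):
    """An explicit keyword wins; otherwise fall back to the environment."""
    if use_arrow is not None:
        return bool(use_arrow)
    return arrow_enabled(environ)
```

With this shape, tests that pass use_arrow=False explicitly would keep their behavior on the new runner, and only tests relying on the default would switch to the Arrow path.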
I quickly tried that locally, and this turned up https://github.com/OSGeo/gdal/issues/8509. When skipping test_write_empty_dataframe for FlatGeobuf, I get the following failures at the moment:
FAILED pyogrio/tests/test_arrow.py::test_enable_with_environment_variable - AssertionError: assert 'list_int64' not in Index(['int64', 'list_int64', 'geometry'], dtype='object')
FAILED pyogrio/tests/test_geopandas_io.py::test_read_dataframe_vsi - pyarrow.lib.ArrowException: Unknown error: Wrapping C�te d'Ivoire failed
FAILED pyogrio/tests/test_geopandas_io.py::test_read_force_2d - ValueError: forcing 2D is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_where_invalid[.gpkg] - OSError
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids[.fgb] - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids[.geojson] - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids[.geojsonl] - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids[.gpkg] - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids[.shp] - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids_force_2d - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_multisurface - shapely.errors.GEOSException: ParseException: Unknown WKB type 12
FAILED pyogrio/tests/test_geopandas_io.py::test_write_nullable_dtypes - AssertionError: Attributes of GeoDataFrame.iloc[:, 3] (column name="col4") are different
FAILED pyogrio/tests/test_path.py::test_vsi_handling_read_dataframe - pyarrow.lib.ArrowException: Unknown error: Wrapping C�te d'Ivoire failed
FAILED pyogrio/tests/test_path.py::test_zip_path_dataframe - pyarrow.lib.ArrowException: Unknown error: Wrapping C�te d'Ivoire failed
So most of them are about forcing 2D or reading by FID not yet being supported (there are PRs for that). Three errors are a conversion error for "C�te d'Ivoire": that column contains invalid UTF-8, so the conversion from Arrow to pandas fails:
In [29]: meta, table = pyogrio.raw.read_arrow("naturalearth_lowres.shp.zip")
In [30]: arr = table["name"].chunk(0)
In [31]: arr
Out[31]:
<pyarrow.lib.StringArray object at 0x7f11a2f9e0e0>
[
"Fiji",
"Tanzania",
"W. Sahara",
"Canada",
"United States of America",
"Kazakhstan",
"Uzbekistan",
"Papua New Guinea",
"Indonesia",
"Argentina",
...
"Somaliland",
"Uganda",
"Rwanda",
"Bosnia and Herz.",
"Macedonia",
"Serbia",
"Montenegro",
"Kosovo",
"Trinidad and Tobago",
"S. Sudan"
]
In [32]: arr.validate(full=True)
...
ArrowInvalid: Invalid UTF8 sequence at string index 60
In [33]: arr.to_pandas()
...
ArrowException: Unknown error: Wrapping C�te d'Ivoire failed
This does not happen when reading the non-zipped shapefile that is included in our fixture data (we zip it on the fly in conftest.py for testing). So it needs some investigation to determine whether the issue is on our side, in how we compress the file, or on the GDAL side, in reading from a compressed shapefile. Given that the data read without Arrow looks fine (I see "Côte d'Ivoire" in the output), it might be an issue on the GDAL side.
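As a side note, the shape of that error message can be reproduced in plain Python without GDAL or pyarrow. The sketch below assumes the name is stored as Latin-1 (ISO-8859-1) bytes; that encoding is a guess for illustration and is not confirmed anywhere in this thread.

```python
# Assumption: the name is stored as Latin-1 (ISO-8859-1) bytes; this is a
# guess used for illustration, not something confirmed in this thread.
raw = "Côte d'Ivoire".encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(raw))                     # False
print(raw.decode("utf-8", errors="replace"))  # C�te d'Ivoire
print(raw.decode("latin-1"))                  # Côte d'Ivoire
```

If the bytes really are Latin-1, a reader that knows the encoding recodes them correctly, while one that passes them through as UTF-8 produces exactly the invalid string seen above.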
It might be something with the Python ZipFile implementation. If I manually compress the shapefiles (using Ubuntu's "Files" -> right click -> Compress, or with the zip CLI), the compressed file reads fine with use_arrow=True.
When compressing with Python, however, it fails:
from pathlib import Path
from zipfile import ZipFile, ZIP_DEFLATED

naturalearth_lowres = Path("repos/pyogrio/pyogrio/tests/fixtures/naturalearth_lowres/naturalearth_lowres.shp")
path = f"{naturalearth_lowres.name}.zip"

with ZipFile(path, mode="w") as out:
    # Add each shapefile component to the archive under its bare file name
    for ext in ["dbf", "prj", "shp", "shx"]:
        filename = f"{naturalearth_lowres.stem}.{ext}"
        out.write(naturalearth_lowres.parent / filename, filename)
In [63]: pyogrio.read_dataframe('naturalearth_lowres.shp.zip', use_arrow=False)
Out[63]:
       pop_est      continent                 name iso_a3  gdp_md_est                                           geometry
0       920938        Oceania                 Fiji    FJI      8374.0  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1     53950935         Africa             Tanzania    TZA    150600.0  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
..         ...            ...                  ...    ...         ...                                                ...
175    1218208  North America  Trinidad and Tobago    TTO     43570.0  POLYGON ((-61.68000 10.76000, -61.10500 10.890...
176   13026129         Africa             S. Sudan    SSD     20880.0  POLYGON ((30.83385 3.50917, 29.95350 4.17370, ...

[177 rows x 6 columns]
In [64]: pyogrio.read_dataframe('naturalearth_lowres.shp.zip', use_arrow=True)
...
ArrowException: Unknown error: Wrapping C�te d'Ivoire failed
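To narrow down what differs between the Python-written archive and one made with the zip CLI, it may help to compare the member lists and compression methods of both files. The following is a sketch using only the standard library; describe_archive is a hypothetical helper, and the demo uses a placeholder file rather than the real fixture data. Note that ZipFile(path, mode="w") without a compression argument stores members uncompressed (ZIP_STORED), so the snippet above does not deflate anything even though it imports ZIP_DEFLATED. Whether that difference, or a difference in which files end up in the archive, is related to the failure is not established here.

```python
import tempfile
import zipfile
from pathlib import Path


def describe_archive(path):
    """Return {member name: compression method name} for a zip archive."""
    methods = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflated"}
    with zipfile.ZipFile(path) as archive:
        return {
            info.filename: methods.get(info.compress_type, str(info.compress_type))
            for info in archive.infolist()
        }


# Demo with a hypothetical placeholder file, not the real fixture data.
with tempfile.TemporaryDirectory() as tmp:
    member = Path(tmp) / "example.dbf"
    member.write_bytes(b"placeholder" * 100)

    # Same call shape as the snippet above: no compression argument.
    default_zip = Path(tmp) / "default.zip"
    with zipfile.ZipFile(default_zip, mode="w") as out:
        out.write(member, member.name)

    # Explicitly deflated, for comparison.
    deflated_zip = Path(tmp) / "deflated.zip"
    with zipfile.ZipFile(deflated_zip, mode="w", compression=zipfile.ZIP_DEFLATED) as out:
        out.write(member, member.name)

    default_members = describe_archive(default_zip)
    deflated_members = describe_archive(deflated_zip)

print(default_members)   # {'example.dbf': 'stored'}
print(deflated_members)  # {'example.dbf': 'deflated'}
```

Running describe_archive on both the failing and the working naturalearth_lowres.shp.zip would show at a glance whether they contain the same files and use the same compression method.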
Closes https://github.com/geopandas/pyogrio/issues/296