geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

Add PYOGRIO_USE_ARROW environment variable to enable Arrow usage globally #302

Closed jorisvandenbossche closed 9 months ago

jorisvandenbossche commented 9 months ago

Closes https://github.com/geopandas/pyogrio/issues/296

theroggy commented 9 months ago

I suppose it would be useful to document this?
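
As a rough illustration of what such documentation might show, here is a minimal sketch of an environment variable acting as a global default for a use_arrow keyword. The helper name resolve_use_arrow, its environ parameter, and the rule that only the value "1" enables Arrow are assumptions made for this example, not pyogrio's actual implementation; only the variable name PYOGRIO_USE_ARROW comes from this PR.

```python
import os


def resolve_use_arrow(use_arrow=None, environ=None):
    """Hypothetical helper: decide whether Arrow should be used.

    An explicit True/False passed by the caller always wins. Only when
    use_arrow is None do we fall back to the PYOGRIO_USE_ARROW variable.
    """
    if use_arrow is not None:
        return bool(use_arrow)
    env = os.environ if environ is None else environ
    # Assumption for this sketch: "1" enables Arrow, anything else does not.
    return env.get("PYOGRIO_USE_ARROW", "0").strip() == "1"


if __name__ == "__main__":
    print(resolve_use_arrow())                  # depends on the environment
    print(resolve_use_arrow(use_arrow=False))   # explicit argument wins
```

With a resolution order like this, a call that passes use_arrow explicitly behaves the same regardless of the environment, so the variable only changes calls that leave the keyword unset.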

jorisvandenbossche commented 9 months ago

Do you think we should also add a CI runner that has this environment variable set? It could be allowed to fail in the short term, while we're still finding places where use_arrow produces different behavior or is not yet supported for all options...

I quickly tried that locally, and this turned up https://github.com/OSGeo/gdal/issues/8509. When skipping test_write_empty_dataframe for FlatGeobuf, I get the following failures at the moment:

FAILED pyogrio/tests/test_arrow.py::test_enable_with_environment_variable - AssertionError: assert 'list_int64' not in Index(['int64', 'list_int64', 'geometry'], dtype='object')
FAILED pyogrio/tests/test_geopandas_io.py::test_read_dataframe_vsi - pyarrow.lib.ArrowException: Unknown error: Wrapping C�te d'Ivoire failed
FAILED pyogrio/tests/test_geopandas_io.py::test_read_force_2d - ValueError: forcing 2D is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_where_invalid[.gpkg] - OSError
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids[.fgb] - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids[.geojson] - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids[.geojsonl] - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids[.gpkg] - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids[.shp] - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_fids_force_2d - ValueError: reading by FID is not supported for Arrow
FAILED pyogrio/tests/test_geopandas_io.py::test_read_multisurface - shapely.errors.GEOSException: ParseException: Unknown WKB type 12
FAILED pyogrio/tests/test_geopandas_io.py::test_write_nullable_dtypes - AssertionError: Attributes of GeoDataFrame.iloc[:, 3] (column name="col4") are different
FAILED pyogrio/tests/test_path.py::test_vsi_handling_read_dataframe - pyarrow.lib.ArrowException: Unknown error: Wrapping C�te d'Ivoire failed
FAILED pyogrio/tests/test_path.py::test_zip_path_dataframe - pyarrow.lib.ArrowException: Unknown error: Wrapping C�te d'Ivoire failed

Most of these failures occur because forcing 2D and reading by FID are not yet supported with Arrow (there are PRs for both). Three errors are a conversion error for "C�te d'Ivoire": that column contains invalid UTF-8, so the conversion from Arrow to pandas fails:

In [29]: meta, table = pyogrio.raw.read_arrow("naturalearth_lowres.shp.zip")

In [30]: arr = table["name"].chunk(0)

In [31]: arr
Out[31]: 
<pyarrow.lib.StringArray object at 0x7f11a2f9e0e0>
[
  "Fiji",
  "Tanzania",
  "W. Sahara",
  "Canada",
  "United States of America",
  "Kazakhstan",
  "Uzbekistan",
  "Papua New Guinea",
  "Indonesia",
  "Argentina",
  ...
  "Somaliland",
  "Uganda",
  "Rwanda",
  "Bosnia and Herz.",
  "Macedonia",
  "Serbia",
  "Montenegro",
  "Kosovo",
  "Trinidad and Tobago",
  "S. Sudan"
]

In [32]: arr.validate(full=True)
...
ArrowInvalid: Invalid UTF8 sequence at string index 60

In [33]: arr.to_pandas()
...
ArrowException: Unknown error: Wrapping C�te d'Ivoire failed

This does not happen when reading the non-zipped shapefile included in our fixture data (we zip it on the fly in conftest.py for testing). So it needs some investigation to determine whether the issue is on our side, in how we compress the file, or on the GDAL side, in reading from a compressed shapefile. Without Arrow, the data read from the zip looks fine (I see "Côte d'Ivoire" in the output), which might point to an issue on the GDAL side.
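
To illustrate why the name shows up with a replacement character, here is a minimal standard-library sketch. It assumes, which is not confirmed anywhere in this thread, that the value is stored in a single-byte encoding such as ISO-8859-1 and is then handed on as if it were UTF-8.

```python
# Assumption: the attribute value is stored as ISO-8859-1 (latin-1) bytes.
raw = "Côte d'Ivoire".encode("latin-1")  # b"C\xf4te d'Ivoire"

# Strict UTF-8 decoding rejects the lone 0xF4 byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy decoding substitutes U+FFFD for the invalid byte.
lossy = raw.decode("utf-8", errors="replace")

# Decoding with the assumed source encoding recovers the name.
recovered = raw.decode("latin-1")

if __name__ == "__main__":
    print(strict_ok, lossy, recovered)
```

Under that assumption, a reader that recodes the bytes first would show the correct name, while one that passes the raw bytes through as UTF-8 would fail validation.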

jorisvandenbossche commented 9 months ago

It might be something with Python's ZipFile implementation. If I compress the shapefile manually (using Ubuntu's "Files" -> right click -> Compress, or the zip CLI), the compressed file reads fine with use_arrow=True.

When compressing with Python, it fails:

from pathlib import Path
from zipfile import ZipFile, ZIP_DEFLATED

naturalearth_lowres = Path("repos/pyogrio/pyogrio/tests/fixtures/naturalearth_lowres/naturalearth_lowres.shp")

path = f"{naturalearth_lowres.name}.zip"
with ZipFile(path, mode="w") as out:
    for ext in ["dbf", "prj", "shp", "shx"]:
        filename = f"{naturalearth_lowres.stem}.{ext}"
        out.write(naturalearth_lowres.parent / filename, filename)

In [63]: pyogrio.read_dataframe('naturalearth_lowres.shp.zip', use_arrow=False)
Out[63]: 
       pop_est      continent                      name iso_a3  gdp_md_est                                           geometry
0       920938        Oceania                      Fiji    FJI      8374.0  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1     53950935         Africa                  Tanzania    TZA    150600.0  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
..         ...            ...                       ...    ...         ...                                                ...
175    1218208  North America       Trinidad and Tobago    TTO     43570.0  POLYGON ((-61.68000 10.76000, -61.10500 10.890...
176   13026129         Africa                  S. Sudan    SSD     20880.0  POLYGON ((30.83385 3.50917, 29.95350 4.17370, ...

[177 rows x 6 columns]

In [64]: pyogrio.read_dataframe('naturalearth_lowres.shp.zip', use_arrow=True)
...
ArrowException: Unknown error: Wrapping C�te d'Ivoire failed
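
One way to narrow this down could be to compare the two archives directly. The loop above writes only the .dbf, .prj, .shp, and .shx members and passes no compression argument to ZipFile (so members are stored uncompressed), whereas an archive made with a desktop tool may contain other members or use a different compression method. The sketch below uses only the standard library; describe_zip and diff_zips are hypothetical helper names, and the file names in the usage comment are placeholders.

```python
from zipfile import ZipFile


def describe_zip(path):
    """Map each member name to (compress_type, file_size, CRC)."""
    with ZipFile(path) as zf:
        return {
            info.filename: (info.compress_type, info.file_size, info.CRC)
            for info in zf.infolist()
        }


def diff_zips(path_a, path_b):
    """Report members unique to each archive and members that differ."""
    a, b = describe_zip(path_a), describe_zip(path_b)
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "differing": sorted(n for n in a.keys() & b.keys() if a[n] != b[n]),
    }


# Usage (placeholder file names):
# print(diff_zips("python_made.shp.zip", "manually_made.shp.zip"))
```

A member that appears in only one archive, or a differing compress_type, would be a natural first thing to look at before digging into GDAL itself.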