geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

BUG: reading with arrow returns empty GeoDataFrame when column referenced in where parameter is not included in results #388

Closed: brendan-ward closed this issue 2 months ago

brendan-ward commented 2 months ago

Observed with GDAL 3.8.3 on macOS

from pyogrio import read_dataframe

filename = "pyogrio/tests/fixtures/naturalearth_lowres/naturalearth_lowres.shp"
df = read_dataframe(
    filename, where=""" "iso_a3" = 'CAN' """, use_arrow=True, columns=[]
)

yields

Empty GeoDataFrame
Columns: [geometry]
Index: []

when it should have one record.

Unclear if this is an error on our side or in GDAL.

brendan-ward commented 2 months ago

Reported to GDAL #9655

brendan-ward commented 2 months ago

Per further tests in GDAL #9655, the GDAL Python bindings do not give the same results as we get here when the Arrow API is not used. They return 0 features without the Arrow API, the same as with the Arrow API.

In contrast here:

df = read_dataframe(
    filename, where=""" "iso_a3" = 'CAN' """, columns=["name"]
)

returns

     name                                           geometry
0  Canada  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...

This suggests a possible error on our end, though I'm not yet sure how we'd get into a state where GDAL expects no features and yet we return some.


Per GDAL #9664, we should update our docs to indicate that using where against columns that are not listed in columns is not recommended when both parameters are provided.
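
As a rough illustration of that recommendation, a caller could make sure the filtered column is always requested. The helper name read_kwargs_with_filter_column is hypothetical and not part of pyogrio. The sketch assumes a simple equality filter on a trusted string value that contains no quote characters.

```python
def read_kwargs_with_filter_column(columns, where_column, where_value):
    """Build read keyword arguments so the filtered column is also requested.

    Hypothetical helper for illustration only; it is not part of pyogrio.
    Assumes where_value is trusted and contains no quote characters.
    """
    requested = list(columns)
    if where_column not in requested:
        # Keep the column referenced by the filter in the requested set
        requested.append(where_column)
    # Double quotes delimit the field name, single quotes the string literal,
    # matching the filter used in the report above
    where = f"\"{where_column}\" = '{where_value}'"
    return {"columns": requested, "where": where}


kwargs = read_kwargs_with_filter_column([], "iso_a3", "CAN")
```

The result could then be passed along as read_dataframe(filename, use_arrow=True, **kwargs), and the extra column dropped afterwards with df.drop(columns=["iso_a3"]) if it is not wanted.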

brendan-ward commented 2 months ago

Found our bug: we were computing the set of ignored fields after narrowing the list of fields down to those in columns, which meant that the ignored fields were never set and were never passed to GDAL.
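
One plausible reading of that ordering problem, sketched with plain lists rather than the actual pyogrio internals (both function names are hypothetical, and the field names are just examples):

```python
def ignored_fields_buggy(all_fields, columns):
    # Narrow first: only the requested fields remain in the working list
    fields = [f for f in all_fields if f in columns]
    # Deriving the ignored set from the narrowed list leaves nothing to
    # ignore, so nothing would ever be handed to GDAL
    return [f for f in fields if f not in columns]


def ignored_fields_fixed(all_fields, columns):
    # Derive the ignored set from the full field list, before any narrowing
    return [f for f in all_fields if f not in columns]


all_fields = ["name", "iso_a3", "pop_est"]
buggy = ignored_fields_buggy(all_fields, ["name"])
fixed = ignored_fields_fixed(all_fields, ["name"])
```

Under these assumptions, buggy comes out empty while fixed lists the two fields that were not requested.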

Fix forthcoming...