geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

Bug: read_arrow with attribute or spatial filtering produces unusable Table (sometimes) #326

Closed H-Plus-Time closed 6 months ago

H-Plus-Time commented 7 months ago

TL;DR: it might be worth calling combine_chunks unconditionally whenever an attribute filter, bbox, or mask is supplied.

Environment: conda-managed, Linux x64.

Reproduction:

import pyogrio
import pyogrio.raw
import geopandas as gpd

# NB: reproduces regardless of filesystem, but depends somewhat on the data
target_path = "s3://overturemaps-us-west-2/release/2023-11-14-alpha.0/part-00000-6cb89013-4ec2-4b94-8e4b-8e27c7d30865.c000.zstd.parquet"

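# NB: gpd.datasets was removed in geopandas 1.0; on newer versions, point read_file at a local copy of the data
countries_gdf = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))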
au_mask = countries_gdf[countries_gdf.name == 'Australia'].geometry.iloc[0]

meta, table = pyogrio.raw.read_arrow(target_path, mask=au_mask)
# Attempt to convert to pandas (fails with or without self_destruct=True)
table.to_pandas()  # ⚡ raises obscure, unpredictable column conversion errors

Supplying the max_features argument fixes this, because combine_chunks is called whenever that argument is present. Passing a very large max_features is enough as a temporary workaround.
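
The same effect can be had by calling combine_chunks by hand. The following is a minimal sketch, not pyogrio's implementation: read_arrow_combined is a hypothetical helper that calls pyogrio.raw.read_arrow and then combine_chunks() on the result, on the assumption (taken from the observation above) that consolidating the chunks is what makes the table usable. The read_arrow parameter is also an assumption, added only so the reader function can be swapped out.

```python
def read_arrow_combined(path, read_arrow=None, **kwargs):
    """Read `path` and return (meta, table) with the table's chunks combined.

    Hypothetical helper, not part of pyogrio. Assumes, per the report above,
    that combine_chunks() yields a usable table after filtering.
    """
    if read_arrow is None:
        # Imported lazily so the helper can be defined without pyogrio installed.
        from pyogrio.raw import read_arrow
    meta, table = read_arrow(path, **kwargs)
    # pyarrow.Table.combine_chunks() returns a new table in which each
    # column's chunks are concatenated; the original table is not modified.
    return meta, table.combine_chunks()
```

With the reproduction above, this would be called as read_arrow_combined(target_path, mask=au_mask).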

It seems to be restricted to GeoParquet files with Map<String, String> columns. Running validate(full=True) on just the map columns (sourceTags in this case), without first calling combine_chunks, produces this:

In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #0 invalid: Invalid: Offset invariant failure: non-monotonic offset at slot 3505: 23393 < 24692
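
A per-column check like the one that produced this message can be sketched as follows. find_invalid_columns is a hypothetical helper name. It assumes a pyarrow.Table as input and relies only on column_names, column() and validate(full=True), collecting the error message for each column that fails.

```python
def find_invalid_columns(table):
    """Return {column name: error message} for columns failing full validation.

    Hypothetical helper; `table` is expected to be a pyarrow.Table.
    """
    failures = {}
    for name in table.column_names:
        try:
            # full=True also runs the more expensive data-consistency checks,
            # which is where offset errors like the one above are reported.
            table.column(name).validate(full=True)
        except ValueError as exc:  # pyarrow.ArrowInvalid subclasses ValueError
            failures[name] = str(exc)
    return failures
```

Running it on the table before and after combine_chunks() would show whether the map columns are the only ones affected.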

I suspect the post-filter Arrow stream construction in GDAL is slightly buggy, since a number of 'null count greater than array length' style validation errors also crop up (these are much less severe, since to_pandas corrects them).

brendan-ward commented 7 months ago

It looks like this is fixed by GDAL #8768, which is included in the forthcoming 3.8.1 release.

rouault commented 7 months ago

@H-Plus-Time GDAL 3.8.1 has been released and is available in Conda-Forge.
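
For anyone who needs to decide at runtime whether the workaround is still required, a version check can be sketched as below. has_filter_fix and FIXED_IN are hypothetical names, and the check assumes only what this thread states: that the fix ships in GDAL 3.8.1. It says nothing about which earlier versions are affected.

```python
FIXED_IN = (3, 8, 1)  # release named in this thread as containing the fix


def has_filter_fix(gdal_version):
    """True if a (major, minor, patch) GDAL version tuple is at least 3.8.1."""
    return tuple(gdal_version)[:3] >= FIXED_IN
```

pyogrio exposes the GDAL version it was built against as the tuple pyogrio.__gdal_version__, so the check would be has_filter_fix(pyogrio.__gdal_version__).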