TL;DR: it might be worth calling combine_chunks unconditionally whenever an attribute filter, bbox, or mask is supplied.
Environment details (conda managed, linux x64):

```
python 3.10.13
pyogrio 0.7.2
pyarrow, libarrow* 14.0.1
gdal 3.8.0
libgdal-arrow-parquet 3.8.0
geopandas 0.14.1 (for the mask)
fsspec, s3fs 2023.10.0
```
Reproduction:

```python
import pyogrio
import pyogrio.raw
import geopandas as gpd

# NB: independent of fs, but somewhat dependent on data
target_path = "s3://overturemaps-us-west-2/release/2023-11-14-alpha.0/part-00000-6cb89013-4ec2-4b94-8e4b-8e27c7d30865.c000.zstd.parquet"

countries_gdf = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
au_mask = countries_gdf[countries_gdf.name == 'Australia'].geometry.iloc[0]

meta, table = pyogrio.raw.read_arrow(target_path, mask=au_mask)

# Attempt to convert to pandas, with or without self-destruct
table.to_pandas()  # ⚡ throws obscure/unpredictable column conversion errors
```
Supplying the max_features flag fixes this, because combine_chunks is called when that flag is present (passing a very large max_features works as a temporary workaround).
The problem seems restricted to GeoParquet files involving `Map<String, String>` columns. Running `validate(full=True)` on just the map columns (`sourceTags` in this case), without first calling combine_chunks, gets you this:

```
In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #0 invalid: Invalid: Offset invariant failure: non-monotonic offset at slot 3505: 23393 < 24692
```
I suspect GDAL's post-filter Arrow stream construction is slightly bugged, given that a number of 'null count greater than array length'-style validation errors also crop up (though these are much less severe, since to_pandas corrects them).