geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
259 stars 22 forks source link

ENH: Add test of Arrow roundtrip of boolean values, raise exception on GDAL <3.8.3 for bool values via Arrow API #335

Closed brendan-ward closed 5 months ago

brendan-ward commented 6 months ago

Test to reproduce #334 incorrect handling of boolean values using Arrow I/O (mostly to see if this also fails on GDAL >= 3.8.1).

As the root cause has been solved and released in GDAL, this PR now resolves https://github.com/geopandas/pyogrio/issues/334

theroggy commented 6 months ago

GDAL 3.8.3 fixes this and has just been released...

brendan-ward commented 5 months ago

Because misreading boolean values is a bad error, this adds an exception to read_arrow if boolean values are detected in the pyarrow table read from GDAL. I didn't add an exception to open_arrow because we don't know the contents of the stream it returns. I'm assuming that given the severity of this bug, it is better to raise an exception and instruct users to avoid the arrow API for GDAL < 3.8.3 than to do something less intrusive like raise a warning that might be ignored by the user.

I used an xfail for the failing test case so that we can detect if this fix gets backported to other GDAL versions (e.g., if there is a 3.7.4 that includes it)

theroggy commented 5 months ago

Because misreading boolean values is a bad error, this adds an exception to read_arrow if boolean values are detected in the pyarrow table read from GDAL. I didn't add an exception to open_arrow because we don't know the contents of the stream it returns. I'm assuming that given the severity of this bug, it is better to raise an exception and instruct users to avoid the arrow API for GDAL < 3.8.3 than to do something less intrusive like raise a warning that might be ignored by the user.

Good idea!

I used an xfail for the failing test case so that we can detect if this fix gets backported to other GDAL versions (e.g., if there is a 3.7.4 that includes it)

I'm not sure if the xfail is that interesting as the test testing for the exception for GDAL < 3.8 will fail anyway. It feels like it's a pity that it is not possible to specify a match= in the xfail decorator. If that would be possible I think a single test with xfail could get the following behaviour which is I think the ideal behaviour to get?

Or am I missing something?

brendan-ward commented 5 months ago

No, you're right: I didn't have my logic correct for the xfail because we also specifically check the GDAL version in the implementation and always raise an exception. I don't want to make that more granular by preemptively adding GDAL versions that don't yet exist to try and catch this, e.g., in read_arrow:

if not (gdal_version >= (3, 8, 3) or gdal_version > (3,7,3)) and any(
        [col_type == "bool" for col_type in table.schema.types]
    )

Instead, we probably just have to watch the changelog for GDAL and update the implementation if the fix gets backported (it may not; I recall that some of the other Arrow fixes were not going to be backported from the 3.8.x lineage).

Will remove the xfail cases...

jorisvandenbossche commented 5 months ago

According to the GDAL PR fixing this, this only affects reading GPKG/FlatGeoBuf files. Would we have a way to check for that? (we don't know the driver because that gets inferred while reading, except if we explicitly get that info (read_info returns it?). But maybe we could then only do that when we detect there is a boolean column)

brendan-ward commented 5 months ago

Indeed; I had assumed it applied to all the bool drivers because it was in OGRArrowArrayHelper::SetBoolOn() (even though it did specifically mention only those drivers).

It looks like boolean values roundtripped properly for SQLite / GeoJSON / GML (not able to easily test postgres) - so our exception would be raised when the functionality is just fine. Will add a fix for this.

jorisvandenbossche commented 5 months ago

Maybe it is because that utility function is only used by drivers that have a custom implementation of GetArrowStream that sidesteps the internal GDAL data model (which are only GPKG and FlatGeoBuf, AFAIK, apart from Parquet/Arrow, but those already are already in arrow format after reading, of course). Other drivers fall back to a default implementation based on the standard APIs to read data (like we also use)