geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

BUG: use_arrow=True changes boolean values when reading Geopackage #334

Closed Lotuss91 closed 8 months ago

Lotuss91 commented 9 months ago

Hi, this is my first time submitting an issue on GitHub so I apologize in advance in case I do not follow the etiquette.

When reading a GeoPackage (generated using geopandas.to_file) with geopandas.read_file, the values of all the boolean columns are changed. The weird thing is that not all values in the columns are changed, but most of them are. Is it possible that the issue is related to PyArrow's own datatypes?

Below I provide the use case that generates the issue.

# normal behaviour
basic = gpd.read_file(input_folder / file_name)
pyogr = gpd.read_file(input_folder / file_name, engine="pyogrio")
arrow = gpd.read_file(input_folder / file_name, use_arrow=True)

# boolean values change
pyogr_arrow = gpd.read_file(input_folder / file_name, engine="pyogrio", use_arrow=True)

If useful, I can provide a sample file.

Thank You

martinfleis commented 9 months ago

the values of all the boolean columns are changed

Can you be more specific?

If useful, I can provide a sample file.

If it is a small one, it may help, thanks!

Lotuss91 commented 9 months ago

Thank you for the quick reply. By "the boolean columns are changed" I mean that several elements of the columns (more than 50% but not 100%) are flipped from False to True (or vice versa).

I sent a sample file to your e-mail.

martinfleis commented 9 months ago

Here's a sample file attached.

import pyogrio

basic = pyogrio.read_dataframe("sample.gpkg.zip")
w_arrow = pyogrio.read_dataframe("sample.gpkg.zip", use_arrow=True)

The two calls return different values in the boolean column foo.

>>> basic
     id    foo     geometry
0    18  False  POINT (0 0)
1    48   True  POINT (0 0)
2    49   True  POINT (0 0)
3    50   True  POINT (0 0)
4    51   True  POINT (0 0)
..  ...    ...          ...
95  349   True  POINT (0 0)
96  350   True  POINT (0 0)
97  351   True  POINT (0 0)
98  352   True  POINT (0 0)
99  393   True  POINT (0 0)

[100 rows x 3 columns]

>>> w_arrow
     id    foo     geometry
0    18   True  POINT (0 0)
1    48  False  POINT (0 0)
2    49  False  POINT (0 0)
3    50  False  POINT (0 0)
4    51  False  POINT (0 0)
..  ...    ...          ...
95  349  False  POINT (0 0)
96  350  False  POINT (0 0)
97  351  False  POINT (0 0)
98  352  False  POINT (0 0)
99  393  False  POINT (0 0)

[100 rows x 3 columns]

But the values are not just inverted.

>>> (basic.foo == w_arrow.foo).sum()
7
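To characterize the mismatch more precisely than a single count, the two columns can be tallied by (expected, actual) pair. Below is a minimal sketch using only the standard library; `tally_pairs` is a hypothetical helper, not part of pyogrio or geopandas, and the input lists are toy data rather than the values from the sample file.

```python
from collections import Counter


def tally_pairs(expected, actual):
    """Count how often each (expected, actual) boolean pair occurs."""
    expected = list(expected)
    actual = list(actual)
    if len(expected) != len(actual):
        raise ValueError("columns must have the same length")
    # bool() normalizes NumPy booleans; note that None would count as False.
    return Counter((bool(e), bool(a)) for e, a in zip(expected, actual))


# Toy data for illustration only, not the values from sample.gpkg.
counts = tally_pairs([False, True, True, True], [True, False, True, False])
matches = counts[(True, True)] + counts[(False, False)]
print(counts)
print("matching rows:", matches)
```

With the frames above, `tally_pairs(basic.foo, w_arrow.foo)` should show whether the disagreements are mostly True read as False or the reverse, since a pandas Series iterates over its values.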

The issue is already present in the results from read_arrow, so my hunch is that it comes like this from GDAL, but I'll leave this debugging to more capable maintainers.

sample.gpkg.zip

(I wanted to check if the same issue is present in sf but I would have to compile it as it comes with GDAL 3.5.3...)

Lotuss91 commented 9 months ago

Thank you for checking

But the values are not just inverted.

That is the weirdest thing. At first I thought it was scrambling the values, but that is not the case. Let me know if you need anything else.

theroggy commented 9 months ago

I wrote a script that reproduces the issue using only GDAL, and it indeed shows the same problem.

So I opened an issue in the GDAL issue tracker: https://github.com/OSGeo/gdal/issues/8998

theroggy commented 9 months ago

It has been fixed in GDAL, so when GDAL 3.8.3 is released, probably in a few months as GDAL 3.8.2 was only released recently, it should be solved...

Lotuss91 commented 9 months ago

Thank you for addressing the issue. Until 3.8.3 is released, would the only solution be to avoid using arrow and pyogrio together?

martinfleis commented 9 months ago

It is not about using pyogrio and arrow together, it is about using arrow. If you do gpd.read_file(input_folder / file_name, use_arrow=True), then the use_arrow keyword is silently ignored and the default Fiona I/O is used. You can't use arrow with geopandas in any other way than via pyogrio. So my recommendation would be to use gpd.read_file(input_folder / file_name, engine="pyogrio").
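One way to apply this recommendation without hard-coding it is to enable Arrow only when the linked GDAL is at least the version named in this thread as containing the fix. This is a rough sketch under assumptions: `arrow_read_is_safe` and `read_gpkg` are hypothetical helper names, and treating every GDAL older than 3.8.3 as affected is a conservative guess, not something confirmed here.

```python
from typing import Tuple

# Version named in this thread as containing the GDAL fix.
FIXED_GDAL = (3, 8, 3)


def arrow_read_is_safe(gdal_version: Tuple[int, ...]) -> bool:
    """Return True if the given GDAL version is at least the fixed release."""
    return tuple(gdal_version) >= FIXED_GDAL


def read_gpkg(path):
    """Read a file with pyogrio, using Arrow only when GDAL is new enough."""
    import pyogrio  # imported lazily so the version check above stays standalone

    use_arrow = arrow_read_is_safe(pyogrio.__gdal_version__)
    return pyogrio.read_dataframe(path, use_arrow=use_arrow)
```

Calling `read_gpkg("sample.gpkg.zip")` would then fall back to the non-Arrow reader on older GDAL builds and switch to Arrow automatically after an upgrade.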

theroggy commented 8 months ago

GDAL 3.8.3 has just been released...