Closed by Lotuss91 8 months ago
the values of all the boolean columns are changed
Can you be more specific?
If useful, I can provide a sample file.
If it is a small one, it may help, thanks!
Thank you for the quick reply.
By "the boolean columns are changed" I mean that several elements (more than 50% but not 100%) of the columns are flipped from False to True (or vice versa).
I sent a sample file to your e-mail.
Here's a sample file attached.
basic = pyogrio.read_dataframe("sample.gpkg.zip")
w_arrow = pyogrio.read_dataframe("sample.gpkg.zip", use_arrow=True)
The two calls return different values in the boolean column foo.
>>> basic
id foo geometry
0 18 False POINT (0 0)
1 48 True POINT (0 0)
2 49 True POINT (0 0)
3 50 True POINT (0 0)
4 51 True POINT (0 0)
.. ... ... ...
95 349 True POINT (0 0)
96 350 True POINT (0 0)
97 351 True POINT (0 0)
98 352 True POINT (0 0)
99 393 True POINT (0 0)
[100 rows x 3 columns]
>>> w_arrow
id foo geometry
0 18 True POINT (0 0)
1 48 False POINT (0 0)
2 49 False POINT (0 0)
3 50 False POINT (0 0)
4 51 False POINT (0 0)
.. ... ... ...
95 349 False POINT (0 0)
96 350 False POINT (0 0)
97 351 False POINT (0 0)
98 352 False POINT (0 0)
99 393 False POINT (0 0)
[100 rows x 3 columns]
But the values are not just inverted.
>>> (basic.foo == w_arrow.foo).sum()
7
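The check above counts matching rows; a small helper can also say outright whether one column is a plain inversion of the other. This is only a sketch: `compare_bool_columns` is a hypothetical name, and the toy lists are made up rather than taken from sample.gpkg.zip.

```python
from typing import Iterable


def compare_bool_columns(left: Iterable[bool], right: Iterable[bool]) -> dict:
    """Summarise how two boolean columns differ, row by row."""
    left, right = list(left), list(right)
    if len(left) != len(right):
        raise ValueError("columns must have the same length")
    # Count the rows where both reads agree.
    equal = sum(a == b for a, b in zip(left, right))
    return {
        "rows": len(left),
        "equal": equal,
        "flipped": len(left) - equal,
        # A pure inversion would leave no row equal.
        "pure_inversion": len(left) > 0 and equal == 0,
    }


# Hypothetical toy data, not the values from the sample file.
basic_foo = [False, True, True, True, False]
arrow_foo = [True, False, False, True, False]
summary = compare_bool_columns(basic_foo, arrow_foo)
print(summary)
```

The function accepts any iterable, so it should also work when passed `basic.foo` and `w_arrow.foo` from the two reads directly. With 7 matching rows out of 100, `pure_inversion` would come back False, which is the point made above.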
The issue is already present in the results from read_arrow, so my hunch is that it comes like this from GDAL, but I'll leave this debugging to more capable maintainers.
(I wanted to check if the same issue is present in sf but I would have to compile it as it comes with GDAL 3.5.3...)
Thank you for checking
But the values are not just inverted.
That is the weirdest part: at first I thought it was scrambling the values, but that is not the case. Let me know if you need anything else.
I wrote a script that reproduces the issue only using gdal... and this indeed shows the same problem.
So I opened an issue in the gdal issue tracker: https://github.com/OSGeo/gdal/issues/8998
It has been fixed in GDAL, so when GDAL 3.8.3 is released, probably in a few months as GDAL 3.8.2 was only released recently, it should be solved...
Thank you for addressing the issue.
Until 3.8.3 is released, would the only solution be to avoid using arrow and pyogrio together?
It is not about using pyogrio and arrow together, it is about using arrow. If you do gpd.read_file(input_folder / file_name, use_arrow=True), then the use_arrow keyword is silently ignored and the default Fiona I/O is used. You can't use arrow with geopandas in any other way than via pyogrio. So my recommendation would be to use gpd.read_file(input_folder / file_name, engine="pyogrio").
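One way to act on this without hard-coding the choice is to enable arrow only when the installed GDAL already contains the fix. The sketch below is not part of pyogrio or geopandas: `arrow_read_is_safe` and `read_kwargs` are hypothetical helper names, and treating every release before 3.8.3 as affected is a conservative assumption, since the thread only states where the fix lands.

```python
def arrow_read_is_safe(gdal_version: tuple) -> bool:
    """Return True when the GDAL version includes the boolean-column fix.

    Assumption: any version below 3.8.3 is treated as affected.
    """
    return tuple(gdal_version[:3]) >= (3, 8, 3)


def read_kwargs(gdal_version: tuple) -> dict:
    """Build keyword arguments for geopandas.read_file."""
    kwargs = {"engine": "pyogrio"}
    if arrow_read_is_safe(gdal_version):
        # Only opt in to the arrow code path on a fixed GDAL.
        kwargs["use_arrow"] = True
    return kwargs


print(read_kwargs((3, 8, 2)))
print(read_kwargs((3, 8, 3)))
```

As far as I know, pyogrio exposes the GDAL version it was built against as the tuple `pyogrio.__gdal_version__`, so a call could look like `gpd.read_file(input_folder / file_name, **read_kwargs(pyogrio.__gdal_version__))`; please verify that attribute against your installed pyogrio version.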
GDAL 3.8.3 has just been released...
Hi, this is my first time submitting an issue on GitHub so I apologize in advance in case I do not follow the etiquette.
When reading a geopackage (generated using geopandas.to_file) with geopandas.read_file, the values of all the boolean columns are changed. The weird thing is that not all values in the columns are changed, but most of them are. Is it possible that the issue is related to PyArrow's own datatypes? Below I provide the use case that generates the issue.
If useful, I can provide a sample file.
Thank you