Closed by kylebarron 8 months ago
Hmm, that's a bit tricky. You have missing geometries, and those WKB values first get put in an ArrowDtype(pa.binary()) column, and then this column is passed to shapely.from_wkb. That function converts its input to a numpy array, and at that point pandas inserts pd.NA instead of None for the missing values, because the column has an ArrowDtype.
To illustrate that, we can reproduce it with the following as well:
In [8]: import shapely
In [9]: wkb_arr = shapely.to_wkb(shapely.from_wkt(["POINT (1 1)", None]))
In [10]: wkb_arr
Out[10]:
array([b'\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?',
None], dtype=object)
In [12]: pa.array(wkb_arr)
Out[12]:
<pyarrow.lib.BinaryArray object at 0x7fde920c69e0>
[
0101000000000000000000F03F000000000000F03F,
null
]
In [13]: pd.Series(wkb_arr, dtype=pd.ArrowDtype(pa.binary()))
Out[13]:
0 b'\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00...
1 <NA>
dtype: binary[pyarrow]
In [14]: np.asarray(pd.Series(wkb_arr, dtype=pd.ArrowDtype(pa.binary())))
Out[14]:
array([b'\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?',
<NA>], dtype=object)
In [15]: shapely.from_wkb(_)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[15], line 1
----> 1 shapely.from_wkb(_)
File ~/miniconda3/envs/geo-dev2/lib/python3.11/site-packages/shapely/io.py:325, in from_wkb(geometry, on_invalid, **kwargs)
321 # ensure the input has object dtype, to avoid numpy inferring it as a
322 # fixed-length string dtype (which removes trailing null bytes upon access
323 # of array elements)
324 geometry = np.asarray(geometry, dtype=object)
--> 325 return lib.from_wkb(geometry, invalid_handler, **kwargs)
TypeError: Expected bytes or string, got NAType
One workaround would be to pass the original arrow column to from_wkb inside read_dataframe; another is to explicitly convert the column to a numpy array with None as the NA value:
In [17]: ser.to_numpy()
Out[17]:
array([b'\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?',
<NA>], dtype=object)
In [18]: ser.to_numpy(na_value=None)
Out[18]:
array([b'\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?',
None], dtype=object)
In [19]: shapely.from_wkb(ser.to_numpy(na_value=None))
Out[19]: array([<POINT (1 1)>, None], dtype=object)
Is there a fix we need to add to 0.7.2 before it goes out?
Opened a PR with a fix at https://github.com/geopandas/pyogrio/pull/321
I don't have the time right now to dig into exactly what's happening. If I comment out passing in arrow_to_pandas_kwargs, it works. But the error is in geometry handling, not attribute handling, which I wouldn't have expected. My pyogrio version is 0.7.1.
Code:
Traceback: