geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

TypeError: Expected bytes or string, got NAType when using Arrow-pandas `types_mapper` #319

Closed: kylebarron closed this issue 8 months ago

kylebarron commented 8 months ago

I don't have the time right now to dig exactly into what's happening. If I comment out passing in arrow_to_pandas_kwargs, it works. But the error is in geometry handling, not attribute handling, which I wouldn't have expected.

My pyogrio version is 0.7.1

Code:

import geodatasets
import geopandas as gpd
import pandas as pd

arrow_to_pandas_kwargs = {
    'types_mapper': lambda pa_dtype: pd.ArrowDtype(pa_dtype)
}
gdf = gpd.read_file(
    geodatasets.get_path("geoda.cars"),
    engine="pyogrio",
    use_arrow=True,
    arrow_to_pandas_kwargs=arrow_to_pandas_kwargs,
    X_POSSIBLE_NAMES="Longitude",
    Y_POSSIBLE_NAMES="Latitude",
    KEEP_GEOM_COLUMNS="NO",
)

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[21], line 4
      1 arrow_to_pandas_kwargs = {
      2     'types_mapper': lambda pa_dtype: pd.ArrowDtype(pa_dtype)
      3 }
----> 4 gdf = gpd.read_file(
      5     geodatasets.get_path("geoda.cars"),
      6     engine="pyogrio",
      7     use_arrow=True,
      8     arrow_to_pandas_kwargs=arrow_to_pandas_kwargs,
      9     X_POSSIBLE_NAMES="Longitude",
     10     Y_POSSIBLE_NAMES="Latitude",
     11     KEEP_GEOM_COLUMNS="NO",
     12 )

File ~/github/developmentseed/lonboard/.venv/lib/python3.11/site-packages/geopandas/io/file.py:271, in _read_file(filename, bbox, mask, rows, engine, **kwargs)
    268             from_bytes = True
    270 if engine == "pyogrio":
--> 271     return _read_file_pyogrio(filename, bbox=bbox, mask=mask, rows=rows, **kwargs)
    273 elif engine == "fiona":
    274     if pd.api.types.is_file_like(filename):

File ~/github/developmentseed/lonboard/.venv/lib/python3.11/site-packages/geopandas/io/file.py:427, in _read_file_pyogrio(path_or_bytes, bbox, mask, rows, **kwargs)
    424     kwargs["read_geometry"] = False
    426 # TODO: if bbox is not None, check its CRS vs the CRS of the file
--> 427 return pyogrio.read_dataframe(path_or_bytes, bbox=bbox, **kwargs)

File ~/github/developmentseed/lonboard/.venv/lib/python3.11/site-packages/pyogrio/geopandas.py:278, in read_dataframe(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, mask, fids, sql, sql_dialect, fid_as_index, use_arrow, arrow_to_pandas_kwargs, **kwargs)
    276     return pd.DataFrame()
    277 elif geometry_name in df.columns:
--> 278     df["geometry"] = from_wkb(df.pop(geometry_name), crs=meta["crs"])
    279     if force_2d:
    280         df["geometry"] = shapely.force_2d(df["geometry"])

File ~/github/developmentseed/lonboard/.venv/lib/python3.11/site-packages/geopandas/array.py:184, in from_wkb(data, crs)
    170 def from_wkb(data, crs=None):
    171     """
    172     Convert a list or array of WKB objects to a GeometryArray.
    173 
   (...)
    182 
    183     """
--> 184     return GeometryArray(vectorized.from_wkb(data), crs=crs)

File ~/github/developmentseed/lonboard/.venv/lib/python3.11/site-packages/geopandas/_vectorized.py:176, in from_wkb(data)
    172 """
    173 Convert a list or array of WKB objects to a np.ndarray[geoms].
    174 """
    175 if compat.USE_SHAPELY_20:
--> 176     return shapely.from_wkb(data)
    177 if compat.USE_PYGEOS:
    178     return pygeos.from_wkb(data)

File ~/github/developmentseed/lonboard/.venv/lib/python3.11/site-packages/shapely/io.py:320, in from_wkb(geometry, on_invalid, **kwargs)
    316 # ensure the input has object dtype, to avoid numpy inferring it as a
    317 # fixed-length string dtype (which removes trailing null bytes upon access
    318 # of array elements)
    319 geometry = np.asarray(geometry, dtype=object)
--> 320 return lib.from_wkb(geometry, invalid_handler, **kwargs)

TypeError: Expected bytes or string, got NAType

jorisvandenbossche commented 8 months ago

Hmm, that's a bit tricky. You have missing geometries, and those WKB values first get put in an ArrowDtype(pa.binary()) column, and then this column is passed to shapely.from_wkb. That function converts its input to a numpy array, and at that point pandas inserts pd.NA instead of None for the missing values, because the column has an ArrowDtype.

To illustrate that, we can reproduce it with the following as well:

In [8]: import numpy as np
   ...: import pandas as pd
   ...: import pyarrow as pa
   ...: import shapely

In [9]: wkb_arr = shapely.to_wkb(shapely.from_wkt(["POINT (1 1)", None]))

In [10]: wkb_arr
Out[10]: 
array([b'\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?',
       None], dtype=object)

In [12]: pa.array(wkb_arr)
Out[12]: 
<pyarrow.lib.BinaryArray object at 0x7fde920c69e0>
[
  0101000000000000000000F03F000000000000F03F,
  null
]

In [13]: pd.Series(wkb_arr, dtype=pd.ArrowDtype(pa.binary()))
Out[13]: 
0    b'\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00...
1                                                 <NA>
dtype: binary[pyarrow]

In [14]: np.asarray(pd.Series(wkb_arr, dtype=pd.ArrowDtype(pa.binary())))
Out[14]: 
array([b'\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?',
       <NA>], dtype=object)

In [15]: shapely.from_wkb(_)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 shapely.from_wkb(_)

File ~/miniconda3/envs/geo-dev2/lib/python3.11/site-packages/shapely/io.py:325, in from_wkb(geometry, on_invalid, **kwargs)
    321 # ensure the input has object dtype, to avoid numpy inferring it as a
    322 # fixed-length string dtype (which removes trailing null bytes upon access
    323 # of array elements)
    324 geometry = np.asarray(geometry, dtype=object)
--> 325 return lib.from_wkb(geometry, invalid_handler, **kwargs)

TypeError: Expected bytes or string, got NAType

One workaround would be to pass the original Arrow column to from_wkb inside read_dataframe, or to explicitly convert the column to a numpy array with None as the NA value (below, ser is the ArrowDtype Series from In [13]):

In [17]: ser.to_numpy()
Out[17]: 
array([b'\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?',
       <NA>], dtype=object)

In [18]: ser.to_numpy(na_value=None)
Out[18]: 
array([b'\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?',
       None], dtype=object)

In [19]: shapely.from_wkb(ser.to_numpy(na_value=None))
Out[19]: array([<POINT (1 1)>, None], dtype=object)

brendan-ward commented 8 months ago

Is there a fix we need to add to 0.7.2 before it goes out?

jorisvandenbossche commented 8 months ago

Opened a PR with a fix at https://github.com/geopandas/pyogrio/pull/321