geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

ENH: Support pandas nullable dtypes such as boolean and string #219

Closed jtmiclat closed 1 year ago

jtmiclat commented 1 year ago

Hi, I was getting an error from write_dataframe() when writing GeoDataFrames containing booleans returned by google-cloud-bigquery:

from google.cloud import bigquery
import pyogrio
client = bigquery.Client()

gdf = client.query("""SELECT ST_GEOGPOINT(1, 1) as geometry, True as bool_field """).to_geodataframe()
pyogrio.write_dataframe(gdf, "test2.json", driver="GeoJSON" )

# usr/local/lib/python3.8/dist-packages/pyogrio/_io.pyx in pyogrio._io.infer_field_types()
# 
# NotImplementedError: field type is not supported boolean (field index: 0)

After some digging, I figured out that the dtype returned for bool_field was pandas.BooleanDtype (boolean) instead of bool, and I was able to replicate it without BigQuery. It works fine with fiona but breaks with pyogrio:

import geopandas as gpd
from shapely import Point
from pandas import BooleanDtype
import pyogrio

gdf = gpd.GeoDataFrame([{'geometry': Point(1,1), "bool_field":True}])
gdf.dtypes

# geometry      geometry
# bool_field        bool
# dtype: object

gdf2 = gdf.astype({'bool_field': BooleanDtype()})
gdf2.dtypes

# geometry      geometry
# bool_field     boolean
# dtype: object

# Works with bool dtype 
pyogrio.write_dataframe(gdf, "1.json", driver="GeoJSON" )

# Works with boolean dtype via GeoDataFrame.to_file (fiona)
gdf2.to_file("test.json", driver="GeoJSON")

# This throws the same error
pyogrio.write_dataframe(gdf2, "test2.json", driver="GeoJSON" )

# usr/local/lib/python3.8/dist-packages/pyogrio/_io.pyx in pyogrio._io.infer_field_types()
# 
# NotImplementedError: field type is not supported boolean (field index: 0)

My hunch is that boolean should be added here: https://github.com/geopandas/pyogrio/blob/75e8f13940fea6e30554115760275c7da978058c/pyogrio/_io.pyx#L60-L84

Thanks for the wonderful work!

jorisvandenbossche commented 1 year ago

@jtmiclat Thanks for the report! In general, we don't yet support the pandas nullable dtypes such as boolean.

As long as there are no missing values, adding boolean to the DTYPE_OGR_FIELD_TYPES mapping might be sufficient, but to handle missing values we will certainly need to recognize pd.NA as a missing value. It might also be more efficient to support passing field data as a values array plus a mask array.
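To illustrate the "values + mask" idea, here is a minimal sketch using only pandas and NumPy. The helper name split_values_and_mask is hypothetical and is not part of pyogrio; it only shows how a nullable boolean column could be separated into a plain NumPy array and a boolean mask marking the pd.NA positions.

```python
import numpy as np
import pandas as pd


def split_values_and_mask(series: pd.Series):
    """Hypothetical helper (not a pyogrio API): split a nullable boolean
    column into a plain NumPy values array and a missing-value mask."""
    # True where the value is missing (pd.NA)
    mask = series.isna().to_numpy()
    # Missing entries get a placeholder (False) so the result is a plain
    # NumPy bool array; the mask records which entries are placeholders.
    values = series.to_numpy(dtype="bool", na_value=False)
    return values, mask


col = pd.Series([True, pd.NA, False], dtype="boolean")
values, mask = split_values_and_mask(col)
```

A writer could then set the field as null wherever the mask is True and use the value otherwise, without inspecting each Python object for pd.NA.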

jtmiclat commented 1 year ago

@jorisvandenbossche I did some initial testing: adding boolean to DTYPE_OGR_FIELD_TYPES does address my issue, but it fails when there is a pd.NA in the column. The error message isn't very clear for the user:

>   OGR_F_SetFieldInteger(ogr_feature, field_idx, field_value)
E   TypeError: an integer is required

pyogrio/_io.pyx:1631: TypeError

I think it is best to wait for support for recognizing pd.NA. Thanks!
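Until pd.NA is recognized, a user-side check before writing can give a clearer message than the TypeError above. This is only a sketch: check_no_missing_boolean is a hypothetical helper, not something pyogrio provides, and it only looks at columns with the nullable boolean dtype.

```python
import pandas as pd


def check_no_missing_boolean(df: pd.DataFrame) -> None:
    """Hypothetical pre-write check (not a pyogrio API): raise a readable
    error if a nullable boolean column contains pd.NA."""
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, pd.BooleanDtype) and df[name].isna().any():
            raise ValueError(
                f"column {name!r} has dtype 'boolean' and contains pd.NA; "
                "fill or drop the missing values before writing"
            )


ok = pd.DataFrame({"flag": pd.array([True, False], dtype="boolean")})
check_no_missing_boolean(ok)  # passes silently

bad = pd.DataFrame({"flag": pd.array([True, pd.NA], dtype="boolean")})
try:
    check_no_missing_boolean(bad)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```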

m-richards commented 1 year ago

I imagine that supporting writing dataframes with dtype="string" falls into a similar category, as that is also nullable? I've been introducing pyogrio to some colleagues, who are super impressed by the speed difference compared to fiona for reading large networks, and we came across this behaviour difference with fiona.

Oreilles commented 1 year ago

It seems that the string dtypes and their variants (string[python], string[pyarrow]), as well as category, don't work out of the box and need to be cast to object.
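For reference, the cast-to-object workaround could look roughly like the sketch below. The helper to_object_columns is hypothetical and uses plain pandas only; the resulting frame would then be passed to the writer. Note that missing values remain pd.NA inside the object columns, and whether the writer accepts those is not established here.

```python
import pandas as pd

# Extension dtypes this sketch converts; the selection is an assumption
# based on the dtypes mentioned in this thread.
CAST_TO_OBJECT = (pd.StringDtype, pd.CategoricalDtype, pd.BooleanDtype)


def to_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not a pyogrio API): return a copy of df with
    string, category and nullable boolean columns cast to object."""
    out = df.copy()
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, CAST_TO_OBJECT):
            out[name] = out[name].astype(object)
    return out


df = pd.DataFrame(
    {
        "name": pd.array(["a", "b"], dtype="string"),
        "kind": pd.Categorical(["x", "y"]),
        "n": [1, 2],
    }
)
converted = to_object_columns(df)
```

Other columns, including a geometry column in a GeoDataFrame, are left untouched because their dtypes are not in the tuple.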

Maybe we should change the title of this issue to indicate that it is a broader issue, or open another one?

Some documentation in that regard would be welcome too.

jtmiclat commented 1 year ago

@Oreilles I renamed the issue to an ENH request to support nullable fields. I think category and other custom dtype support would be a separate issue!