geopandas / geopandas

Python tools for geographic data
http://geopandas.org/
BSD 3-Clause "New" or "Revised" License

Parquet writing does not support MultiIndex #3370

Open · martinfleis opened this issue 1 month ago

martinfleis commented 1 month ago

While pandas has no issue with a MultiIndex being used for columns when saving to Parquet, GeoPandas complains:

In [1]: import geopandas as gpd
   ...: import shapely

In [2]: gdf = gpd.GeoDataFrame(
   ...:     {("foo", "bar"): [0, 1], ("foo", "baz"): [0, 1], ("dog", "cat"): [0, 1]},
   ...:     geometry=[shapely.Point(0, 0), shapely.Point(1, 2)],
   ...: )

In [3]: gdf
Out[3]: 
  foo     dog     geometry
  bar baz cat             
0   0   0   0  POINT (0 0)
1   1   1   1  POINT (1 2)

In [4]: gdf.to_parquet("temp.pq")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 gdf.to_parquet("temp.pq")

File ~/Git/geopandas/geopandas/geodataframe.py:1377, in GeoDataFrame.to_parquet(self, path, index, compression, geometry_encoding, write_covering_bbox, schema_version, **kwargs)
   1370     raise ValueError(
   1371         "GeoPandas only supports using pyarrow as the engine for "
   1372         f"to_parquet: {engine!r} passed instead."
   1373     )
   1375 from geopandas.io.arrow import _to_parquet
-> 1377 _to_parquet(
   1378     self,
   1379     path,
   1380     compression=compression,
   1381     geometry_encoding=geometry_encoding,
   1382     index=index,
   1383     schema_version=schema_version,
   1384     write_covering_bbox=write_covering_bbox,
   1385     **kwargs,
   1386 )

File ~/Git/geopandas/geopandas/io/arrow.py:432, in _to_parquet(df, path, index, compression, geometry_encoding, schema_version, write_covering_bbox, **kwargs)
    427 parquet = import_optional_dependency(
    428     "pyarrow.parquet", extra="pyarrow is required for Parquet support."
    429 )
    431 path = _expand_user(path)
--> 432 table = _geopandas_to_arrow(
    433     df,
    434     index=index,
    435     geometry_encoding=geometry_encoding,
    436     schema_version=schema_version,
    437     write_covering_bbox=write_covering_bbox,
    438 )
    439 parquet.write_table(table, path, compression=compression, **kwargs)

File ~/Git/geopandas/geopandas/io/arrow.py:340, in _geopandas_to_arrow(df, index, geometry_encoding, schema_version, write_covering_bbox)
    336 from pyarrow import StructArray
    338 from geopandas.io._geoarrow import geopandas_to_arrow
--> 340 _validate_dataframe(df)
    342 if schema_version is not None:
    343     if geometry_encoding != "WKB" and schema_version != "1.1.0":

File ~/Git/geopandas/geopandas/io/arrow.py:246, in _validate_dataframe(df)
    244 # must have value column names (strings only)
    245 if df.columns.inferred_type not in {"string", "unicode", "empty"}:
--> 246     raise ValueError("Writing to Parquet/Feather requires string column names")
    248 # index level names must be strings
    249 valid_names = all(
    250     isinstance(name, str) for name in df.index.names if name is not None
    251 )

ValueError: Writing to Parquet/Feather requires string column names

In [5]: gdf.to_wkt().to_parquet("temp.pq")

We explicitly disallow this in _validate_dataframe (see the traceback above). It would be good to get rid of this limitation.
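Until this is supported, a stopgap is to flatten the MultiIndex into joined string names before writing and rebuild it after reading. A minimal pandas-only sketch; the separator and the helper names flatten_columns / restore_columns are my own, and the scheme assumes the separator never appears inside a level name:

```python
import pandas as pd

SEP = "__"  # hypothetical separator, assumed absent from all level names

def flatten_columns(df):
    # Join each tuple column name into one string, e.g. ("foo", "bar") -> "foo__bar"
    out = df.copy()
    out.columns = [SEP.join(map(str, col)) for col in df.columns]
    return out

def restore_columns(df):
    # Split the joined names back into tuples and rebuild the MultiIndex
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples(
        tuple(name.split(SEP)) for name in df.columns
    )
    return out

df = pd.DataFrame({("foo", "bar"): [0, 1], ("dog", "cat"): [0, 1]})
flat = flatten_columns(df)
round_tripped = restore_columns(flat)
```

One would then call to_parquet on the flattened frame as usual and restore after read_parquet. Note this changes the geometry column's name on disk, so it is only a workaround, not a fix.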

m-richards commented 1 month ago

I was having a quick look into this; it doesn't seem too hard to support. One question, though: if this works and you do gdf_read = gpd.read_parquet("temp.pq"), would you expect gdf_read.active_geometry_name to be "geometry" or ("geometry", "")? There's some discussion related to this over at #2088. I think it would be nice if this round-tripped faithfully, but it might be a bit harder to do.
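For context on why the name is ambiguous: when a plain string key is assigned into a frame whose columns are a MultiIndex, pandas pads the missing level with an empty string, so the stored label is a tuple even though the column was set under a bare string. A pandas-only illustration (no geopandas needed):

```python
import pandas as pd

# Two-level columns, as in the report
df = pd.DataFrame({("foo", "bar"): [0, 1], ("dog", "cat"): [0, 1]})

# Assigning under a plain string key pads the missing level with ""
df["geometry"] = ["POINT (0 0)", "POINT (1 2)"]

# The stored column label is the padded tuple ("geometry", "")
assert ("geometry", "") in df.columns
```

So after a round trip, both "geometry" and ("geometry", "") are defensible answers for active_geometry_name, which is exactly the question above.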

martinfleis commented 4 weeks ago

Whatever it was in the original gdf used to save the Parquet. But we may be limited here by the GeoParquet specification, which may only understand a simple string column name...