geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

dask geopandas to parquet does not seem to persist spatial partitions #260

Open v2thegreat opened 9 months ago

v2thegreat commented 9 months ago

Problem

When a Dask GeoPandas GeoDataFrame is written to Parquet, its spatial partitions do not appear to be persisted. After reading the dataset back, spatial_partitions is None.

Expected Behavior

The spatial partitions of the GeoDataFrame should be persisted in the resulting Parquet files and be available again when the dataset is read back. Spatial queries would then be much faster, because only the partitions that overlap the query need to be loaded.

Steps to Reproduce

  1. Create a Dask GeoPandas GeoDataFrame and spatially shuffle it.
  2. Write the GeoDataFrame to Parquet with to_parquet.
  3. Read the Parquet dataset back with dask_geopandas.read_parquet.
  4. Observe that spatial_partitions on the reloaded GeoDataFrame is None.

Example Code

import dask_geopandas as dg
import geopandas as gpd

# Create a GeoDataFrame with Dask GeoPandas
sample_data = dg.from_geopandas(gpd.read_file('path/to/shapefile.shp'), npartitions=4)
sample_data = sample_data.spatial_shuffle(shuffle='tasks')
sample_data.spatial_partitions.explore() # visualize the spatial partitions here

# Write the GeoDataFrame to Parquet
sample_data.to_parquet('path/to/output', write_metadata_file=True)

# Read the dataset back and check the spatial partitions
sample_data_reloaded = dg.read_parquet('path/to/output', gather_spatial_partitions=True)
sample_data_reloaded.spatial_partitions # None

My end goal is to query the data quickly by loading only the partitions whose bounds overlap the area I'm interested in, for example when using clip or cx.
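To make the goal concrete, here is a minimal sketch of the partition pruning idea in plain Python. It is a conceptual illustration only and does not use the dask-geopandas API: the helper names (boxes_intersect, partitions_to_read) and the partition bounds are hypothetical, and real spatial partitions are polygons rather than simple boxes.

```python
# Conceptual sketch only: plain Python, not the dask-geopandas API.
# Bounds are (minx, miny, maxx, maxy) tuples; all values below are made up.

def boxes_intersect(a, b):
    """Return True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def partitions_to_read(partition_bounds, query_bounds):
    """Return the indices of partitions whose bounds intersect the query box."""
    return [
        i
        for i, bounds in enumerate(partition_bounds)
        if boxes_intersect(bounds, query_bounds)
    ]


# Hypothetical bounds for four partitions after a spatial shuffle.
partition_bounds = [
    (0.0, 0.0, 5.0, 5.0),
    (5.0, 0.0, 10.0, 5.0),
    (0.0, 5.0, 5.0, 10.0),
    (5.0, 5.0, 10.0, 10.0),
]

query = (1.0, 1.0, 2.0, 2.0)
selected = partitions_to_read(partition_bounds, query)
print(selected)  # only partition 0 overlaps the query box
```

If the partition bounds are not stored with the data, there is nothing to test the query against, and every partition has to be read.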

TomAugspurger commented 6 months ago

I think this is working properly with the latest versions of these libraries:

dask 2024.2.1, pyarrow 15.0.0, geopandas 0.14.3, and dask-geopandas main. Can you try this script?

import geodatasets
import dask_geopandas
import geopandas

# Load a sample dataset, split it into two partitions, and spatially shuffle it
df = geopandas.read_file(geodatasets.get_url("geoda airbnb"))
ddf = dask_geopandas.from_geopandas(df, npartitions=2).spatial_shuffle()

# Write to Parquet, then read back from the same path and inspect the partitions
ddf.to_parquet("/tmp/out.parquet")
dask_geopandas.read_parquet("/tmp/out.parquet").spatial_partitions

For me that outputs

0    POLYGON ((-87.59852 41.77113, -87.59852 42.023...
1    POLYGON ((-87.52414 41.64454, -87.52414 41.977...
dtype: geometry
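
Since the result seems to depend on library versions, it may help to compare environments. The snippet below is a small sketch that prints the installed versions of the packages mentioned above using only the standard library. The helper name installed_versions is hypothetical, and the package names are assumed to match the distribution names on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = "not installed"
    return result


# Distribution names assumed to match those on PyPI.
packages = ["dask", "pyarrow", "geopandas", "dask-geopandas"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver}")
```

Note that an install from the main branch reports a development version string rather than a release number.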