geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Dissolve using dask-geopandas #313

Open emaildipen opened 4 weeks ago

emaildipen commented 4 weeks ago

I have around 200k polygons in a shapefile, and I want to dissolve the polygons that are connected to each other. ArcGIS offers simple techniques to achieve this, but I was wondering if there are quicker ways to do it. I’ve tried the following but it took ages to execute.

import dask_geopandas as dd

# Read the shapefile
ddf = dd.read_file(input_shapefile, npartitions=10)

# Dissolve polygons that are connected with each other
ddf['dissolve'] = 1  # Create a dummy column for dissolving
dissolved_gdf = ddf.dissolve('dissolve', split_out=11, sort=False)

# Explode the dissolved multipolygon into individual polygons and reset index
dissolved_gdf = dissolved_gdf.explode().reset_index(drop=True)

# Add an index column
dissolved_gdf['index'] = dissolved_gdf.index

dissolved_gdf.compute().to_file(output_shapefile_filled, use_arrow=True)

martinfleis commented 3 weeks ago

Do you need dask-geopandas? Because if you are fine with vanilla geopandas, it will be much easier. And 200k should be perfectly fine.

You need to identify connected components and dissolve by a component label. That is tricky in a distributed setting. But in a single GeoDataFrame, it is easy with the help of libpysal (or with scipy alone).

from libpysal import graph

comp_label = graph.Graph.build_contiguity(gdf, rook=False).component_labels

gdf.dissolve(comp_label)

If you know that you have a correct polygonal coverage, you can even use the much faster coverage union.

gdf.dissolve(comp_label, method="coverage")
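
For reference, the scipy-only route mentioned above might look roughly like the sketch below. It is not code from this thread: the helper name `component_labels` and the toy data are made up for illustration, it assumes GeoPandas 1.0 or later (where `sindex.query` accepts a whole GeoSeries), and it uses the `intersects` predicate as a stand-in for queen contiguity, so overlapping polygons are grouped as well as touching ones.

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from shapely.geometry import box


def component_labels(gdf):
    """Label each row with the id of the connected group it belongs to."""
    # Positional index pairs (i, j) where geometry i intersects geometry j.
    # "intersects" also matches polygons that only touch at an edge or corner.
    left, right = gdf.sindex.query(gdf.geometry, predicate="intersects")
    n = len(gdf)
    adjacency = coo_matrix(
        (np.ones(len(left), dtype=bool), (left, right)), shape=(n, n)
    )
    _, labels = connected_components(adjacency, directed=False)
    return pd.Series(labels, index=gdf.index, name="component")


# Toy data (hypothetical): two squares sharing an edge plus one isolated square.
gdf = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 5, 6, 6)],
)
labels = component_labels(gdf)
dissolved = gdf.dissolve(labels)
```

Here the first two squares receive the same label and are merged, while the isolated square keeps its own.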

emaildipen commented 3 weeks ago

Thanks! Yes, I do need Dask since I’ll be processing millions of polygons. I added map_partitions to my function, and it worked. However, the problem now is that it takes a long time to convert the result to a GeoPandas GeoDataFrame.

martinfleis commented 3 weeks ago

map_partitions will work only if you ensure that a single component is always within a single partition. If a component stretches across multiple partitions, the approach will not work.
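
A small sketch of why this matters, using plain GeoPandas slices as stand-ins for Dask partitions. Everything here is illustrative rather than taken from the thread: `dissolve_connected` is a hypothetical helper, the data is a toy example, and GeoPandas 1.0 or later is assumed. The final step, re-running the dissolve on the concatenated per-partition output, is one possible workaround when that output is small enough to fit in memory, not a recommendation made in this issue.

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from shapely.geometry import box


def dissolve_connected(gdf):
    """Dissolve touching or overlapping geometries within one GeoDataFrame."""
    gdf = gdf.reset_index(drop=True)
    left, right = gdf.sindex.query(gdf.geometry, predicate="intersects")
    n = len(gdf)
    adjacency = coo_matrix(
        (np.ones(len(left), dtype=bool), (left, right)), shape=(n, n)
    )
    _, labels = connected_components(adjacency, directed=False)
    return gdf.dissolve(by=labels)[["geometry"]].reset_index(drop=True)


# Three squares forming one chain, plus one isolated square (toy data).
gdf = gpd.GeoDataFrame(
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1), box(10, 0, 11, 1)]
)

# Dissolving everything at once finds the two true components.
global_result = dissolve_connected(gdf)

# Simulated partitions that cut the chain between the second and third square.
partitions = [gdf.iloc[:2], gdf.iloc[2:]]
per_partition = pd.concat(
    [dissolve_connected(part) for part in partitions], ignore_index=True
)

# The chain is left in two pieces; a second pass over the combined output
# joins them again.
merged = dissolve_connected(per_partition)
```

With this split, the per-partition result contains three polygons instead of two, because the chain is cut at the partition boundary. The second pass restores the two components.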