geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

compute takes ages to produce the result. #314

Open emaildipen opened 3 hours ago

emaildipen commented 3 hours ago

I have a Dask GeoDataFrame, from which I extracted the geometry and performed infill using Shapely. I used geometry.interiors to set an area threshold and fill the holes. After that, I created a new geometry DataFrame. However, I don’t understand why it takes so long when I try to convert the Dask GeoSeries into a GeoSeries. Whenever I use the .compute() command, it takes ages—more than 12 hours. I thought something might be wrong with my approach.

martinfleis commented 3 hours ago

Please post the code you have used, not only its description.

emaildipen commented 2 hours ago
from shapely.geometry import Polygon
from shapely.ops import unary_union

def fill_holes(geometry, min_hole_size):
    """
    Fill holes in a geometry (Polygon or MultiPolygon) if they are smaller than min_hole_size.
    """
    if geometry.geom_type == 'Polygon':
        if geometry.interiors:
            new_interiors = [interior for interior in geometry.interiors if Polygon(interior).area >= min_hole_size]
            return Polygon(geometry.exterior, new_interiors)
        else:
            return geometry
    elif geometry.geom_type == 'MultiPolygon':
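        # Iterate the parts through .geoms; Shapely 2.x multi-part geometries are not directly iterable.
        return unary_union([fill_holes(poly, min_hole_size) for poly in geometry.geoms])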
    else:
        return geometry

# Apply fill_holes to each partition in parallel; `part` is a single partition of ddf.
filled = ddf.map_partitions(lambda part: part.geometry.apply(lambda geom: fill_holes(geom, min_hole_size)))

filled_ser = filled.compute()
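
To make the area threshold concrete, here is a minimal pure-Python sketch of the same filter that fill_holes applies to the interiors. It is a stand-in for illustration, not Shapely's API: ring_area, keep_large_holes and the two sample holes are hypothetical names and data, and rings are assumed to be plain lists of (x, y) tuples.

```python
# Pure-Python stand-in for the hole-size filter; rings are lists of (x, y) tuples.
def ring_area(ring):
    """Unsigned area of a closed ring, computed with the shoelace formula."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first vertex.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def keep_large_holes(interiors, min_hole_size):
    """Return only the holes whose area is at least min_hole_size."""
    return [ring for ring in interiors if ring_area(ring) >= min_hole_size]


# Hypothetical data: a 1 x 1 hole (area 1) and a 3 x 3 hole (area 9).
small_hole = [(1, 1), (2, 1), (2, 2), (1, 2)]
large_hole = [(4, 4), (7, 4), (7, 7), (4, 7)]

kept = keep_large_holes([small_hole, large_hole], min_hole_size=5)
print(len(kept))
```

With a threshold of 5, the smaller hole is dropped (filled) and the larger one is kept, which mirrors the `Polygon(interior).area >= min_hole_size` test above.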