geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
488 stars 45 forks source link

Boolean indexing with Dask object causes conversion of Dask-Geopandas object to Dask object #224

Open gcaria opened 1 year ago

gcaria commented 1 year ago
import geopandas
import dask_geopandas
from shapely.geometry import Point, box

xs = ys = range(5)
geom = [Point(x,y) for (x,y) in zip(xs, ys)]
gser = gpd.GeoSeries(geom, crs=4326)
dgser = dask_geopandas.from_geopandas(gser, npartitions=4)

poly = box(1,1,3,3)

type(gser[gser.within(poly)])
# <class 'geopandas.geoseries.GeoSeries'>

type(gser[dgser.within(poly)])
# <class 'geopandas.geoseries.GeoSeries'>

type(dgser[dgser.within(poly)])
# <class 'dask.dataframe.core.Series'>

type(dgser[dgser.within(poly).compute()])
# <class 'dask_geopandas.core.GeoSeries'>

The title might not be accurate since it is mostly my guess at what's going on here, but I would expect to still have a dask-geopandas object after indexing.

I'm using version 0.2.0 of dask-geopandas.

alejohz commented 1 year ago

Hey i did some exploring and found out some interesting stuff.

import geopandas as gpd
import numpy as np
import dask_geopandas
from shapely.geometry import box

poly = box(1,1,3,3)

N = 10_000_000

points = gpd.GeoDataFrame(geometry=gpd.points_from_xy(np.random.randn(N),np.random.randn(N)))
dpoints =  dask_geopandas.from_geopandas(points, npartitions=16)
type(dpoints.within(poly))
# dask.dataframe.core.Series
type(dpoints[dpoints.within(poly).compute()])
# dask_geopandas.core.GeoDataFrame

spoints = gpd.GeoSeries(gpd.points_from_xy(np.random.randn(N),np.random.randn(N)), crs=4326)
gsdpoints =  dask_geopandas.from_geopandas(spoints, npartitions=16)
type(gsdpoints.within(poly))
# dask.dataframe.core.Series
type(gsdpoints[dpoints.within(poly).compute()])
# dask.dataframe.core.Series

This only occurs when dealing with GeoSeries, as GeoDataFrame is working as expected. I found out something else too and it is that GeoSeries gets transformed back to dask.series even without boolean indexing but with some operation done on the series.

type(gsdpoints.apply(lambda x: x, meta=(None, 'geometry')))
# dask.dataframe.core.Series

I have tried to find something inside the _Frame super class, with no results. Maybe here we need someone from the core team to check it out. I'm using dask_geopandas==0.2.0

gcaria commented 1 year ago

I agree. I wonder if the fix is as easy as what's done in geopandas to avoid falling back to pandas objects (see here)

def _wrapped_pandas_method(self, mtd, *args, **kwargs):
    """Wrap a generic pandas method to ensure it returns a GeoSeries"""
    val = getattr(super(), mtd)(*args, **kwargs)
    if type(val) == Series:
        val.__class__ = GeoSeries
        val.crs = self.crs
    return val

def __getitem__(self, key):
    return self._wrapped_pandas_method("__getitem__", key)
alejohz commented 1 year ago

This makes sense, Geopandas has no wrapping method for GeoDataFrames, but it does, as you show, for GeoSeries.

I would be willing to work on a PR that addresses this problem in the coming days.