geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
486 stars 45 forks source link

`ValueError: 'left_df' should be GeoDataFrame, got <class 'tuple'>` using sjoin after `spatial_shuffle` #303

Closed timdhu closed 1 week ago

timdhu commented 1 month ago

Hello,

I'm getting an error when I try to use sjoin after using spatial_shuffle.

Here are the packages in my environment:

Package            Version
------------------ -----------
certifi            2024.7.4
charset-normalizer 3.3.2
click              8.1.7
cloudpickle        3.0.0
dask               2024.7.0
dask-expr          1.1.7
dask-geopandas     0.4.1
distributed        2024.7.0
fsspec             2024.6.1
geodatasets        2024.7.0
geopandas          1.0.1
idna               3.7
importlib_metadata 8.0.0
Jinja2             3.1.4
locket             1.0.0
MarkupSafe         2.1.5
msgpack            1.0.8
numpy              2.0.0
packaging          24.1
pandas             2.2.2
partd              1.4.2
pip                24.0
platformdirs       4.2.2
pooch              1.8.2
psutil             6.0.0
pyarrow            17.0.0
pyogrio            0.9.0
pyproj             3.6.1
python-dateutil    2.9.0.post0
pytz               2024.1
PyYAML             6.0.1
requests           2.32.3
setuptools         69.2.0
shapely            2.0.5
six                1.16.0
sortedcontainers   2.4.0
tblib              3.0.0
toolz              0.12.1
tornado            6.4.1
tzdata             2024.1
urllib3            2.2.2
zict               3.0.0
zipp               3.19.2

(I ran pip install dask-geopandas and pip install geodatasets in a fresh venv)

Then I run the following code:

import geopandas
import geodatasets
import dask_geopandas

colombia = geopandas.read_file(geodatasets.get_path('geoda.malaria'))
colombia = dask_geopandas.from_geopandas(colombia, npartitions=4)
colombia = colombia.spatial_shuffle()
colombia.sjoin(colombia).compute()

and get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../env/lib/python3.11/site-packages/dask_expr/_collection.py", line 476, in compute
    return DaskMethodsMixin.compute(out, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../env/lib/python3.11/site-packages/dask/base.py", line 376, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../env/lib/python3.11/site-packages/dask/base.py", line 662, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../env/lib/python3.11/site-packages/geopandas/tools/sjoin.py", line 114, in sjoin
    _basic_checks(left_df, right_df, how, lsuffix, rsuffix, on_attribute=on_attribute),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../env/lib/python3.11/site-packages/geopandas/tools/sjoin.py", line 164, in _basic_checks
    raise ValueError(
ValueError: 'left_df' should be GeoDataFrame, got <class 'tuple'>

If I remove the spatial_shuffle then the code runs as expected.

I've dug around a bit to try and understand what's going on. If I disable query planning from dask before I import dask-geopandas then the code runs:

from dask import config
config.get("dataframe")["query-planning"] = False

It looks like something in the condition in line 117 of sjoin.py is causing an issue

https://github.com/geopandas/dask-geopandas/blob/d84e29902b1ec43522c397f8086eebf1ec90182d/dask_geopandas/sjoin.py#L117

TomAugspurger commented 1 month ago

Thanks for the report. Do you happen to have a minimal reproducible example you can share?

timdhu commented 1 month ago

Sure. I created a new environment

python3.11 -m venv env 

Then activated it

source env/bin/activate

Installed dask-geopandas and any dependencies

pip install dask-geopandas

Then I run the following code in the environment:

from dask_geopandas import from_geopandas
from geopandas import GeoDataFrame
from shapely import box 

shape = box(-74.5, -74.0, 4.5, 5.0)
shape = GeoDataFrame(geometry=[shape])
shape = from_geopandas(shape, npartitions=1)
shape = shape.spatial_shuffle()
shape.sjoin(shape).compute()

which produces the error I mentioned above. I'm running this on an Apple M1X, but I get the same issue in a linux docker container that is built remotely.

martinfleis commented 1 month ago

I can reproduce it. The tuple in question is ('set_geometry-49fab61a205e66c3ff77166c562d931f', 0).

hvasbath commented 1 month ago

Having the same issue here +1! Thanks for looking into it!

hvasbath commented 1 week ago

I would like to contribute in fixing this- can anyone guide me a little? I already have a running failing test. Much appreciated!

TomAugspurger commented 1 week ago

I was hoping to look into this sometime this week too. I think two paths forward are to

  1. Try to create a dask / dask-expr only reproducer of the error (by building a similar HighLevelGraph as dask-geopandas does). My initial attempt at this failed to reproduce the error. Then dig into dask-expr to see where the key (name, partition_number) isn't being properly converted to a data value.
  2. See whether dask-expr has a better way to build this task graph

I'm not sure which of these is more promising.

TomAugspurger commented 1 week ago

https://github.com/dask/dask-expr/issues/1129 has the reproducer. The key was a .shuffle() call (without that .shuffle or .spatial_shuffle, it doesn't reproduce the error).