geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
503 stars 44 forks source link

ENH: expose additional kwargs in explode method to mirror GeoPandas API #246

Closed ebkurtz closed 6 months ago

ebkurtz commented 1 year ago

Expose index_parts, column, and ignore_index parameters in the .explode() method to mirror GeoPandas API. This allows supressing FutureWarning for index_parts, see #245

ebkurtz commented 1 year ago

Failing on GeoSeries since the Column argument is only valid for DataFrame.

TomAugspurger commented 1 year ago

Can you define it on Series and DataFrame separately like in Dask?

TomAugspurger commented 1 year ago

The CI failure looks unrelated. I'll take a look at that later.

It might be worth adding a small unit test, just to verify that things are passing through correctly (and to make sure they don't break in the future).

ebkurtz commented 1 year ago

Thanks for the tip on defining them separately. I'll look at creating a unit test.

ebkurtz commented 1 year ago

Would I add that as a new function in test_core.py are as additional code in the existing functions? I haven't done much with unit tests.

TomAugspurger commented 1 year ago

Yes, a new function. You might be able to adapt a geopandas test like https://github.com/geopandas/geopandas/blob/dc695ee93113af5d1a93a5740ecd792c2b9d19b3/geopandas/tests/test_geom_methods.py#L1003.

ebkurtz commented 1 year ago

The unit tests for the ignore_index argument are failing. With ignore_index set to True, the index is reset for each map partition so the resulting GeoDataFrame index has a non-unique index that doesn't match the vanilla GeoPandas call:

e.g. with

def test_explode_geoseries_ignore_index():
    s = geopandas.GeoSeries(
        [MultiPoint([(0, 0), (1, 1)]), MultiPoint([(2, 2), (3, 3), (4, 4)])]
    )
    # Change index from range index to test ignore index argument
    s.index = [3, 4]
    original = s.explode(ignore_index=True)
    dask_s = dask_geopandas.from_geopandas(s, npartitions=2)
    daskified = dask_s.explode(ignore_index=True)
    assert isinstance(daskified, dask_geopandas.GeoSeries)
    assert all(original == daskified.compute())

returns error ValueError: Can only compare identically-labeled Series objects

self = 0    POINT (0.00000 0.00000)
1    POINT (1.00000 1.00000)
2    POINT (2.00000 2.00000)
3    POINT (3.00000 3.00000)
4    POINT (4.00000 4.00000)
dtype: geometry

other = 0    POINT (0.00000 0.00000)
1    POINT (1.00000 1.00000)
0    POINT (2.00000 2.00000)
1    POINT (3.00000 3.00000)
2    POINT (4.00000 4.00000)

My inclination is to reset the index after calling compute on the Dask geoseries:

 assert all(original == daskified.compute().reset_index(drop=True))

but I'm not familiar enough with unit testing and dask to know if that's an appropriate way to handle it.

ebkurtz commented 1 year ago

Getting a lot of test failures after refreshing my fork. Looks like something with pyarrow strings?

jorisvandenbossche commented 6 months ago

Getting a lot of test failures after refreshing my fork. Looks like something with pyarrow strings?

Apologies for the slow follow-up. Indeed those failures were cased by something else, which has been fixed in the meantime.

I also added the equivalent change to the new dask-expr based implementation.

jorisvandenbossche commented 6 months ago

The unit tests for the ignore_index argument are failing.

Your fix to create the expected result manually with the different index was still failing for the older geopandas or dask version. But so just added a skip for those older versions, as I don't think it is worth complicating the test for that.

jorisvandenbossche commented 6 months ago

@ebkurtz Thanks again for the PR and apologies it took so long to get merged!