dask / dask-expr


Latest dask errors when doing named aggregation? #1008

Closed aimran-adroll closed 6 months ago

aimran-adroll commented 6 months ago

Describe the issue:

I recently moved from dask==2023.12.0 to dask==2024.4.0. I noticed that I can no longer do .agg(new_name=func) on a groupby. Is this an expected regression?

Minimal Complete Verifiable Example:

import dask
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame(
    {
        "A": ["1", "2", "3", "1", "2", "3", "1", "2", "4"],
        "B": [-0.776, -0.4, -0.873, 0.054, 1.419, -0.948, -0.967, -1.714, -0.666],
        "C": "foo",
    }
).astype(
    {
        "A": "string",
        "B": "float64",
        "C": "string",
    }
)

ddf = dd.from_pandas(pdf, npartitions=4)

## dask==2023.12.0 
ddf.groupby("A")["B"].agg(l=max).reset_index().compute()
   A      l
0  1  0.054
1  2  1.419
2  3 -0.873
3  4 -0.666

## dask==2024.4.0
ddf.groupby("A")["B"].agg(l=max).reset_index().compute()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[3], line 1
----> 1 ddf.groupby("A")["B"].agg(l=max).reset_index().compute()

File /private/tmp/goo/.venv/lib/python3.12/site-packages/dask_expr/_groupby.py:1909, in GroupBy.agg(self, *args, **kwargs)
   1908 def agg(self, *args, **kwargs):
-> 1909     return self.aggregate(*args, **kwargs)

File /private/tmp/goo/.venv/lib/python3.12/site-packages/dask_expr/_groupby.py:1888, in GroupBy.aggregate(self, arg, split_every, split_out, shuffle_method, **kwargs)
   1883 @_aggregate_docstring(based_on="pd.core.groupby.DataFrameGroupBy.agg")
   1884 def aggregate(
   1885     self, arg=None, split_every=8, split_out=None, shuffle_method=None, **kwargs
   1886 ):
   1887     if arg is None:
-> 1888         raise NotImplementedError("arg=None not supported")
   1890     if arg == "size":
   1891         return self.size()

NotImplementedError: arg=None not supported

# Only the unnamed form works :-/
# which forces a separate rename after every aggregation
ddf.groupby("A")["B"].agg(max).reset_index().compute()
   A      B
0  1  0.054
1  2  1.419
2  3 -0.873
3  4 -0.666

Anything else we need to know?:

I tried setting dask.config.set({'dataframe.query-planning': False}), but that does not help.

Environment:

Python implementation: CPython
Python version       : 3.10.13
IPython version      : 8.18.1

Compiler    : Clang 16.0.3 
OS          : Darwin
Release     : 23.3.0
Machine     : arm64
Processor   : arm
CPU cores   : 8
Architecture: 64bit

pandas: 2.1.4
dask  : 2024.4.0 OR 2023.12.0

phofl commented 6 months ago

Sorry about that, this fell through the cracks somehow. Will have a fix out soonish.

phofl commented 6 months ago

1.0.9 has the fix and should make your examples work (I accidentally skipped 1.0.8...)
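
To check whether an environment already has that release, something like the sketch below may help. It uses only the standard library; the helper names parse_release and has_named_agg_fix are hypothetical, and the distribution name "dask-expr" is assumed from this repository's name. The parser reads only the leading numeric components of each version segment, so it is a rough comparison and not a full version-specifier implementation.

```python
from importlib.metadata import PackageNotFoundError, version

# Release named in this thread as containing the fix.
FIXED_IN = (1, 0, 9)


def parse_release(version_string):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_named_agg_fix(version_string):
    """Hypothetical helper: True if the version is at or after FIXED_IN."""
    return parse_release(version_string) >= FIXED_IN


try:
    installed = version("dask-expr")  # assumed distribution name
except PackageNotFoundError:
    installed = None

if installed is None:
    print("dask-expr is not installed in this environment")
else:
    print(installed, "includes the fix:", has_named_agg_fix(installed))
```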

aimran-adroll commented 6 months ago

Whoa!!! Thanks again Patrick!

Confirming that it fixed my issue.