dask / dask

Parallel computing with task scheduling
https://dask.org
BSD 3-Clause "New" or "Revised" License

Groupby Quantile #9824

Open patcao opened 1 year ago

patcao commented 1 year ago

Similar to #8658, I'd like to be able to use quantile on groupby Series and DataFrame objects. This works in pandas but raises an AttributeError in Dask, as the example below shows.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(
    {"a": [0, 1, 0, 1, 1, 0], "b": range(6)}
)

ddf = dd.from_pandas(pdf, 2)

# Pandas Quantile Func
print(pdf.groupby('a').b.quantile(0.2))

# Dask Quantile Func
ddf.groupby('a').b.quantile(0.2)

a
0    0.8
1    1.8
Name: b, dtype: float64
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-27-7ab35545b918> in <module>
     13 
     14 # Dask Quantile Func
---> 15 ddf.groupby('a').b.quantile(0.2)

AttributeError: 'SeriesGroupBy' object has no attribute 'quantile'

dask                          2022.2.1
pandas                        1.3.5

jrbourbeau commented 1 year ago

Thanks for the issue, @patcao. Adding groupby.quantile definitely seems in scope. We recently added groupby.median (xref https://github.com/dask/dask/pull/9516), so I'd expect us to be able to reuse some of that logic for quantile.
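To make the "reuse the median logic" idea concrete, here is a minimal sketch of one way an exact grouped quantile could work across partitions. It assumes a shuffle-based approach: rows are first routed so that every group lives entirely in one partition, and plain pandas then computes the quantile per partition. This is an illustration only, not the code from the linked PR. It uses pandas alone to simulate partitions, and `grouped_quantile_by_shuffle` is a hypothetical helper name, not a Dask API.

```python
import pandas as pd


def grouped_quantile_by_shuffle(partitions, by, column, q):
    """Hypothetical helper: exact per-group quantile over a list of pandas partitions."""
    npartitions = len(partitions)

    # Step 1 ("shuffle"): send each row to a bucket derived from the hash of its
    # group key, so all rows belonging to one group land in the same bucket.
    buckets = [[] for _ in range(npartitions)]
    for part in partitions:
        bucket_ids = pd.util.hash_pandas_object(part[by], index=False) % npartitions
        for bucket_id, chunk in part.groupby(bucket_ids.to_numpy()):
            buckets[int(bucket_id)].append(chunk)

    # Step 2: every non-empty bucket now holds whole groups, so the ordinary
    # pandas groupby quantile is exact within each bucket.
    results = [
        pd.concat(chunks).groupby(by)[column].quantile(q)
        for chunks in buckets
        if chunks
    ]
    return pd.concat(results).sort_index()


# Same data as in the issue, split into two simulated partitions.
pdf = pd.DataFrame({"a": [0, 1, 0, 1, 1, 0], "b": range(6)})
partitions = [pdf.iloc[:3], pdf.iloc[3:]]

result = grouped_quantile_by_shuffle(partitions, "a", "b", 0.2)
expected = pdf.groupby("a").b.quantile(0.2)
print(result)
```

The shuffle is what makes the result exact: a quantile cannot be combined from per-partition quantiles the way a sum or count can, so each group has to be gathered in one place first. The trade-off is that a full shuffle is expensive on large data.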