Open dbalabka opened 2 months ago
Thanks for the report.
I think resample isn't implemented for groupby in Dask. The error message could certainly be better; adding support for it would also be fine.
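For reference, the call that fails is roughly the following (a minimal sketch; the exact error text depends on the Dask version, but it surfaces as a generic attribute/column lookup failure rather than a clear "not implemented" message):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'date': pd.to_datetime(['2023-01-01', '2023-01-03', '2023-01-01', '2023-01-03']),
    'metric': [1, 1, 1, 1],
})
ddf = dd.from_pandas(pdf, npartitions=1)

# Dask's groupby does not implement resample, so this fails with an
# unhelpful error instead of a clear NotImplementedError.
ddf.groupby('id').resample('D', on='date').sum()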
@phofl, thanks for the quick reply. Is there any workaround to run an arbitrary pandas function on each group, similar to map_partitions? Because the groups are distributed over the cluster, I need something smarter: resample needs all rows of a group in one partition to fill the gaps.
Do you want the whole group in a single partition? If yes, you can use groupby.apply / groupby.transform
@phofl, it works, thanks:
import pandas as pd
import dask.dataframe as dd

data = {
    'id': [1, 1, 1, 2, 2, 2],
    'date': pd.to_datetime(['2023-01-01', '2023-01-04', '2023-01-05', '2023-01-01', '2023-01-04', '2023-01-05']),
    'metric': [1, 1, 1, 1, 1, 1],
}
df = dd.from_pandas(
    pd.DataFrame(data).astype({'id': 'int64[pyarrow]', 'metric': 'int64[pyarrow]', 'date': 'timestamp[ns][pyarrow]'})
)
print(
    df
    # groupby.apply moves each group into a single partition, so the pandas
    # resample inside the lambda sees all rows of the group at once
    .groupby(by=['id'])
    # meta tells Dask the output schema, which it cannot infer from a lambda
    .apply(lambda x: x.resample("D", on="date").sum(), include_groups=False, meta={"metric": "int64[pyarrow]"})
    .reset_index()
)
FYI, for those who come across this ticket: it was a bit unexpected for me that Dask keeps each group in a single partition, which means you can run into OOM if a group is too large, so keep that in mind while grouping. Otherwise, you need something like this: https://stackoverflow.com/a/55881591/983577
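A cheap sanity check before relying on that behaviour (a minimal sketch, assuming the df from the example above; what threshold you compare against depends on your workers' memory):

# Rows per group; groupby.apply shuffles each whole group into a single
# partition, so the largest group is the one that matters.
rows_per_group = df.groupby('id').size().compute()
print(rows_per_group.max())

# Rough estimate of the in-memory size of the largest group.
bytes_per_row = df.memory_usage(deep=True).sum().compute() / len(df)
print(rows_per_group.max() * bytes_per_row)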
Apply and transform are doing this specifically, there is no way around that fwiw
@phofl, I've spent a lot of time working around the missing resample method. I found another bug that prevents me from using apply on groups:
https://github.com/dask/dask/issues/11394
that's odd, thanks for digging these up, I'll try to take a look tomorrow
@phofl, it might be related to a bug in pandas that I spotted during this investigation: https://github.com/pandas-dev/pandas/issues/59823. Pandas produces an empty column name, which can affect Dask's logic.
Here is a workaround that only works in my case:
import pandas as pd
import dask.dataframe as dd

data = {
    'id': [1, 1, 1, 2, 2, 2],
    'date': pd.to_datetime(['2023-01-01', '2023-01-04', '2023-01-05', '2023-01-01', '2023-01-04', '2023-01-05']),
    'metric': [1, 1, 1, 1, 1, 1],
}
df = dd.from_pandas(
    pd.DataFrame(data).astype({'id': 'int64[pyarrow]', 'metric': 'int64[pyarrow]', 'date': 'timestamp[ns][pyarrow]'})
)
print(
    df
    # Partition by id as a replacement for groupby: after set_index, all rows
    # with the same id end up in the same partition
    .set_index('id')
    # Work around the pandas bug: https://github.com/pandas-dev/pandas/issues/59823
    .astype({'date': 'datetime64[ns]'})
    # Apply the required pandas function to each partition; the set_index above
    # guarantees that every partition contains all rows of its groups
    .map_partitions(
        lambda x: x.groupby('id').resample("D", on="date").sum().reset_index(),
        meta={'id': 'int64[pyarrow]', 'date': 'timestamp[ns][pyarrow]', 'metric': 'int64[pyarrow]'},
    )
    # Drop the now-unnecessary index
    .reset_index(drop=True)
    .compute()
)
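For what it's worth, the trade-off here is the same as with groupby.apply: set_index('id') shuffles the data so that all rows for a given id land in one partition, which is what makes the per-partition groupby/resample valid, but a single very large id can still exceed a worker's memory.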
Environment:
dask-expr==1.1.10