Open hendrikmakait opened 9 months ago
This is a bit harder to detect though, this is actually something where the cardinality is needed to make a more informed decision
This one turns out to be a bit trickier. There's an instance of a groupby with many unique values, as well as a join with a single value of the (partitioned) nations dataset (#1380). However, applying the trivial fixes (split_out=True
and broadcast=True
) does not solve the memory issue completely. I suspect that imbalanced partition sizes are also to blame (https://github.com/coiled/benchmarks/issues/1367#issuecomment-1936080012).
As it turns out, broadcast=True
does not work because of https://github.com/dask-contrib/dask-expr/pull/871.
The broadcast flag wasn't properly preserved when pushing filters down, this is probably why that looked weird for @hendrikmakait
Pr to fix is here: https://github.com/dask-contrib/dask-expr/pull/871
Have to rerun after that one is in
@phofl: This looks much better now, thanks! https://cloud.coiled.io/clusters/383307/information?tab=Metrics
This looks like another instance of the problem in #1376. We end up with a groupby-aggregate that leaves us with ~30M groups in a single partition.
Edit (Patrick): Query 13 has exactly the same issue