coiled / benchmarks

BSD 3-Clause "New" or "Revised" License
32 stars 17 forks source link

[TPC-H] Query 11 and 13 memory issue #1382

Open hendrikmakait opened 9 months ago

hendrikmakait commented 9 months ago

This looks like another instance of the problem in #1376. We end up with a groupby-aggregate that leaves us with ~30M groups in a single partition.

Edit (Patrick): Query 13 has exactly the same issue

phofl commented 9 months ago

This is a bit harder to detect though, this is actually something where the cardinality is needed to make a more informed decision

hendrikmakait commented 9 months ago

This one turns out to be a bit trickier. There's an instance of a groupby with many unique values, as well as a join with a single value of the (partitioned) nations dataset (#1380). However, applying the trivial fixes (split_out=True and broadcast=True) does not solve the memory issue completely. I suspect that imbalanced partition sizes are also to blame (https://github.com/coiled/benchmarks/issues/1367#issuecomment-1936080012).

hendrikmakait commented 9 months ago

As it turns out, broadcast=True does not work because of https://github.com/dask-contrib/dask-expr/pull/871.

phofl commented 9 months ago

The broadcast flag wasn't properly preserved when pushing filters down, this is probably why that looked weird for @hendrikmakait

Pr to fix is here: https://github.com/dask-contrib/dask-expr/pull/871

Have to rerun after that one is in

hendrikmakait commented 9 months ago

@phofl: This looks much better now, thanks! https://cloud.coiled.io/clusters/383307/information?tab=Metrics