coiled / benchmarks

BSD 3-Clause "New" or "Revised" License
32 stars 17 forks source link

Query 17 and 18 memory issues #1376

Open phofl opened 9 months ago

phofl commented 9 months ago

Both queries have a common pattern: They do a groupby aggregation over one of the inputs and then merge the result of this groupby aggregation with the input itself.

The groupby aggregation does not really reduce the data, the column that is grouped over has a lot of unique values, so we collect a lot of data on each worker for the tree reduction which causes the OOM issues.

The trivial fix we can do now to get some data is to set split_out=True in the groupby aggregation, short/medium term we want to identify these patterns in the optimiser and just set split_out=True automatically in those cases. We have to shuffle on the grouping column anyway, so we can just do this earlier to avoid the pattern we are seeing here.

cc @hendrikmakait