Both queries have a common pattern: They do a groupby aggregation over one of the inputs and then merge the result of this groupby aggregation with the input itself.
The groupby aggregation does not really reduce the data, the column that is grouped over has a lot of unique values, so we collect a lot of data on each worker for the tree reduction which causes the OOM issues.
The trivial fix we can do now to get some data is to set split_out=True in the groupby aggregation, short/medium term we want to identify these patterns in the optimiser and just set split_out=True automatically in those cases. We have to shuffle on the grouping column anyway, so we can just do this earlier to avoid the pattern we are seeing here.
Both queries have a common pattern: They do a groupby aggregation over one of the inputs and then merge the result of this groupby aggregation with the input itself.
The groupby aggregation does not really reduce the data, the column that is grouped over has a lot of unique values, so we collect a lot of data on each worker for the tree reduction which causes the OOM issues.
The trivial fix we can do now to get some data is to set
split_out=True
in the groupby aggregation, short/medium term we want to identify these patterns in the optimiser and just set split_out=True automatically in those cases. We have to shuffle on the grouping column anyway, so we can just do this earlier to avoid the pattern we are seeing here.cc @hendrikmakait