cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.9k stars 3.78k forks source link

exec: filtering aggregations #39241

Open jordanlewis opened 5 years ago

jordanlewis commented 5 years ago

Add filtering aggregator support to the vectorized engine. This is an extra filter step that only includes rows in the aggregation if they pass a filter.

    // If set, this column index specifies a boolean argument; rows for which
    // this value is not true don't contribute to this aggregation. This enables
    // the filter clause, e.g.:
    //   SELECT SUM(x) FILTER (WHERE y > 1), SUM(x) FILTER (WHERE y < 1) FROM t
    FilterColIdx *uint32 `protobuf:"varint,4,opt,name=filter_col_idx,json=filterColIdx" json:"filter_col_idx,omitempty"`

Jira issue: CRDB-5586

rohany commented 5 years ago

@jordanlewis whats your status on the case operator? I feel like the strategy for repeating the batch as done there to keep adding to a selection vector could be good here.

jordanlewis commented 5 years ago

Status is that it's kinda on hold til the FK stuff gets finished, and hoping to get some more reviews on it too. I saw your review (thanks!) but it's kinda complicated and I'd appreciate some more scrutiny :)

jordanlewis commented 5 years ago

I agree though that your proposed strategy could work. It actually might be as simple as planning a selection right before the aggregation step actually runs.

rohany commented 5 years ago

yeah, thats what I was thinking. if we just select out the rows that don't pass the filter then we might be good. No worries about my review -- i don't have much authority over the codebase yet.

yuzefovich commented 4 years ago

50721 adds the support for filtering hash aggregation, but filtering ordered aggregation seems to be a lot harder (I think it might require plumbing the filtering logic right into the "heart" of each aggregate function).

One edge case is what happens when the whole group is filtered out? We still need to let aggregate function populate the output for such group, otherwise there will be a misalignment between any_not_null and all other functions. I think if we don't have any FILTER clauses, handling DISTINCT is relatively easy, but FILTER clause makes it extremely hard to not plumb the logic into each aggregate function.

It's been a while since I struggled with that (but I did spend like a day trying to get it working in all cases), and here is a snippet of what I had in the end. I can't easily describe why I came to realization that there is no way around plumbing the FILTERing logic into the aggregate function because I forgot and lost all the context, but at the time I was certain of it.

In the code snippet I was trying to modify the groups vector so that FILTERing would work in all cases (another case that comes to mind: how do you handle the scenario when the "head" of the group is filtered out? (the "head" is the first tuple in the group for which groups[i]==true)).

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!