apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Multiple Grouping Sets and Family Support #8060

Open atris opened 2 years ago

atris commented 2 years ago

This issue tracks the development of the feature that brings multiple grouping sets to Pinot.

What is a Grouping Set?

Consider the following query:

SELECT
    brand,
    segment,
    SUM (quantity)
FROM
    sales
GROUP BY
    brand,
    segment;

(brand, segment) represents a single grouping set.

A query using multiple grouping sets has the general form:

SELECT
    c1,
    c2,
    aggregate_function(c3)
FROM
    table_name
GROUP BY
    GROUPING SETS (
        (c1, c2),
        (c1),
        (c2),
        ()
);

Applied to the sales table above, GROUPING SETS ((brand, segment), (brand), (segment), ()) is equivalent to the following UNION ALL query:

SELECT
    brand,
    segment,
    SUM (quantity)
FROM
    sales
GROUP BY
    brand,
    segment

UNION ALL

SELECT
    brand,
    NULL,
    SUM (quantity)
FROM
    sales
GROUP BY
    brand

UNION ALL

SELECT
    NULL,
    segment,
    SUM (quantity)
FROM
    sales
GROUP BY
    segment

UNION ALL

SELECT
    NULL,
    NULL,
    SUM (quantity)
FROM
    sales;

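GROUPING SETS also allows the empty set (), which aggregates over all rows with no GROUP BY clause. It corresponds to the last branch of the UNION ALL query above: SELECT NULL, NULL, SUM (quantity) FROM sales;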
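To make the UNION ALL equivalence concrete, here is a minimal Python sketch that evaluates each grouping set independently and concatenates the results. The sample rows and the helper names (`group_by_sum`, `grouping_sets_sum`) are assumptions made for illustration; this is not Pinot code, and it deliberately rescans the input once per grouping set, just as the UNION ALL formulation does.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the sales table:
# (brand, segment, quantity).
SALES = [
    ("ABC", "Premium", 100),
    ("ABC", "Basic", 200),
    ("XYZ", "Premium", 100),
    ("XYZ", "Basic", 300),
]
COLUMNS = ("brand", "segment")


def group_by_sum(rows, grouping_set):
    """Sum quantity for one grouping set.

    Columns that are not part of the grouping set are reported as None,
    which plays the role of NULL in the UNION ALL query.
    """
    totals = defaultdict(int)
    for brand, segment, quantity in rows:
        values = {"brand": brand, "segment": segment}
        key = tuple(values[c] if c in grouping_set else None for c in COLUMNS)
        totals[key] += quantity
    return dict(totals)


def grouping_sets_sum(rows, grouping_sets):
    """Concatenate the per-set results, one scan of `rows` per grouping set."""
    result = []
    for grouping_set in grouping_sets:
        for key, total in group_by_sum(rows, grouping_set).items():
            result.append(key + (total,))
    return result


rows_out = grouping_sets_sum(
    SALES, [("brand", "segment"), ("brand",), ("segment",), ()]
)
for row in rows_out:
    print(row)
```

Each output row has the shape (brand, segment, total), with None wherever the corresponding UNION ALL branch selects NULL.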

CUBE and ROLLUP

CUBE(c1, c2, c3) generates:

(c1, c2, c3)
(c1, c2)
(c2, c3)
(c1, c3)
(c1)
(c2)
(c3)
()

ROLLUP(c1, c2, c3) generates:

(c1, c2, c3)
(c1, c2)
(c1)
()

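ROLLUP generates only the hierarchical prefixes of the column list (n + 1 grouping sets for n columns), whereas CUBE generates every subset of the columns (2^n grouping sets).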
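The two expansions listed above can be sketched in a few lines of Python. The function names `rollup` and `cube` are hypothetical helpers chosen for illustration, not part of Pinot or Calcite; the sketch only shows which grouping sets each construct stands for.

```python
from itertools import combinations


def rollup(columns):
    """Return the prefixes of `columns`, longest first, ending with ()."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube(columns):
    """Return every subset of `columns`, largest first, ending with ()."""
    sets = []
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets


rollup_sets = rollup(["c1", "c2", "c3"])
cube_sets = cube(["c1", "c2", "c3"])
print(rollup_sets)
print(cube_sets)
```

The order in which `cube` emits the two-column sets differs from the list above, but the eight grouping sets are the same, and every ROLLUP grouping set also appears in the CUBE expansion.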

Design

A design document will be published soon. The design theme is to reuse the swim lane concept introduced in the FILTER PR. An important design goal is to avoid rescans of the data.
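As a rough illustration of what avoiding rescans could look like, the Python sketch below reads the input exactly once and updates one accumulator per grouping set for every row. This is only an assumption about the general idea: the swim lane mechanism from the FILTER PR is not described in this issue, and the function name `aggregate_single_scan`, its parameters, and the sample data are all hypothetical.

```python
from collections import defaultdict


def aggregate_single_scan(rows, columns, grouping_sets, measure):
    """Sum `measure` for every grouping set while reading `rows` once.

    `rows` is an iterable of dicts and is consumed a single time.
    Returns {grouping_set: {key_tuple: total}}.
    """
    accumulators = {gs: defaultdict(int) for gs in grouping_sets}
    for row in rows:  # the only pass over the data
        for gs, totals in accumulators.items():
            # Columns outside the grouping set are keyed as None (NULL).
            key = tuple(row[c] if c in gs else None for c in columns)
            totals[key] += row[measure]
    return {gs: dict(totals) for gs, totals in accumulators.items()}


# Hypothetical sample data; a generator cannot be scanned a second time.
sample = (
    {"brand": b, "segment": s, "quantity": q}
    for b, s, q in [
        ("ABC", "Premium", 100),
        ("ABC", "Basic", 200),
        ("XYZ", "Basic", 300),
    ]
)
result = aggregate_single_scan(
    sample,
    columns=("brand", "segment"),
    grouping_sets=[("brand", "segment"), ("brand",), ()],
    measure="quantity",
)
print(result[("brand",)])
print(result[()])
```

The trade-off in this sketch is memory for scan cost: all grouping sets keep their partial results live at the same time, in exchange for touching each row only once.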

Implementation Plan

The implementation plan is to support ROLLUP first, then CUBE, and then generic GROUPING SETS.

atris commented 2 years ago

I have started working on this and aim to publish a design document by the end of this week.

walterddr commented 2 years ago

Hi @atris. This looks like a super powerful feature. Thanks for working on this.

At a high level, would you please briefly describe the scope of this design? Specifically:

  1. Is it more about syntactic support (e.g. I think the Calcite parser natively supports the 3 concepts, see: https://calcite.apache.org/docs/reference.html#groupItems) or more about how to design the aggregation operators to carry out the computation?
  2. For the equivalent UNION ALL syntax, I don't think you can achieve that with one simple scatter-gather execution. Were you planning to support this by issuing multiple brokerRequests with different group-by keys?

Thanks in advance.

siddharthteotia commented 2 years ago

Not sure if this is similar or overlapping, but linking the issue here for reference: https://github.com/apache/pinot/issues/8040

Looking forward to the design doc.