Closed lukehan closed 9 years ago
SQL has some nice extensions to its GROUP BY syntax that allow you to express grouping sets concisely. Some examples:
GROUP BY x, y, z
defines a single grouping setGROUP BY GROUPING SETS (x, y), (x, z)
defines 2 grouping sets (x, y) and (x, z)GROUP BY ROLLUP (x, y, z)
defines 4 grouping sets (x, y, z), (x, y), (x) and ()GROUP BY CUBE (x, y, z)
defines 8 grouping sets (x, y, z), (x, y), (x, z), (x), (y, z), (y), (z), ()You can nest them, and if you specify multiple elements they are combined using cartesian product. In particular, ROLLUP with cartesian product allows you to write GROUP BY ROLLUP(year, quarter, month, day), ROLLUP(nation, state, zipcode)
as shorthand for 20 grouping sets based on hierarchies.
Would it be useful to define whitelists using the same algebra as SQL uses? After all, cuboids are basically materialized GROUP BY CUBE queries.
hi @julianhyde
It seems to me "GROUP BY GROUPING SETS" corresponds to the whitelist idea, and "GROUP BY ROLLUP" together with "GROUP BY CUBE" correspond to the "hierarchy" and "aggregation group" in existing Kylin cube definition.
It's tempting if we could use SQL algrebra to express the cube definition. But a.f.a.i.k, the SQL grouping is only for query use in SQL, how can we leverage it for guiding cuboid selection? The cube building step happens before any query being submitted.
Ask yourself how much use Kylin makes of hierarchies. For years I thought they were a fundamental part of OLAP, and now I think they are just a hint for attributes that tend to be used together (e.g. if year is used, month is more likely to be used; if month is used, year is definitely also used).
If they are just hints, I wouldn't make them a fundamental part of the data model. But they are useful, so white-lists seems a good place for them.
SQL is not just for queries. It is a powerful language for defining data structures, because it maps onto relational algebra. Since the definition of a cuboid can be expressed equivalently as a SQL materialized view, why not represent it as a SQL materialized view?
The GROUP BY syntax is a very good candidate to define the whitelist of cuboids. It's concise and well known to many people, which is a big advantage.
One note is the current aggregation group and hierarchy is similar but not exactly the same as the RULLUP and CUBE. If both coexist, they may confuse users and complicate the system.
Ideally we keep only one syntax or a little bit of both world. Need to think more about it.
Basically, you want to express the conjunction (AND) and disjunction (OR) relationship between different dimension groups.
Could you use the "conjunction group" and "disjunction group" to replace the "aggregation group" and "white list"?
Moreover, you can use "conjunction group" and "disjunction group" as internal data structure, and use SQL GROUP BY syntax as external expression. Then you can translate the external expression to internal data structure.
If a cube has N dimensions then there are 2^N possible cuboids -- the power set. Yes, it is a disjunction of conjunctions; Calcite uses a sorted set of bit-sets, which expresses the same thing compactly. Whether you use GROUPING SETS, CUBE, ROLLUP, or something else based on set theory, it will all boil down to a subset of the power set.
Proposal from Hongbin about Cuboid White List:
Logically, a cube contains cuboids representing all combinations of dimensions. Apparently, a naive cube building strategy that materializes all cuboids will easily meet curse-of-dimension problems. Currently Kylin leverages a strategy called "aggregation groups" to reduce the number of cuboids need being materialized.
However, if the query pattern is simple and fixed, the "aggregation group" strategy is still not efficient enough. For example, suppose there're five dimensions, namely A,B,C,D and E. The data modeler is sure that only combinations (A,B,C), (D,E), (A,E) will be queried, so he’ll use the aggregation group tool to optimize his cube definition. However, whatever aggregation group he chooses, lots of useless combinations would be materialized.
With a new strategy called "cuboid whitelist", data modelers can guide Kylin to only materialize the cuboids he's interested in. Depending on the whitelist, Kylin will materialize the minimal set of cuboids to cover each cuboid in the whitelist. To support this, the following functionalities should be added:
Hongbin Ma