KylinOLAP / Kylin

This code base is retained for historical interest only, please visit Apache Incubator Repo for latest one
https://github.com/apache/incubator-kylin
Apache License 2.0
563 stars 225 forks source link

Kylin Cuboid Whitelist #263

Closed lukehan closed 9 years ago

lukehan commented 9 years ago

Proposal from Hongbin about Cuboid White List:

Logically, a cube contains cuboids representing all combinations of dimensions. Apparently, a naive cube building strategy that materializes all cuboids will easily meet curse-of-dimension problems. Currently Kylin leverages a strategy called "aggregation groups" to reduce the number of cuboids need being materialized.

However, if the query pattern is simple and fixed, the "aggregation group" strategy is still not efficient enough. For example, suppose there're five dimensions, namely A,B,C,D and E. The data modeler is sure that only combinations (A,B,C), (D,E), (A,E) will be queried, so he’ll use the aggregation group tool to optimize his cube definition. However, whatever aggregation group he chooses, lots of useless combinations would be materialized.

With a new strategy called "cuboid whitelist", data modelers can guide Kylin to only materialize the cuboids he's interested in. Depending on the whitelist, Kylin will materialize the minimal set of cuboids to cover each cuboid in the whitelist. To support this, the following functionalities should be added:

  1. Front-end/UI for specifying whitelist members, and persistent them to cube description.
  2. Enhanced job engine scheduler that will calculate a minimal spanning build tree based on the whitelist.
  3. (OPTIONAL) Enhanced job engine to support dynamic whitelist, trigger new builds for lately added whitelist members.

Hongbin Ma

julianhyde commented 9 years ago

SQL has some nice extensions to its GROUP BY syntax that allow you to express grouping sets concisely. Some examples:

You can nest them, and if you specify multiple elements they are combined using cartesian product. In particular, ROLLUP with cartesian product allows you to write GROUP BY ROLLUP(year, quarter, month, day), ROLLUP(nation, state, zipcode) as shorthand for 20 grouping sets based on hierarchies.

Would it be useful to define whitelists using the same algebra as SQL uses? After all, cuboids are basically materialized GROUP BY CUBE queries.

binmahone commented 9 years ago

hi @julianhyde

It seems to me "GROUP BY GROUPING SETS" corresponds to the whitelist idea, and "GROUP BY ROLLUP" together with "GROUP BY CUBE" correspond to the "hierarchy" and "aggregation group" in existing Kylin cube definition.

It's tempting if we could use SQL algrebra to express the cube definition. But a.f.a.i.k, the SQL grouping is only for query use in SQL, how can we leverage it for guiding cuboid selection? The cube building step happens before any query being submitted.

julianhyde commented 9 years ago

Ask yourself how much use Kylin makes of hierarchies. For years I thought they were a fundamental part of OLAP, and now I think they are just a hint for attributes that tend to be used together (e.g. if year is used, month is more likely to be used; if month is used, year is definitely also used).

If they are just hints, I wouldn't make them a fundamental part of the data model. But they are useful, so white-lists seems a good place for them.

SQL is not just for queries. It is a powerful language for defining data structures, because it maps onto relational algebra. Since the definition of a cuboid can be expressed equivalently as a SQL materialized view, why not represent it as a SQL materialized view?

liyang-gmt8 commented 9 years ago

The GROUP BY syntax is a very good candidate to define the whitelist of cuboids. It's concise and well known to many people, which is a big advantage.

One note is the current aggregation group and hierarchy is similar but not exactly the same as the RULLUP and CUBE. If both coexist, they may confuse users and complicate the system.

Ideally we keep only one syntax or a little bit of both world. Need to think more about it.

jiangxuchina commented 9 years ago

Basically, you want to express the conjunction (AND) and disjunction (OR) relationship between different dimension groups.

  1. aggregation group is conjunction of all groups. All groups must be join together.
  2. white list is the disjunction of all groups. One of groups should be ok.

Could you use the "conjunction group" and "disjunction group" to replace the "aggregation group" and "white list"?

Moreover, you can use "conjunction group" and "disjunction group" as internal data structure, and use SQL GROUP BY syntax as external expression. Then you can translate the external expression to internal data structure.

julianhyde commented 9 years ago

If a cube has N dimensions then there are 2^N possible cuboids -- the power set. Yes, it is a disjunction of conjunctions; Calcite uses a sorted set of bit-sets, which expresses the same thing compactly. Whether you use GROUPING SETS, CUBE, ROLLUP, or something else based on set theory, it will all boil down to a subset of the power set.