Query Optimization and Convert Logical Plan to Physical Plan

MingyuGuan commented 5 years ago

Already Done:

Read table metadata from table.json and construct the table.
Basic adapter of Husky: HuskyLogicalTableScan(with project and filter).
Query Optimization Rules:
- PushProjectIntoTableScanRule:
  - matches operand(HuskyLogicalCalc.class, operand(HuskyLogicalTableScan.class, none()));
  - drop calc if the transformed program merely returns its input and doesn't exist filter.
- PushFilterIntoTableScanRule:
  - matches operand(HuskyLogicalCalc.class, operand(HuskyLogicalTableScan.class, none()));
  - drop calc if the condition has been push down into table scan and there is no projection;
  - Note: condition should be converted into CNF for convenience of physical plan.

Example: Suppose we have a query as following:

select TRANS_ID,ITEM_ID,PRICE,SELLER_ID,SELLER_NAME 
from Parts 
where PRICE > 2.0 and ITEM_ID = 2 or SELLER_ID = 4

Normalized Logical Plan:

LogicalProject(TRANS_ID=[$0], ITEM_ID=[$1], PRICE=[$3], SELLER_ID=[$4], SELLER_NAME=[$5])
  LogicalFilter(condition=[OR(AND(>($3, 2.0), =($1, 2)), =($4, 4))])
    LogicalTableScan(table=[[SALES, Parts]])

Apply PushProjectIntoTableScanRule and PushFilterIntoTableScanRule:

HuskyLogicalTableScan(table=[[SALES, Parts]], 
fields=[TRANS_ID=[$0], ITEM_ID=[$1], PRICE=[$3], SELLER_ID=[$4], SELLER_NAME=[$5]], 
condition=[AND(OR(>($2, 2.0), =($3, 4)), OR(=($1, 2), =($3, 4)))])

Note: In fields, FiledName=[$num] where num is the order of that field in the whole table; In condition, operand($num, constant) where num is the order of that field in the projected fieltes. For example, SELLER_ID is 4th (start with 0) field in the table while it is the 3rd field in the projected fields.

To Do List:
[ ] Add remaining optimization rules like: HuskyLogicalJoin, HuskyLogicalSort, etc.
[ ] Think about how to convert logical plan into physical plan
1. How to connect each RelNode?
2. What information should be contained in the output JSON file for each RelNode?
Here is a proposed example:

{ "name": "HuskyLogicalCalc", "type": "Calc", "project": [ { "index": "0", "name": "TRANS_ID", "datatype": "int" }, { "index": "1", "name": "ITEM_ID", "datatype": "int" }, ... ], "condition":{ { "operator": "AND", "left":{ "operator": "OR", "left": "PRICE", "right": "2.0" }, "right": { ... } } "input": { "name": "HuskyLogicalTableScan", "type": "TableScan", ... } } }

lmatz commented 5 years ago

How to connect each RelNode?

Just my 2 cents. RelNodes (or operators) which can be executed in one "list_execute" in Husky may be grouped together as a task. Then each boundary of two tasks is the point where the data needs to be shuffled. Thus, a shuffle operator may be defined. During the convertion from logical plans to physical plans, this shuffle operator is inserted whenver needed.

TatianaJin commented 5 years ago

@lmatz @zzxx-husky @ydwu4 What do you think about the optimization rules as implemented in #4 ?

@lisy45 @Sleepy-Neko @WuZihao1 Please read the rule part in pr #4 as well.

TatianaJin commented 5 years ago

@lisy45 @Sleepy-Neko @WuZihao1 What are your ideas about the JSON structure for optimized plans? Also how about the interface in Husky? You are supposed to reply to #3!!!

lmatz commented 5 years ago

I feel each pull request of adding new rules, new features, etc. may all accompany with its tests. It helps the reviewer to go through the code, and after all, it is a project with commits from multiple people. For rule tests, it can leverage the existing tools in Calcite's rule tests to create new ones.

lmatz commented 5 years ago

What are your ideas about the JSON structure for optimized plans?

Just my 2 cents. JSON file may be structured as a list of tasks. Each task contains one or more physical operators. Each physical operator may have its special structure.

If the task is a source which reads data from external data source, then it should record the data source in its JSON fields, such as table name and column names. A task reading data from other tasks should record its parents and the same to tasks writing data to other tasks.

Besides, JSON structure may have an array which lists all the data source tasks' names so that the developer of the Husky program can start visiting each task from these data source tasks.

Just an initial idea if some may feel hard to get started. It heavily depends on the people which are responsible for the Husky side development.

TatianaJin commented 5 years ago

I feel each pull request of adding new rules, new features, etc. may all accompany with its tests. It helps the reviewer to go through the code, and after all, it is a project with commits from multiple people. For rule tests, it can leverage the existing tools in Calcite's rule tests to create new ones.

@Alice3150 @WuZihao1 @Sleepy-Neko @lisy45 Please pay attention. You may learn how to write tests by reading Calcite's tests.

MingyuGuan commented 5 years ago

I feel each pull request of adding new rules, new features, etc. may all accompany with its tests. It helps the reviewer to go through the code, and after all, it is a project with commits from multiple people. For rule tests, it can leverage the existing tools in Calcite's rule tests to create new ones.

@Alice3150 @WuZihao1 @Sleepy-Neko @lisy45 Please pay attention. You may learn how to write tests by reading Calcite's tests.

Got it. Thanks.

TatianaJin / husky-sql

Query Optimization and Convert Logical Plan to Physical Plan #7