TatianaJin / husky-sql

Other
3 stars 5 forks source link

Implementing a Husky adapter and codegen for TableScan #3

Open TatianaJin opened 5 years ago

TatianaJin commented 5 years ago

As a practice for better understanding the query optimization and execution pipeline, let us start with supporting simple queries like SELECT * FROM table, where the table is a file stored in HDFS and to be read by Husky.

A rough guideline is as follows.

1. Provide schema and adapter

There are three major parts that you need:

  1. model.json: specify the schema in your database, for this practice one schema with one table is enough.
  2. Table: store the necessary information (e.g. the table url) for the table to read, and provide the column types through GetRowType. Hints: you may specify the columns in the model.json through "columns". To handle more dynamic cases, you may try using JNI or any method you think appropriate to dynamically load the column types from the file (e.g. stored in ORC, Parquet file header).
  3. TableFactory: create the table according to the given operands & row data type.

2. Optimize query and translate optimized plan into json

In the HuskyQueryPlan example we show how the optimization is carried out step by step. You can get the final plan (currently it is a plan in husky logical convention), and translate it into a json file (you can use your creativity to determine how to organize the json).

Alternatively, if you are more comfortable with using JNI to execute Husky tasks, you may use JNI instead of translating a plan into json.

You may try three cases (ordered from easiest to hardest):

  1. SELECT * from table: you will get a plan with only one RelNode: HuskyLogicalNativeTableScan. You can think about how to use the info in this RelNode to get the corresponding table (especially the url)
  2. Add project SELECT col1, col2 from table: think about what info is needed to implement the project function in Husky given the table. You may also implement a rule to push project into the table scan (the rule name is given in the Husky rulesets).
  3. Add filter SELECT col1, col2 from table where col3=val: you may start with a equality filter in the format of expression=literal. Think about what information is needed and how to get it from the corresponding RelNode.

For each group, either you use JNI or json for codegen, please reply here to specify your solution e.g. if you use json, please specify the structure of your json.

3. Provide execution logic in Husky

Provide your implementation for table scan, project, and filter. If json is used for codegen part, write a main program to read the json and execute the sql logic accordingly.

For each group, please reply here to specify your interface for the operators

lmatz commented 5 years ago

I am not sure whether I understand it correctly: if JNI is used, isn’t the data being constantly copied from c++ to java and back to execute one function after another?

TatianaJin commented 5 years ago

I am not sure whether I understand it correctly: if JNI is used, isn’t the data being constantly copied from c++ to java and back to execute one function after another?

Not necessarily, by using JNI, we can pass the Relnode tree to c++ and let Husky parse the tree and do execution.

lmatz commented 5 years ago

naruhodo! Didn’t expect to transfer Relnode tree

TatianaJin commented 5 years ago

To get the project columns and filter conditions: use HuskyLogicalCal::getProgram() to get a RexProgram, and use getCondition(), getProjectLists(), and getExprList() to get the conditions, projections and the expression list.

lisy45 commented 5 years ago

Able to generate JSON file according to the optimized plan now. Currently the output JSON file will only include information of version,url,projection and condition.

TatianaJin commented 5 years ago

Able to generate JSON file according to the optimized plan now. Currently the output JSON file will only include information of version,url,projection and condition.

Could you specify the json structure or give an example?

TatianaJin commented 5 years ago

@Alice3150 @WuZihao1 @Sleepy-Neko @lisy45 Hi all, I hope you can read all others pr #4 #5 #6 and comment. For this issue, it looks like Siyao's team is more on generating the json and implementing on husky, while Alice is more on the optimization rules. Please have discussions on the IMPORTANT interface: json -> husky and let us decide the design (not the final version, but at least sufficient for this issue) by next week(? if you have any concern about the time, tell me).