apache / shardingsphere


Support parsing Doris sql #31466

Open iamhucong opened 5 months ago

iamhucong commented 5 months ago

Background

Hi community.

The ShardingSphere SQL parser engine helps users parse SQL to create an AST (Abstract Syntax Tree) and then visit the AST to get a SQLStatement (Java object). Currently, we are planning to enhance support for Doris SQL parsing in ShardingSphere.

More details: https://shardingsphere.apache.org/document/current/en/reference/sharding/parse/
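For orientation, the parse engine can also be driven directly from Java. The sketch below is adapted from the usage shown in the documentation linked above; the import paths, the "Doris" database type value, and the exact constructor signatures are assumptions that have changed between 5.x versions, so check them against the version you build.

    // A minimal sketch of invoking the parser programmatically, assuming the
    // 5.x style API; package locations and signatures may differ per version.
    import org.apache.shardingsphere.sql.parser.api.CacheOption;
    import org.apache.shardingsphere.sql.parser.api.SQLParserEngine;

    public final class DorisParseSketch {

        public static void main(final String[] args) {
            // Parse the SQL text into an ANTLR parse tree for the Doris dialect.
            SQLParserEngine parserEngine = new SQLParserEngine("Doris", new CacheOption(128, 1024L));
            // The returned AST can then be visited (e.g. via a statement visitor engine)
            // to obtain the SQLStatement Java object described above.
            System.out.println(parserEngine.parse("REFRESH TABLE db.table1", false));
        }
    }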

Issue Background Explanation

This issue comes from a custom web scraping script that fetches SQL cases from the official website of the corresponding database. These SQL cases are then passed to ShardingSphere's parse engine for analysis, and the cases that fail to parse are grouped into issues of three to five cases each.

  1. Since the SQL cases are obtained through web scraping, it cannot be guaranteed that all of them are correct. Please follow the process below when handling this issue and its pull request (PR).
  2. Some SQL cases may have already been fixed in other PRs. For cases that already parse successfully, simply leave a comment so they can be ignored.
  3. If a SQL case already parses successfully without any code changes, there is no need to add a corresponding test assertion file.

Task

This issue is to support parsing more Doris SQL statements, as follows:

REFRESH CATALOG hive
REFRESH DATABASE ctl.database1
REFRESH DATABASE database1
REFRESH TABLE ctl.db.table1
REFRESH TABLE db.table1
REFRESH TABLE table1
BIGINT size(ARRAY<T> arr)
BIGINT array_size(ARRAY<T> arr)
BIGINT cardinality(ARRAY<T> arr)

Syntax:

    curl --location-trusted -u user:passwd -XPOST http://host:port/api/{db}/_multi_start?label=xxx
    curl --location-trusted -u user:passwd -T data.file http://host:port/api/{db}/{table1}/_load?label=xxx\&sub_label=yyy
    curl --location-trusted -u user:passwd -T data.file http://host:port/api/{db}/{table2}/_load?label=xxx\&sub_label=zzz
    curl --location-trusted -u user:passwd -XPOST http://host:port/api/{db}/_multi_commit?label=xxx
    curl --location-trusted -u user:passwd -XPOST http://host:port/api/{db}/_multi_desc?label=xxx

On the basis of 'MINI LOAD', 'MULTI LOAD' lets users import into multiple tables at the same time, using the commands shown above:

  - '/api/{db}/_multi_start' starts a multi-table import task.
  - '/api/{db}/{table}/_load' adds a table to an import task. The main difference from 'MINI LOAD' is that the 'sub_label' parameter must be passed in.
  - '/api/{db}/_multi_commit' submits the entire multi-table import task, which then starts processing in the background.
  - '/api/{db}/_multi_abort' aborts a multi-table import task.
  - '/api/{db}/_multi_desc' displays the number of jobs submitted in a multi-table import task.

Description of the HTTP protocol:

  - Authorization authentication: Doris currently uses HTTP Basic authorization, so the username and password need to be specified when importing. Note that the password is passed in clear text, which is only acceptable in an intranet environment.
  - Expect: the HTTP request sent to Doris needs to carry the 'Expect' header with the content '100-continue'. Why? Because the request is redirected, and sending this header before transmitting the data content avoids transmitting the data multiple times, thereby improving efficiency.
  - Content-Length: the request needs to carry the 'Content-Length' header. If less content is sent than 'Content-Length' declares, Doris considers the transmission faulty and fails to submit the task. NOTE: if more data is sent than 'Content-Length', Doris only reads and imports content up to that length.

Parameter description:

  - user: if the user is in the default_cluster, user is user_name; otherwise it is user_name@cluster_name.
  - label: specifies the label of this import batch, used for later job status queries and so on. This parameter is required.
  - sub_label: specifies the sub-label inside a multi-table import task. For loads imported into multiple tables, this parameter must be passed in.
  - columns: describes the corresponding column names in the import file. If it is not passed in, the columns in the file are assumed to be in the same order as in the table definition. Columns are specified comma-separated, for example: columns=k1,k2,k3,k4
  - column_separator: specifies the separator between columns; the default is '\t'. NOTE: url encoding is required; for example, to specify '\t' as the delimiter, pass in 'column_separator=%09'.
  - max_filter_ratio: specifies the maximum ratio of non-conforming data allowed to be filtered out; the default is 0, meaning no filtering is allowed. A custom value looks like 'max_filter_ratio=0.2', which allows a 20% error rate. It takes effect when passed in at '_multi_start'.

NOTE:

  1. This import method currently completes the import on a single machine, so it is not suitable for importing large amounts of data; it is recommended that the imported data not exceed 1GB.
  2. It is currently not possible to submit multiple files using `curl -T "{file1, file2}"`, because curl splits them into multiple requests, and multiple requests cannot share a label, so this cannot be used.
  3. Importing data into Doris with curl in a streaming fashion is supported, but the real import behavior only occurs after the streaming ends, and the amount of data imported this way cannot be too large.

Keywords: MULTI, MINI, LOAD
CREATE TABLE test1.table2 LIKE test1.table1

Overall Procedure

If you intend to participate in fixing this issue, please feel free to leave a comment below the issue. Community members will assign the issue accordingly.

For example, you can leave a comment like this: "Hi, please assign this issue to me. Thank you!"

Once you have claimed the issue, please review the syntax of the SQL on the official website of the corresponding database. Execute the SQL on the respective database to ensure the correctness of the SQL syntax.

You can check the corresponding source of each SQL case on the official database website by clicking on the link provided below each case.

Next, execute the problematic SQL cases mentioned above in the database (you can quickly start the corresponding database using the Docker image for that database, and then connect to it using a client you are familiar with), to ensure that the SQL syntax itself is correct.

Fixing ANTLR Grammar Parsing Issue

Once you have confirmed the correctness of the SQL syntax, you can validate and fix the grammar parsing issue in ShardingSphere.

If you are using IntelliJ IDEA, you will need to install the ANTLR plugin before proceeding.

If ANTLR reports a parsing error, try to repair the .g4 file by comparing it with the official database syntax until ANTLR can parse the SQL correctly.

When there is no error message in the ANTLR Preview window, it means that ANTLR can correctly parse the SQL.
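As an illustration of what such a fix can look like, the REFRESH statements in this issue might be covered by rules along the following lines; the rule and token names here are hypothetical and should be adapted to the conventions the existing Doris .g4 files already use:

    // Hypothetical grammar sketch only, not the actual ShardingSphere rules.
    refreshCatalog
        : REFRESH CATALOG identifier
        ;

    refreshDatabase
        : REFRESH DATABASE (identifier DOT_)? identifier
        ;

    refreshTable
        : REFRESH TABLE ((identifier DOT_)? identifier DOT_)? identifier
        ;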

Visitor problem fix

After ANTLR parses the SQL into an abstract syntax tree, ShardingSphere visits the tree through a Visitor and extracts the required information. If you need to extract Segments, you first need to execute:

 mvn -T 2C clean install -DskipTests

Run this under the shardingsphere-parser module to compile the entire parser module. Then override the corresponding visit method in the SQLStatementVisitor as needed to extract the corresponding Segment.
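For example, a visit method for REFRESH TABLE could look roughly like the sketch below; the context and statement class names (RefreshTableContext, DorisRefreshTableStatement) are hypothetical and need to match what the generated Doris parser and statement packages actually define:

    // Hypothetical sketch of a visit override in the Doris statement visitor.
    @Override
    public ASTNode visitRefreshTable(final RefreshTableContext ctx) {
        DorisRefreshTableStatement result = new DorisRefreshTableStatement();
        if (null != ctx.tableName()) {
            // Visiting tableName() is expected to produce a SimpleTableSegment.
            result.setTable((SimpleTableSegment) visit(ctx.tableName()));
        }
        return result;
    }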

Add assertion test file

After the SQL parsing problem above is fixed, corresponding tests need to be added. The steps are as follows (a schematic example of the two case files follows the list):

  1. Add the corresponding sql-case in the sql/supported directory.
  2. Add case assertions in the case directory of the shardingsphere-test-it-parser module.
  3. Run org.apache.shardingsphere.test.it.sql.parser.internal.InternalSQLParserIT. After the SQL parser IT runs successfully, you can submit a PR.
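Schematically, the two files added in steps 1 and 2 look like the sketch below; the case id, element names, and index values are illustrative, so copy the exact structure from existing cases in the repository:

    <!-- sql/supported: registers the SQL text under a case id -->
    <sql-case id="refresh_table" value="REFRESH TABLE db.table1" db-types="Doris" />

    <!-- case directory: asserts the structure of the parsed SQLStatement -->
    <refresh-table sql-case-id="refresh_table">
        <table name="table1" start-index="14" stop-index="22">
            <owner name="db" start-index="14" stop-index="15" />
        </table>
    </refresh-table>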

Relevant Skills

  1. Master the Java language
  2. Have a basic understanding of ANTLR .g4 files
  3. Be familiar with Doris SQL

github-actions[bot] commented 4 months ago

There hasn't been any activity on this issue recently, and in order to prioritize active issues, it will be marked as stale.

photon8231 commented 4 months ago

Hi, I'd like to resolve this issue, please assign it to me.