Background
Purpose
Some Research before the design
Thanks to the engineers and designers who have already done a lot of work in this area.
Design
There are three scenarios:
We refer to them as Local Loading, Loading, and Continuous Loading, respectively.
Loading
Local Loading
There are two similar ways to load data from the local host.
A single cURL post
We can load data from the local host using a single cURL post, with the loading SQL statement placed directly in the header parameters.
The `sqlStatement` is roughly described as below. It should be flattened to a single line in the cURL command: `\r` and `\n` could NOT be placed in the statement directly; we should use `\x0d` and `\x0a` instead.

This is the recommended way when the COPY statement is not too complicated.
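A minimal sketch of what such a post might look like (the `sql` header name, the endpoint, and the statement shown are assumptions for illustration, not confirmed syntax):

```shell
# Hypothetical single-post load: the COPY statement travels in a header
# (flattened to one line, using \x0d/\x0a where line breaks would be needed),
# and the data file travels in the body.
curl --location-trusted -u user:password \
    -H "sql: COPY INTO db1.t1 FROM LOCAL FILE_FORMAT = (TYPE = 'CSV')" \
    -T data.csv \
    http://fe_host:8030/api/load
```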
A two-step cURL post
1. Create a loading PIPE first using a similar cURL post. The `"sql-type:FILE"` header parameter indicates that the file to be uploaded is a SQL file, not a data file as usual.
2. Upload data through the PIPE created above with another cURL post. The `"sql-pipe:<pipe_name>"` header parameter indicates that this cURL post will use a loading PIPE created before.
Loading & COPY command
We can just call a COPY command to load data from cloud storage.
The COPY command has two main forms (see the sketch after the notes below):
- It is mainly similar to the Snowflake COPY syntax.
- Transformation & filter are enhanced to support stronger and more flexible expressions (detailed below).
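A hedged sketch of the two forms, loosely following the Snowflake-style syntax mentioned above (keywords and options are illustrative, not final):

```sql
-- Form 1: plain copy from a location.
COPY INTO db1.t1
FROM 's3://my-bucket/path/'
FILE_FORMAT = (TYPE = 'CSV');

-- Form 2: copy with transformation & filter through a SELECT over the files.
COPY INTO db1.t1
FROM (
    SELECT $1, $2, upper($3)
    FROM 's3://my-bucket/path/'
    WHERE $1 IS NOT NULL
)
FILE_FORMAT = (TYPE = 'CSV');
```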
Use `WITH FIELDS` to rename column names for convenience:

- `WITH FIELDS('$.id' as id, '$.a' as ax)` to rename any JSON value with a JSON path, like `jsonpaths` before.
- `WITH FIELDS(a, b, c)` to rename CSV fields from 1 to n.

Location indicates where the data comes from, such as a SOURCE, a PIPE, or HDFS/S3 files.
RESOURCE could be created as below, or just be unfolded in the COPY command:
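For example, something like the following (the property names here are assumptions for illustration):

```sql
-- Hypothetical: a reusable RESOURCE holding connection details,
-- which a COPY command can then reference instead of unfolding inline.
CREATE RESOURCE my_s3_resource
PROPERTIES (
    "type" = "s3",
    "aws.s3.endpoint" = "s3.us-west-2.amazonaws.com",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>"
);
```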
Transformation & Filter
The syntax will be mainly as below:
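Roughly this shape (a sketch; the exact grammar is part of what this proposal is meant to settle):

```sql
COPY INTO <table_name> [ WITH FIELDS ( <field> [, ...] ) ]
FROM (
    SELECT <expr> [, ...]   -- single columns, ranges like $1..$10, or functions
    FROM <location>
    WHERE <condition>       -- filter rows during loading
)
[ FILE_FORMAT = ( ... ) ];
```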
- Even if you need to write out the full column list, you can get it from `show create table <table_name>`, so it won't be too troublesome.
- `select` supports the configuration of a range of fields like `$1..$10` for easy writing and maintenance. At the same time, you can also write ordinary functions (such as `array([$2,$3,$5])` or some more complex functions) or a single column.
- Fields can also be renamed with `WITH FIELDS(…)`, but it should be either empty or all listed.

Some simple examples:
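For instance (illustrative only):

```sql
-- Load a contiguous range of CSV fields plus a derived array column,
-- filtering rows at the same time.
COPY INTO db1.t1
FROM (
    SELECT $1..$10, array([$2, $3, $5])
    FROM 's3://my-bucket/csv/'
    WHERE $1 != ''
);

-- Rename JSON values by path while loading.
COPY INTO db1.t2 WITH FIELDS ('$.id' AS id, '$.a' AS ax)
FROM 's3://my-bucket/json/';
```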
Continuous Loading
A StreamCopy looks like a wrapper around the COPY command, with a data source and some additional schedule parameters. But it will do a lot of work besides COPY:
- It keeps its `state` (including the offset) internally.
- It accepts schedule parameters such as `max_batch_interval`, `max_batch_size`, `max_error_number`, etc.

Example:
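A hypothetical shape for it (the command name and property names are open questions, as discussed under "Some thoughts" below):

```sql
-- Sketch only: a continuously scheduled COPY from an external source.
CREATE STREAM COPY my_stream_copy
PROPERTIES (
    "max_batch_interval" = "20",
    "max_batch_size"     = "250000",
    "max_error_number"   = "0"
)
AS COPY INTO db1.t1
FROM SOURCE kafka_source;   -- a SOURCE created beforehand (assumption)
```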
CREATE PIPE
For other situations in which an outer system implements `connector + schedule + sink` itself, StarRocks will supply a PIPE definition:
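For illustration, it might look like this (names and options are assumptions):

```sql
-- The outer system implements connector + schedule + sink itself;
-- StarRocks only exposes a PIPE that the sink writes into.
CREATE PIPE my_pipe
AS COPY INTO db1.t1
FILE_FORMAT = (TYPE = 'CSV');
```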
Unloading
COPY command
Similar to loading, we can use a COPY command to copy data from an internal table to a specified location.
The `INTO location` clause can be unfolded into a more complicated form.
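A sketch mirroring the loading form (illustrative only):

```sql
-- Unload query results to cloud storage; the syntax mirrors the loading COPY.
COPY INTO 's3://my-bucket/export/'
FROM (SELECT * FROM db1.t1 WHERE dt = '2024-01-01')
FILE_FORMAT = (TYPE = 'PARQUET');
```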
Local Unloading
Some thoughts
Some command names could be changed to more meaningful words.
- Does the `STREAM COPY` command sound appropriate? Would `ROUTINE COPY`, `SINKER`, or something else be more suitable?
- Shall we define a Stream object to store the state of the StreamCopy task?
- Should we use STREAM instead of PIPE?
- What's the difference between `COPY` and `INSERT INTO SELECT`? Which one is more in line with our intuition when we want to load data from an external table, such as from S3 or Iceberg?

I'm very glad to hear anything from you about this proposal.