databendlabs / databend


RFC(new grammar): copy/insert ... with transform #10398

Closed youngsofun closed 1 year ago

youngsofun commented 1 year ago

Summary

Goal

Solve https://github.com/datafuselabs/databend/issues/10173

Proposal

Simple form (can only be used in COPY, which can do schema inference):

```sql
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 + 1 FROM t
-- or maybe simpler:
COPY INTO table1 FROM stage1 WITH TRANSFORM c1, c2 + 1
```

Standard form:

```sql
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 + 1 FROM t(c1 int, c2 string)
```

When we need a schema:

  1. First of all, making the schema explicit is beneficial: the SQL is clearer and easier to maintain, and it serves as documentation of the data.
  2. For streaming load or ClickHouse-style insert, schema inference is not possible.
  3. Schema inference depends on which file is chosen for inference, but that file's data may be bad.
  4. The schema of some file formats is hard to infer:
    1. For CSV/TSV:
      1. Users can only address columns positionally as $1, $2, which is a disaster when there are many columns.
      2. The only always-safe inference is all columns as string. With a schema, the string can be deserialized directly into the destination column type, which is not only convenient but also much more efficient (deserializing while reading is faster than reading into UTF-8 strings and then casting): compare `with transform t.c1, t.c2 + 1 from t(c1 int8)` with `with transform t.c1::int8 from t(c1 int8)`.
    2. For NDJSON, the only always-safe inference is all columns as variant. Even if we risk inferring, a column with `object` values could be either a map or a variant, and a `string` can map to many types, as in CSV/TSV, e.g. `timestamp`/`decimal`...
    3. Even for Parquet, columns like variant are not mapped directly: https://github.com/datafuselabs/databend/issues/10272
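To make the CSV/TSV case concrete, here is a sketch under the proposed grammar (table and column names are hypothetical):

```sql
-- With inference only, CSV columns are addressed positionally and arrive as strings,
-- so typing requires an explicit cast inside the transform expressions:
COPY INTO table1 FROM stage1 WITH TRANSFORM $1::int8, $2 FROM t
-- With an explicit schema, columns have names and types up front,
-- and deserialization can happen while reading instead of via a later cast:
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 FROM t(c1 int8, c2 string)
```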

We can provide syntactic sugar for this:

```sql
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 + 1 FROM t LIKE table2
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 + 1 FROM t STRINGS(n)  -- all columns are string, used for CSV/TSV
```

Note:

  1. If ON_ERROR is skip-by-file (not supported yet), the expression after transform cannot contain aggregate functions.
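A hypothetical illustration of what that rule would forbid (the `ON_ERROR` spelling here is an assumption, not settled syntax):

```sql
-- Forbidden under the note above: SUM() mixes rows from many files,
-- so skipping one bad file would silently change the aggregate result:
COPY INTO table1 FROM stage1 WITH TRANSFORM SUM(t.c1) FROM t(c1 int8) ON_ERROR = SKIP_FILE
```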
youngsofun commented 1 year ago

cc @sundy-li @Xuanwo @BohuTANG

sundy-li commented 1 year ago

Maybe it's better to add a doc in the RFC directory; that would make it easier to comment on.

sundy-li commented 1 year ago
```sql
COPY INTO table1 FROM stage1 WITH TRANSFORM c1, c2 + 1
```

vs

```sql
COPY INTO table1 FROM (SELECT c1, c2 + 1 FROM stage1)
```

I prefer the latter one.

youngsofun commented 1 year ago
> `Copy into table1 from stage1 with transform c1, c2 + 1`
>
> vs
>
> `Copy into table1 from (select c1, c2 + 1 from stage1)`
>
> I prefer the latter one.

That leaves no place for the data schema, and it is harder to implement ON_ERROR.

youngsofun commented 1 year ago

I prefer a verbose data schema. Schema inference has many limitations; it is only convenient for playing with data, not good for production.

Or:

```sql
COPY INTO table1 FROM (SELECT c1, c2 + 1 FROM stage1 WITH SCHEMA (c1 int8, c2 string))
```

The good point is that it extends `select from stage` too.

But it is not as friendly as `with transform` for using ON_ERROR directly and for checking/recording the files.
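To illustrate that trade-off, a sketch of both shapes side by side (option spellings are assumptions, not settled syntax):

```sql
-- With the transform clause, ON_ERROR and per-file bookkeeping remain COPY options:
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 + 1 FROM t(c1 int8, c2 string) ON_ERROR = CONTINUE
-- With a subquery, errors surface inside the SELECT, before COPY's
-- file-level error handling can attribute them to a source file:
COPY INTO table1 FROM (SELECT c1, c2 + 1 FROM stage1 WITH SCHEMA (c1 int8, c2 string))
```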

BohuTANG commented 1 year ago

> I prefer a verbose data schema. Schema inference has many limitations; it is only convenient for playing with data, not good for production.

I'd prefer not to do that. The REPLACE/MERGE data sources may also use the transform statement, which makes this complex for us; it is more natural to use a `select from stage` statement, and supporting only Parquet is good enough for now.

```sql
MERGE INTO target_table FROM (SELECT c1, c1 + 1 FROM stage) CASE WHEN ...
```