databendlabs / databend


RFC(new grammar): copy/insert ... with transform #10398

Closed youngsofun closed 1 year ago

youngsofun commented 1 year ago

Summary

Goal

Solve https://github.com/datafuselabs/databend/issues/10173

Proposal

Simple form (can only be used in COPY, which can do schema inference):

```sql
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 + 1 FROM t
-- or maybe simpler:
COPY INTO table1 FROM stage1 WITH TRANSFORM c1, c2 + 1
```

Standard form:

```sql
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 + 1 FROM t(c1 int, c2 string)
```

When we need a schema:

  1. First of all, making the schema explicit is beneficial: the SQL is clearer and easier to maintain, and it serves as documentation of the data.
  2. For streaming load or ClickHouse-style insert, schema inference is not possible.
  3. Schema inference depends on which file is chosen for inference, but that file's data may be bad.
  4. The schema of some file formats is hard to infer:
    1. For CSV/TSV:
      1. Users can only address columns positionally as $1, $2, which is a disaster when there are many columns.
      2. The only always-safe inference is all columns as string. With a schema, the string can be deserialized directly into the destination column type, which is not only convenient but also much more efficient (deserializing while reading is faster than reading into UTF-8 strings and then casting): compare `with transform t.c1, t.c2 + 1 from t(c1 int8)` with `with transform t.c1::int8 from t(c1 int8)`.
    2. For NDJSON, the only always-safe inference is all columns as variant. Even if we risk inferring, a column with `object` values could be either a map or a variant, and a `string` can map to many types, as in CSV/TSV, e.g. `timestamp`/`decimal`...
    3. Even for Parquet, columns like variant are not mapped directly: https://github.com/datafuselabs/databend/issues/10272
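To make the CSV/TSV case concrete, here is a sketch under the proposed grammar (table and column names are hypothetical):

```sql
-- With inference only, CSV columns are addressed positionally and arrive as strings,
-- so typing requires an explicit cast inside the transform expressions:
COPY INTO table1 FROM stage1 WITH TRANSFORM $1::int8, $2 FROM t
-- With an explicit schema, columns have names and types up front,
-- and deserialization can happen while reading instead of via a later cast:
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 FROM t(c1 int8, c2 string)
```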

We can provide syntactic sugar for this:

```sql
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 + 1 FROM t LIKE table2
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 + 1 FROM t STRINGS(n)  -- all columns are string, used for CSV/TSV
```

Note:

  1. If ON_ERROR is skip-by-file (not supported yet), the expression after transform cannot contain aggregate functions.
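A hypothetical illustration of what that rule would forbid (the `ON_ERROR` spelling here is an assumption, not settled syntax):

```sql
-- Forbidden under the note above: SUM() mixes rows from many files,
-- so skipping one bad file would silently change the aggregate result:
COPY INTO table1 FROM stage1 WITH TRANSFORM SUM(t.c1) FROM t(c1 int8) ON_ERROR = SKIP_FILE
```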
youngsofun commented 1 year ago

cc @sundy-li @Xuanwo @BohuTANG

sundy-li commented 1 year ago

Maybe it's better to add a doc in the RFC directory; that would make it easier to comment on.

sundy-li commented 1 year ago
```sql
COPY INTO table1 FROM stage1 WITH TRANSFORM c1, c2 + 1
```

vs

```sql
COPY INTO table1 FROM (SELECT c1, c2 + 1 FROM stage1)
```

I prefer the latter one.

youngsofun commented 1 year ago
> `Copy into table1 from stage1 with transform c1, c2 + 1`
>
> vs
>
> `Copy into table1 from (select c1, c2 + 1 from stage1)`
>
> I prefer the latter one.

That leaves no place for the data schema, and it is harder to implement ON_ERROR.

youngsofun commented 1 year ago

I prefer a verbose data schema. Schema inference has many limitations; it is only convenient for playing with data, not good for production.

Or:

```sql
COPY INTO table1 FROM (SELECT c1, c2 + 1 FROM stage1 WITH SCHEMA (c1 int8, c2 string))
```

The good point is that it extends `select from stage` too.

But it is not as friendly as `with transform` for using ON_ERROR directly and for checking/recording the files.
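To illustrate that trade-off, a sketch of both shapes side by side (option spellings are assumptions, not settled syntax):

```sql
-- With the transform clause, ON_ERROR and per-file bookkeeping remain COPY options:
COPY INTO table1 FROM stage1 WITH TRANSFORM t.c1, t.c2 + 1 FROM t(c1 int8, c2 string) ON_ERROR = CONTINUE
-- With a subquery, errors surface inside the SELECT, before COPY's
-- file-level error handling can attribute them to a source file:
COPY INTO table1 FROM (SELECT c1, c2 + 1 FROM stage1 WITH SCHEMA (c1 int8, c2 string))
```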

BohuTANG commented 1 year ago

> I prefer a verbose data schema. Schema inference has many limitations; it is only convenient for playing with data, not good for production.

I'd prefer not to do that. The REPLACE/MERGE data sources may also use the transform statement, which makes this complex for us; it is more natural to use a `select from stage` statement, and supporting only Parquet is good enough for now.

```sql
MERGE INTO target_table FROM (SELECT c1, c1 + 1 FROM stage) CASE WHEN ...
```