[Feature Request] SQL syntax for GENERATED columns in OSS

kpe commented 2 years ago

Bug

Using SQL it is not possible to CREATE or INSERT into a table with GENERATED columns with spark 3.2.1 and delta 1.2.0.

Describe the problem

Steps to reproduce

download latest spark-3.2.1-bin-hadoop3.2-scala2.13

start with

bin/spark-sql --packages io.delta:delta-core_2.13:1.2.0 --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

try creating a table with a generated column:

CREATE TABLE `everts` (
event_time TIMESTAMP NOT NULL,
event_month DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
)
USING DELTA
;

try INSERT INTO
```
INSERT INTO TABLE default.events (event_time) 
VALUES (timestamp(now()))
;
```
Observed results

neither the CREATE nor the INSERT (for the case the table was created with DeltaTableBuilder) would succeed.

Expected results

Would be really great if the Delta OSS supports more of the feature announced by Databricks (otherwise it would help to communicate explicitly which features are Databricks exclusive and which are also available in the Delta OSS).

Further details

Could it be that the SQL GENERATED column syntax is supported on databricks only? Or am I missing something? I cannot find any 'GENERATED ALWAYS ...' related grammar in DeltaSqlBase.g4 either?

Environment information

Delta Lake version: 1.2.0
Spark version: 3.2.1
Scala version: 2.13.5

kpe commented 2 years ago

ok, I found this https://groups.google.com/g/delta-users/c/x2xnHkysIZw - CREATE and INSERT SQL syntax is not supported

vkorukanti commented 2 years ago

Thanks for creating this issue. As mentioned in the email group here, it requires upgrade to Spark 3.4.

kpe commented 2 years ago

Thanks @vkorukanti, I don't see a reference to Spark 3.4 in the above link. Building the current spark master branch (3.4.0-SNAPSHOT) doesn't seem to help either. I guess for the GENERATED column syntax to work, we'll first need to see a change in DeltaSqlParser.scala or SqlBaseParser.sql.

zsxwing commented 2 years ago

@kpe This is not in Spark yet. We are talking to the Spark community and hope we can get this syntax supported in Spark 3.4. It's a SQL standard so it should not be hard to add the parser support. And you are right. We need to add this to SqlBaseParser.g4.

vegarsti commented 2 years ago

@kpe This is not in Spark yet. We are talking to the Spark community and hope we can get this syntax supported in Spark 3.4. It's a SQL standard so it should not be hard to add the parser support. And you are right. We need to add this to SqlBaseParser.g4.

Thank you for doing that! Do you have a link to that community discussion? Very eager to get this in!

kpe commented 2 years ago

just as a side note - it is interesting to note, that Apache Iceberg has a similar/related feature "partitioning transform" which uses syntax like:

...
PARTITIONED BY (months(ts), ...)
...

with months() being a partition transform. (I guess, very often the GENERATED columns in delta will be used for defining a particular partitioning; in iceberg the syntax for this use case is less explicit, but seems to fit nicely - i.e. if you only need the GENERATED column to set the partitioning, you might not want it to be shown in a SELECT *; using a PARTITIONED BY (transform(column)) is also quite SQL-idiomatic - this might be worth considering, while discussing the SQL syntax changes with the Spark team).

tdas commented 2 years ago

@kpe Can you clarify something? are you saying that when iceberg uses PARTITIONED BY (transform(column)), then SELECT * does not show that generated partition column? Then how do users inspect that column value if something has gone wrong in that generation of its value?

In general, we designed Generated Column as the more general solution to such things that can be used on any column and any expressions, not just for a limited set of expressions and only for partitioning. This approach, along with the GENERATED ALWAYS AS syntax is actually SQL standard compliant (see Mysql, Postgres). APIs that adhere to long-standing SQL standards is usually more SQL-idiomatic than system-specific syntax. It ensures zero-learning curve for any SQL user who are already familiar with such standards when using other SQL engines. This will be our argument for accepting this syntax in the Spark. Finger's crossed that this works!

kpe commented 2 years ago

@tdas Thank you for the explanation! It all make sense, and I'm looking forward to use the GENERATED syntax in delta open source soon.

(without being an iceberg advocate or even a big data practitioner, I guess, iceberg's motivation around their partition handling, is to allow for queries without explicitly specified partitioning information, which could be redundant and error-prone, while allowing for partition schema evolution without breaking existing queries )

tdas commented 2 years ago

@kpe I see! Thank you for giving that additional information! Definitely food for thought for us.

kpe commented 1 year ago

can we expect the SQL syntax for GENERATED columns to become available in Delta 2.0? (now that delta is to be fully open sourced)

allisonport-db commented 1 year ago

Hey @kpe as mentioned earlier in this thread this requires changes in Spark. Making those necessary changes is on our roadmap but we will still have to wait for the next Spark release and upgrade our version.