delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.17k stars 1.62k forks source link

[Feature Request] SQL syntax for GENERATED columns in OSS #1100

Open kpe opened 2 years ago

kpe commented 2 years ago

Bug

Using SQL it is not possible to CREATE or INSERT into a table with GENERATED columns with spark 3.2.1 and delta 1.2.0.

Describe the problem

Steps to reproduce

  1. download latest spark-3.2.1-bin-hadoop3.2-scala2.13

  2. start with

    bin/spark-sql --packages io.delta:delta-core_2.13:1.2.0 --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
  3. try creating a table with a generated column:

    CREATE TABLE `everts` (
    event_time TIMESTAMP NOT NULL,
    event_month DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
    )
    USING DELTA
    ;
  4. try INSERT INTO

    INSERT INTO TABLE default.events (event_time) 
    VALUES (timestamp(now()))
    ;

    Observed results

    neither the CREATE nor the INSERT (for the case the table was created with DeltaTableBuilder) would succeed.

    Expected results

    Would be really great if the Delta OSS supports more of the feature announced by Databricks (otherwise it would help to communicate explicitly which features are Databricks exclusive and which are also available in the Delta OSS).

    Further details

Could it be that the SQL GENERATED column syntax is supported on databricks only? Or am I missing something? I cannot find any 'GENERATED ALWAYS ...' related grammar in DeltaSqlBase.g4 either?

Environment information

kpe commented 2 years ago

ok, I found this https://groups.google.com/g/delta-users/c/x2xnHkysIZw - CREATE and INSERT SQL syntax is not supported

vkorukanti commented 2 years ago

Thanks for creating this issue. As mentioned in the email group here, it requires upgrade to Spark 3.4.

kpe commented 2 years ago

Thanks @vkorukanti, I don't see a reference to Spark 3.4 in the above link. Building the current spark master branch (3.4.0-SNAPSHOT) doesn't seem to help either. I guess for the GENERATED column syntax to work, we'll first need to see a change in DeltaSqlParser.scala or SqlBaseParser.sql.

zsxwing commented 2 years ago

@kpe This is not in Spark yet. We are talking to the Spark community and hope we can get this syntax supported in Spark 3.4. It's a SQL standard so it should not be hard to add the parser support. And you are right. We need to add this to SqlBaseParser.g4.

vegarsti commented 2 years ago

@kpe This is not in Spark yet. We are talking to the Spark community and hope we can get this syntax supported in Spark 3.4. It's a SQL standard so it should not be hard to add the parser support. And you are right. We need to add this to SqlBaseParser.g4.

Thank you for doing that! Do you have a link to that community discussion? Very eager to get this in!

kpe commented 2 years ago

just as a side note - it is interesting to note, that Apache Iceberg has a similar/related feature "partitioning transform" which uses syntax like:

...
PARTITIONED BY (months(ts), ...)
...

with months() being a partition transform. (I guess, very often the GENERATED columns in delta will be used for defining a particular partitioning; in iceberg the syntax for this use case is less explicit, but seems to fit nicely - i.e. if you only need the GENERATED column to set the partitioning, you might not want it to be shown in a SELECT *; using a PARTITIONED BY (transform(column)) is also quite SQL-idiomatic - this might be worth considering, while discussing the SQL syntax changes with the Spark team).

tdas commented 2 years ago

@kpe Can you clarify something? are you saying that when iceberg uses PARTITIONED BY (transform(column)), then SELECT * does not show that generated partition column? Then how do users inspect that column value if something has gone wrong in that generation of its value?

In general, we designed Generated Column as the more general solution to such things that can be used on any column and any expressions, not just for a limited set of expressions and only for partitioning. This approach, along with the GENERATED ALWAYS AS syntax is actually SQL standard compliant (see Mysql, Postgres). APIs that adhere to long-standing SQL standards is usually more SQL-idiomatic than system-specific syntax. It ensures zero-learning curve for any SQL user who are already familiar with such standards when using other SQL engines. This will be our argument for accepting this syntax in the Spark. Finger's crossed that this works!

kpe commented 2 years ago

@tdas Thank you for the explanation! It all make sense, and I'm looking forward to use the GENERATED syntax in delta open source soon.

(without being an iceberg advocate or even a big data practitioner, I guess, iceberg's motivation around their partition handling, is to allow for queries without explicitly specified partitioning information, which could be redundant and error-prone, while allowing for partition schema evolution without breaking existing queries )

tdas commented 2 years ago

@kpe I see! Thank you for giving that additional information! Definitely food for thought for us.

kpe commented 1 year ago

can we expect the SQL syntax for GENERATED columns to become available in Delta 2.0? (now that delta is to be fully open sourced)

allisonport-db commented 1 year ago

Hey @kpe as mentioned earlier in this thread this requires changes in Spark. Making those necessary changes is on our roadmap but we will still have to wait for the next Spark release and upgrade our version.

keen85 commented 1 year ago

hi @allisonport-db & @zsxwing any updates concerning including this into Spark 3.4?

allisonport-db commented 1 year ago

Working to get this into 3.4 tracking via SPARK-41290

jaceklaskowski commented 1 year ago

SPARK-41290 closed 🔥

felipepessoto commented 1 year ago

SPARK-41290 also contains changes required to https://github.com/delta-io/delta/issues/1072?

zsxwing commented 1 year ago

SPARK-41290 also contains changes required to #1072?

No. It doesn't add the identity column syntax.

kevinclcn commented 1 year ago

spark 3.4 supports generated always as syntax, but delta 2.4 doesn't support it yet, when will it get support?

VeeraswamyGatta commented 7 months ago

Any update on this issue, In latest Delta 3.0 also this issue not fixed.