delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.62k stars 1.71k forks

Add support for GENERATED ALWAYS AS IDENTITY in DeltaTableBuilder #1072

Closed: norbitek closed this issue 3 months ago

norbitek commented 2 years ago

The latest version of Databricks added support for identity columns in Delta tables. It is now possible to specify GENERATED ALWAYS AS IDENTITY in a column definition.

It would be nice to be able to do the same using DeltaTableBuilder, for example:

    from delta.tables import DeltaTable
    from pyspark.sql.types import DateType

    # Proposed usage: the IDENTITY(...) form of generatedAlwaysAs is the feature being requested.
    DeltaTable.create(spark) \
        .tableName("default.people10m") \
        .addColumn("id", "BIGINT", generatedAlwaysAs="IDENTITY(START WITH 10 INCREMENT BY 10)") \
        .addColumn("firstName", "STRING") \
        .addColumn("middleName", "STRING") \
        .addColumn("lastName", "STRING", comment="surname") \
        .addColumn("gender", "STRING") \
        .addColumn("birthDate", "TIMESTAMP") \
        .addColumn("dateOfBirth", DateType(), generatedAlwaysAs="CAST(birthDate AS DATE)") \
        .addColumn("ssn", "STRING") \
        .addColumn("salary", "INT") \
        .partitionedBy("gender") \
        .execute()
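For comparison, the request above corresponds roughly to a SQL column clause of the form `GENERATED ALWAYS AS IDENTITY (START WITH 10 INCREMENT BY 10)`. The following is only a sketch of how such a DDL statement could be assembled as a string in Python. The helper names `identity_clause` and `create_table_sql` are hypothetical (they are not part of Delta Lake), and whether a given runtime accepts the resulting statement is an assumption to verify.

```python
# Hypothetical helpers, not part of any Delta Lake API: they only build a SQL string.
# Whether your runtime accepts the IDENTITY clause is an assumption to check.

def identity_clause(start=1, step=1):
    """Return the column clause for an identity column."""
    if step == 0:
        raise ValueError("INCREMENT BY must not be 0")
    return f"GENERATED ALWAYS AS IDENTITY (START WITH {start} INCREMENT BY {step})"


def create_table_sql(table, columns, partition_by=()):
    """Build a CREATE TABLE statement.

    columns is a list of (name, sql_type, extra_clause_or_None) tuples.
    """
    col_defs = []
    for name, sql_type, extra in columns:
        # Skip the extra clause when it is None.
        col_defs.append(" ".join(part for part in (name, sql_type, extra) if part))
    sql = f"CREATE TABLE {table} (\n  " + ",\n  ".join(col_defs) + "\n) USING DELTA"
    if partition_by:
        sql += "\nPARTITIONED BY (" + ", ".join(partition_by) + ")"
    return sql


ddl = create_table_sql(
    "default.people10m",
    [
        ("id", "BIGINT", identity_clause(start=10, step=10)),
        ("firstName", "STRING", None),
        ("birthDate", "TIMESTAMP", None),
        ("dateOfBirth", "DATE", "GENERATED ALWAYS AS (CAST(birthDate AS DATE))"),
        ("gender", "STRING", None),
    ],
    partition_by=["gender"],
)
print(ddl)
# On a runtime that supports this syntax, the statement could be run with spark.sql(ddl).
```

Building the statement as plain text keeps the sketch independent of any Spark session; it does not replace the builder support requested in this issue.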

allisonport-db commented 2 years ago

Hi @norbitek, thanks for opening this issue. This is definitely in the plan for Delta Lake, but we're currently prioritizing other features on the roadmap (#920), such as OPTIMIZE ZORDER and CDF.

keen85 commented 2 years ago

@norbitek, it's on the roadmap for 2022 H2 🥳 https://github.com/delta-io/delta/issues/1307

wedesoft commented 2 years ago

I tried to add a generated column using SQL. Do I understand correctly that it is not yet supported in PySpark?

zsxwing commented 2 years ago

@wedesoft Spark doesn't support it yet. SQL syntax support for generated columns is tracked in #1100.
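For expression-based generated columns (as opposed to identity columns), the Python DeltaTableBuilder accepts a `generatedAlwaysAs` argument on `addColumn`, so the SQL syntax is not required. A minimal sketch follows. The function `add_people_columns` and the `RecordingBuilder` stand-in are hypothetical names introduced only so the calls can be shown without a Spark session; with a real session the same function would be passed an actual builder.

```python
def add_people_columns(builder):
    """Add example columns, including one generated column, to a builder-like object."""
    return (
        builder
        .addColumn("firstName", "STRING")
        .addColumn("lastName", "STRING", comment="surname")
        .addColumn("birthDate", "TIMESTAMP")
        # Generated column: its value is computed from birthDate on write.
        .addColumn("dateOfBirth", "DATE", generatedAlwaysAs="CAST(birthDate AS DATE)")
    )


class RecordingBuilder:
    """Illustration-only stand-in that records addColumn calls instead of creating a table."""

    def __init__(self):
        self.calls = []

    def addColumn(self, name, data_type, **options):
        self.calls.append((name, data_type, options))
        return self  # allow chaining, like the real builder


recorder = add_people_columns(RecordingBuilder())
for call in recorder.calls:
    print(call)

# With a real session (assumes delta-spark is installed and `spark` exists):
#   from delta.tables import DeltaTable
#   add_people_columns(
#       DeltaTable.create(spark).tableName("default.people10m")
#   ).execute()
```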

jasperp97 commented 1 year ago

Is this still on the roadmap?

thebaz73 commented 1 year ago

Any news on the status of this issue?

shahkalpan07 commented 1 year ago

Any update on the release date?

bart-samwel commented 1 year ago

This is definitely still on the roadmap! However, at the moment all the focus is on completing Deletion Vectors, which are in high demand. We will only get to this item after that work is complete.

keen85 commented 9 months ago

Since Delta Lake 3.1.0 (with deletion vectors) is out now, would you consider working on it for 3.2, @bart-samwel? 😇

bart-samwel commented 9 months ago

@keen85

> Since Delta Lake 3.1.0 (with deletion vectors) is out now, would you consider working on it for 3.2

Thank you for the reminder! It is near the top of our list now. I can't make any hard guarantees, but I'm hopeful that we'll get to this pretty soon.

norbitek commented 9 months ago

@bart-samwel Why are features in the Standalone version implemented with such a long delay? Does it mean that for every new feature (liquid clustering, for example) we will have to wait about 2 years?

bart-samwel commented 9 months ago

@norbitek

> Why are features in the Standalone version implemented with such a long delay?

Just to make sure there's no confusion here: Delta Standalone is different from the Spark connector for Delta Lake. Standalone is a library that can be used to implement connectors for non-Spark systems, and it is not really getting new features anymore because its design is not well suited to supporting many of them. All of the new effort is going into Delta Kernel, which is the new library for building connectors. It makes it a lot easier to keep up with new features, and we intend to keep it up to date.

Identity columns are a feature where we have unfortunately dropped the ball, even for support in the Spark connector. That is the exception though, not the rule!

> Does it mean that for every new feature (liquid clustering, for example) we will have to wait about 2 years?

Certainly not! Like I said, identity columns are an exception. Liquid clustering was actually released in Delta Lake 3.1, which came out last week! https://github.com/delta-io/delta/releases

SYOGESH045 commented 5 months ago

Hi, at my company we currently don't use Spark SQL anywhere, so I would like to use the DeltaTableBuilder API instead. Has this issue been resolved? If not, when can we expect this update?

Many thanks, Yogesh S

tdas commented 5 months ago

@SYOGESH045 The next release of Delta is going to be Delta 3.3. The identity column support seems to be in progress - https://github.com/delta-io/delta/pull/3044. So Delta 3.3 should have it. If I have to hazard a guess, Delta 3.3 should be released in 2-3 months.