norbitek closed this issue 3 months ago
Hi @norbitek, thanks for opening this issue. This is definitely in the plan for Delta Lake, but we're currently prioritizing other features on the roadmap (#920), like OPTIMIZE ZORDER and CDF.
@norbitek, it's on the roadmap for 2022 H2 🥳 https://github.com/delta-io/delta/issues/1307
I tried to add a generated column using SQL. Do I understand correctly that it is not supported yet in PySpark?
@wedesoft Spark doesn't support it yet. SQL syntax support for generated columns is tracked by #1100.
Is this still on the roadmap?
Any news on this issue status?
Any update on the release date?
This is definitely still on the roadmap! However, at the moment all the focus is on completing Deletion Vectors, which is in high demand. We will only get to this item after that work is complete.
Since Delta Lake 3.1.0 (with deletion vectors) is out now, would you consider working on it for 3.2, @bart-samwel 😇
@keen85
> Since Delta Lake 3.1.0 (with deletion vectors) is out now, would you consider working on it for 3.2
Thank you for the reminder! It is near the top of our list now. I can't make any hard guarantees, but I'm hopeful that we'll get to this pretty soon.
@bart-samwel What is the reason that features in the Standalone version are implemented with such a long delay? Does it mean that for every new feature (for example, liquid clustering) we will wait about 2 years?
@norbitek
> What is the reason that features in the Standalone version are implemented with such a long delay?
Just to make sure there's no confusion here: Delta Standalone is different from the Spark connector for Delta Lake. Standalone is a library that can be used to implement connectors for non-Spark systems, and it is not really getting new features anymore, because its design is not well suited to supporting many of them. All of the new efforts are going into Delta Kernel, which is the new library for building connectors. It makes it a lot easier to keep up with new features, and we intend to keep it up to date.
Identity columns is a feature where we have unfortunately dropped the ball even for support in the Spark connector. It's the exception though, not the rule!
> Does it mean that for every new feature (for example, liquid clustering) we will wait about 2 years?
Certainly not! Like I said, identity columns is an exception. Liquid clustering is actually released in Delta Lake 3.1 which came out last week! https://github.com/delta-io/delta/releases
Hi, my company currently doesn't use Spark SQL anywhere, so I wanted to use the DeltaTableBuilder API here. Has this been resolved? If not, when can we expect this update?
Many thanks, Yogesh S
@SYOGESH045 The next release of Delta is going to be Delta 3.3. The identity column support seems to be in progress - https://github.com/delta-io/delta/pull/3044. So Delta 3.3 should have it. If I have to hazard a guess, Delta 3.3 should be released in 2-3 months.
The latest version of Databricks added support for identity columns in Delta tables. It is possible to specify GENERATED ALWAYS AS IDENTITY in a column specification.
It would be nice to do the same using DeltaTableBuilder, for example:
DeltaTable.create(spark) \
    .tableName("default.people10m") \
    .addColumn("id", "BIGINT", generatedAlwaysAs="IDENTITY(START WITH 10 INCREMENT BY 10)") \
    .addColumn("firstName", "STRING") \
    .addColumn("middleName", "STRING") \
    .addColumn("lastName", "STRING", comment="surname") \
    .addColumn("gender", "STRING") \
    .addColumn("birthDate", "TIMESTAMP") \
    .addColumn("dateOfBirth", DateType(), generatedAlwaysAs="CAST(birthDate AS DATE)") \
    .addColumn("ssn", "STRING") \
    .addColumn("salary", "INT") \
    .partitionedBy("gender") \
    .execute()
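
For comparison, here is a rough sketch of the SQL form that the builder call above would mirror. The helper `identity_table_ddl` is hypothetical and only assembles the statement text. Actually running it is assumed to need a runtime that accepts the `GENERATED ALWAYS AS IDENTITY` clause (Databricks, per the description above) and an active session named `spark`. The column list is shortened for illustration.

```python
def identity_table_ddl(table_name: str, start: int = 1, step: int = 1) -> str:
    """Build a CREATE TABLE statement with an identity column.

    Hypothetical helper for illustration; it only returns the SQL text.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    return (
        f"CREATE TABLE {table_name} (\n"
        f"  id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH {start} INCREMENT BY {step}),\n"
        "  firstName STRING,\n"
        "  lastName STRING COMMENT 'surname',\n"
        "  gender STRING,\n"
        "  birthDate TIMESTAMP,\n"
        "  dateOfBirth DATE GENERATED ALWAYS AS (CAST(birthDate AS DATE))\n"
        ") USING DELTA\n"
        "PARTITIONED BY (gender)"
    )


ddl = identity_table_ddl("default.people10m", start=10, step=10)
print(ddl)

# Assumption: a SparkSession named `spark` on a runtime that supports this syntax.
# spark.sql(ddl)
```

Keeping the statement as a plain string makes it easy to inspect before submitting it, and the same text can be passed to `spark.sql` from PySpark without using the builder API.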