apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.45k stars 2.23k forks source link

Feature Request: Support mergeSchema option when using Spark MERGE INTO #5556

Open kunal-nandwana opened 2 years ago

kunal-nandwana commented 2 years ago

Feature Request / Improvement

Hi Team, I am using Iceberg in my project and I found a big thing which is missing from Iceberg which is easily available in Apache Hudi and Deltalake that is "merge schema". If possible this feature need to added into the Iceberg. I am attaching my last ticket which is explaining the problem that I am facing.Please find the below ticket for the refrence.

5548

@rdblue any thoughts on this?

Query engine

Spark

kbendick commented 2 years ago

For reference, the issue here is that the user wants to be able to use mergeScherma option when writing via MERGE INTO.

I'm not sure of a way to support that presently. If somebody does know, please comment 🙂

kunal-nandwana commented 2 years ago

It would be great help if someone could help in achieving this functionality.. We are struggling to do this thing manually...

kbendick commented 2 years ago

Just an FYI but I would update the title to be Feature Request: Support mergeSchema option when using Spark MERGE INTO. This is more explicit and gets to the heart of what it is you need.

The hints might not be something we can add without changing Spark, but the core of the idea is that you need mergeSchema to work with MERGE INTO (which is currently SQL only).

Removing the implementation constraint from the title might attract more eyeballs / bring more ideas to the table (as ultimately you don’t care about anything other than needing mergeInto to work).

kbendick commented 2 years ago

Also, what about a table property? Does the table experience writes where you explicitly do not want mergeSchema?

Generally, I think mergeSchem is safer as a per-query option and is somewhat unsafe as a table level configuration. But Spark makes it somehat hard to support that as there’s no Dataframe support for MERGE INTO currently.

For the long run, I’m going to bring up adding a merge into API to the dataframe / dataset API in Spark. But that could take a while. We might be able to provide implicit classes so that it’s do-able using the dataframe API in just Iceberg, but in the long run that should be moved to Spark (though that doesn’t solve your immediate problem, I know).

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 1 year ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

jhchee commented 1 year ago

Not stale

jaredtbates commented 1 year ago

This is definitely something we'd be interested in - We are doing something similar with Glue and would like to be able to support schema evolution with a MERGE INTO UPSERT. Currently we have to manually modify the iceberg schema every time our source schema changes.

This is a very similar architecture to what we're doing - https://aws.amazon.com/blogs/big-data/automate-replication-of-relational-sources-into-a-transactional-data-lake-with-apache-iceberg-and-aws-glue/

amogh-jahagirdar commented 1 year ago

Reopening due to interest in this

andreacfm commented 1 year ago

@kbendick is this still on your radar? If not could you give me some direction on where I could start to look at.

FabricioZGalvani commented 1 year ago

Any updates on this feature? I also have a strong interest in iceberg providing this solution.

RussellSpitzer commented 1 year ago

Anyone who would like to work on the issue is welcome to, there is currently no one I know working on it.

andreacfm commented 1 year ago

Delta Lake has the ability to set spark.databricks.delta.schema.autoMerge.enabled. I find this approach interesting as it can be used only when required. Once set the automatic schema evolution works for every write operation.

abhishekkrbaliase commented 8 months ago

Is it still being worked on? It would be nice if we can have either:

  1. Schema evolution (schema merge) for merge sql statements or 2 DataFrame API for merge queries
umiddey commented 2 months ago

Man if they added this, Iceberg would be used even more. This is the biggest pain point of Iceberg and I hope this is sorted because I don't see hard it could be, compared to what is already done by Iceberg, to add this feature.

nickdelnano commented 2 weeks ago

With the code for MERGE being moved to Spark with Spark 3.5 & Iceberg 1.5.0, is it even possible to support this feature solely with changes in Iceberg, or are changes in Spark required?