apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.49k stars 2.24k forks source link

User ID information in Iceberg Table's snapshot #11474

Closed ArijitSinghEDA closed 1 hour ago

ArijitSinghEDA commented 2 weeks ago

Query engine

Spark

Question

I am using Iceberg with PostgreSQL as catalog, MinIO as data storage and using Spark for interacting with Iceberg. My application can take multiple users working on the same table at the same time using Spark SQL. Now, when any updates are made to this table (INSERT, UPDATE, and/or DELETE), a snapshot is created, tracking these changes. Now, I wish to keep track of which user makes what changes, and for that I wish to add the property userid to the snapshot properties which are accessible through the summary column of the snapshot table.

With INSERT command, I can convert it to a Spark SQL DataFrame, and run the following to add userid to the snapshot properties

# Assuming df is the dataframe created for INSERT command

df.write.mode("append").option("snapshot-property.userid", 123).insertInto("iceberg.ns.tbl")

But, I cannot perform the same task while performing UPDATE or DELETE command.


I tried to use branching as well, but when we try to merge the data back to the main branch, the issue is that there should be an ID column for merging data back, and if there are too many rows in the branched table, then merging back will be a long and resource intensive task.


Is there any way I can pass the snapshot-property.userid as part of Spark SQL query? I searched online, Iceberg's and Spark's documentations, as well as ChatGPT, but nowhere was any solution regarding this.

If there is any better way than this, then also I am all ears.

RussellSpitzer commented 2 weeks ago

There has been a long standing issue in OSS Spark to have an API for passing through write options during SparkSQL. I believe it is tackled in Spark 4.0.X I ThinK @szehon-ho may know more

ArijitSinghEDA commented 2 weeks ago

Thank you @RussellSpitzer! I saw the PR made by @szehon-ho It is what I was looking for, but is it only limited to INSERT commands, or does it work with UPDATE and DELETE commands as well?

My concern is that Iceberg tracks snapshots for any changes made to any table, so I wish to track this according to the user even if they run UPDATE or DELETE command


UPDATE: I saw the other PRs made by @szehon-ho on the Spark GitHub, and they are like the exact things I am looking for. It will be an incredible upgrade when it will be ready.

RussellSpitzer commented 11 hours ago

Should we close this @ArijitSinghEDA since it will be fixed in Spark 4.X?