Closed ArijitSinghEDA closed 1 hour ago
There has been a long standing issue in OSS Spark to have an API for passing through write options during SparkSQL. I believe it is tackled in Spark 4.0.X I ThinK @szehon-ho may know more
Thank you @RussellSpitzer!
I saw the PR made by @szehon-ho
It is what I was looking for, but is it only limited to INSERT
commands, or does it work with UPDATE
and DELETE
commands as well?
My concern is that Iceberg tracks snapshots for any changes made to any table, so I wish to track this according to the user even if they run UPDATE
or DELETE
command
UPDATE: I saw the other PRs made by @szehon-ho on the Spark GitHub, and they are like the exact things I am looking for. It will be an incredible upgrade when it will be ready.
Should we close this @ArijitSinghEDA since it will be fixed in Spark 4.X?
Query engine
Spark
Question
I am using Iceberg with PostgreSQL as catalog, MinIO as data storage and using Spark for interacting with Iceberg. My application can take multiple users working on the same table at the same time using Spark SQL. Now, when any updates are made to this table (INSERT, UPDATE, and/or DELETE), a snapshot is created, tracking these changes. Now, I wish to keep track of which user makes what changes, and for that I wish to add the property userid to the snapshot properties which are accessible through the summary column of the snapshot table.
With INSERT command, I can convert it to a Spark SQL DataFrame, and run the following to add userid to the snapshot properties
But, I cannot perform the same task while performing UPDATE or DELETE command.
I tried to use branching as well, but when we try to merge the data back to the main branch, the issue is that there should be an ID column for merging data back, and if there are too many rows in the branched table, then merging back will be a long and resource intensive task.
Is there any way I can pass the snapshot-property.userid as part of Spark SQL query? I searched online, Iceberg's and Spark's documentations, as well as ChatGPT, but nowhere was any solution regarding this.
If there is any better way than this, then also I am all ears.