Closed szehon-ho closed 4 months ago
cc @aokolnychyi for any thoughts, potential issues
Actually, looking through the code, this use case can probably be solved by the SnapshotUpdate.stageOnly() flag. Taking a look.
I remember @rdblue @danielcweeks @Parth-Brahmbhatt mentioning similar attempts and that there were some issues but I don't recall any details.
We were considering enhancing HiveCatalog to point to a specific snapshot of an Iceberg table. This could be useful if we want to share a specific version of the table but still want to continue adding more data to the table, but a snapshot table might also solve this problem.
Yeah, snapshotting an Iceberg table would be a great usability feature for sharing a certain snapshot of a table with others.
The stageOnly() API probably works for this but is not well known, and hard to expose through Spark/Hive to be shareable to other users.
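To make the staging idea concrete, here is a toy Python model of what a stage-only commit means: the new snapshot is recorded in the table metadata, but the current-snapshot pointer does not move. This is not the Iceberg API; `ToyTable`, `commit`, and `read` are hypothetical names that only loosely mirror the behavior of SnapshotUpdate.stageOnly().

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyTable:
    """Toy stand-in for table metadata; hypothetical, not the real Iceberg classes."""
    snapshots: Dict[int, List[str]] = field(default_factory=dict)  # snapshot id -> data files
    current_snapshot_id: Optional[int] = None
    _next_id: int = 1

    def commit(self, new_files: List[str], stage_only: bool = False) -> int:
        """Record a snapshot that adds new_files on top of the current snapshot.

        With stage_only=True the snapshot is stored but does not become current,
        so regular readers do not see it.
        """
        base = self.snapshots.get(self.current_snapshot_id, [])
        snapshot_id = self._next_id
        self._next_id += 1
        self.snapshots[snapshot_id] = base + new_files
        if not stage_only:
            self.current_snapshot_id = snapshot_id
        return snapshot_id

    def read(self, snapshot_id: Optional[int] = None) -> List[str]:
        """Return the files visible at snapshot_id, or at the current snapshot."""
        target = self.current_snapshot_id if snapshot_id is None else snapshot_id
        return list(self.snapshots.get(target, []))
```

In this model a reader has to know the staged snapshot id to see the experimental data, which is the same discoverability problem described above.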
One risk would be a user running snapshot on an Iceberg table, then dropping it thinking they are only dropping the snapshot.
If I'm not mistaken, dropping the table through the SparkCatalog (purge=true) will drop all the current data of the original table (a general problem of the snapshot command).
Yeah this is a common case that I also see many people trying to achieve through Iceberg.
For exposing this to Spark and Hive, would a view help? We can create a "snapshot view" for people to query, for example:
CREATE VIEW my_table_2020 AS SELECT * FROM my_table@1234567890;
Then dropping the view would not affect anything of the underlying snapshot.
If Iceberg's TableCatalog also supports ViewCatalog, we may be able to create this "snapshot" view on the fly.
@jackye1995: In Hive currently there is no way to parse my_table@1234567890. I suspect that in Spark this would mean something like my_table with snapshotId=1234567890. Am I right?
@jzhuge: What is ViewCatalog? Is it a Spark interface we should implement?
@pvary yeah, sorry, I am making up this syntax because something similar exists in Delta Lake. So it is definitely doable in Spark, but for Hive, yes, it is going to be hard to add extensions like this.
For this use case, we may want to consider adding git-like branches and tags instead. I think it would be cleaner to branch and then update the branch. Then you'd be able to stay within a table and reuse data files more cleanly. Sharing files across independent tables has a lot of problems.
That would also have a cleaner syntax: catalog.db.table.branch. That would prevent us from using some branch names, like files, but I think overall it would be okay.
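The naming conflict can be sketched with a small resolver. This is only an illustration in Python under stated assumptions: the `resolve` function, the `ResolvedName` type, and the reserved-name set are hypothetical, and the set lists only a few metadata table names rather than a complete list.

```python
from typing import NamedTuple, Optional

# Assumed, incomplete set of metadata table names that a branch could not reuse.
RESERVED_METADATA_TABLES = {"files", "snapshots", "history", "manifests"}


class ResolvedName(NamedTuple):
    table: str
    branch: Optional[str]
    metadata_table: Optional[str]


def resolve(identifier: str) -> ResolvedName:
    """Resolve catalog.db.table[.suffix]; the suffix is a metadata table or a branch."""
    parts = identifier.split(".")
    if len(parts) == 3:
        return ResolvedName(identifier, None, None)
    if len(parts) != 4:
        raise ValueError(f"expected 3 or 4 name parts, got {len(parts)}: {identifier}")
    table = ".".join(parts[:3])
    suffix = parts[3]
    # Metadata table names win, so a branch called "files" would be unreachable.
    if suffix in RESERVED_METADATA_TABLES:
        return ResolvedName(table, None, suffix)
    return ResolvedName(table, suffix, None)
```

For example, `resolve("catalog.db.table.audit")` yields a branch named `audit`, while `resolve("catalog.db.table.files")` yields the `files` metadata table and no branch.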
we may want to consider adding git-like branches and tags instead
@rdblue yes that would be ideal, but Nessie is trying to achieve this git-like experience. Currently it seems like people do want to continue using their catalog and also have that experience. Was there any discussion about this conflict of interest?
I've talked with @rymurr about the way that Nessie currently works and I think we generally agree that we would want to change it to use Iceberg-native branching and tagging.
The problem with Nessie's current model is that it keeps references to multiple metadata files instead of tracking everything in one place. That means:
We can fix those issues with Iceberg-native branching and tagging. I think that's the right option for use cases where you want to branch from current tables for testing purposes.
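As a rough illustration of "tracking everything in one place", the toy Python model below keeps all snapshots plus named branches and tags in a single metadata object. It is a sketch under assumptions, not a proposed design: `ToyBranchingTable` and its methods are hypothetical names, and real snapshots track manifests rather than plain file lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyBranchingTable:
    """Toy model (hypothetical): one metadata object holds snapshots and named refs."""
    snapshots: Dict[int, List[str]] = field(default_factory=lambda: {0: []})
    branches: Dict[str, int] = field(default_factory=lambda: {"main": 0})
    tags: Dict[str, int] = field(default_factory=dict)

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        # A new branch starts at the same snapshot; no data files are copied.
        self.branches[name] = self.branches[from_branch]

    def create_tag(self, name: str, branch: str = "main") -> None:
        # A tag is a fixed name for one snapshot and is never advanced.
        self.tags[name] = self.branches[branch]

    def append(self, branch: str, new_files: List[str]) -> int:
        # Committing to a branch adds a snapshot and moves only that branch.
        parent = self.branches[branch]
        snapshot_id = max(self.snapshots) + 1
        self.snapshots[snapshot_id] = self.snapshots[parent] + new_files
        self.branches[branch] = snapshot_id
        return snapshot_id

    def files(self, ref: str) -> List[str]:
        # Look the name up as a branch first, then as a tag.
        snapshot_id = self.branches.get(ref, self.tags.get(ref))
        if snapshot_id is None:
            raise KeyError(f"unknown branch or tag: {ref}")
        return list(self.snapshots[snapshot_id])
```

Because every ref lives in the same metadata, maintenance code in this model can see which snapshots are still reachable from any branch or tag before deleting files.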
That would also have a cleaner syntax: catalog.db.table.branch. That would prevent us from using some branch names, like files, but I think overall it would be okay.
Seems like a nice way to expose staged-snapshots. We could expose metadata of this branch as well, like catalog.db.table.branch.files?
And I suppose Hive can support this easily as well then, if it's already able to parse metadata tables.
I also like the idea of branching / tagging tables.
And I suppose Hive can support this easily as well then, if it's already able to parse metadata tables.
AFAIK we do not have a way to expose table metadata ATM. I am still not sure what would be the best way to allow searching for snapshots etc.
If I remember correctly, Expedia had a way to create specific Hive tables where the schema was the SNAPSHOT_SCHEMA and the content was the list of the snapshots, but that required the user to create a second table just to query the metadata. @massdosage might know more (but this could be a different topic).
I think we generally agree that we would want to change it to use Iceberg-native branching and tagging
That would be great! I also have a few requests on my side regarding this feature.
AFAIK we do not have a way to expose table metadata ATM
+1, creating an overlay for metadata table is a feasible workaround but is quite inconvenient. I am also interested in knowing if there is any good way to achieve that in Hive, but so far I don't see a way to do it without adding more hooks in Hive.
Hey all, sorry for being late to the party.
I've asked @rdblue if we can discuss this in more detail at the next sync up; hopefully everyone can join for a brainstorm session. It would be great to make branches and tags first-class citizens in Iceberg!
One thing I would like to solve is how to efficiently sync branches/tags across multiple tables. The git-like model in nessie makes this trivial as all tables in the catalog are included in a nessie 'commit'. I am not sure how we can do this directly in iceberg efficiently but I think it is important.
I've proposed an interface here #2304 which adds branch/tag support to catalogs and am currently experimenting w/ aligning nessie commits closer to iceberg snapshots to deal w/ the issue @rdblue described above.
Hi all, I'm trying to use Iceberg with Spark and SparkCatalog and I've a repro of this issue. Is there any workaround to work with snapshots?
Thank you.
Had a use case to experiment on the side with a certain table snapshot (do some modifications), but didn't want to alter the table's history.
I think the 'snapshot' command would be very useful here. We can quickly generate separate table metadata pointing to the snapshot, instead of copying all the data into a side table for the experiments.
If an Iceberg table is the source, the snapshot procedure can use the latest table snapshot, or potentially take a snapshotId as an argument.
I was chatting with @RussellSpitzer, and the only potential problem is that if you expire the original table's snapshot and remove orphan files, the new table can no longer be read. But it is the same problem as snapshotting a Hive table (dropping some files on the original table will corrupt the new table).
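The failure mode can be reproduced with a toy Python model. Everything here is hypothetical (file paths, the `expire_and_clean` and `readable` helpers), and the cleanup is a simplification of snapshot expiration and orphan file removal: it only shows that cleanup driven by one table's metadata cannot see references held by another table.

```python
from typing import Dict, List, Set

# Shared storage: the data file paths that physically exist (made-up paths).
storage: Set[str] = {"data/a.parquet", "data/b.parquet"}

# The original table has two snapshots; the side table was created from snapshot 1.
original_snapshots: Dict[int, List[str]] = {
    1: ["data/a.parquet"],
    2: ["data/b.parquet"],  # e.g. after an overwrite that replaced a.parquet
}
side_table_files: List[str] = list(original_snapshots[1])


def expire_and_clean(snapshots: Dict[int, List[str]], keep: int, files: Set[str]) -> None:
    """Drop every snapshot except `keep`, then delete files only they referenced.

    The cleanup consults this table's metadata alone, so it cannot know that
    another table still points at the same files.
    """
    expired_files: Set[str] = set()
    for snapshot_id in list(snapshots):
        if snapshot_id != keep:
            expired_files.update(snapshots.pop(snapshot_id))
    still_referenced = {f for fs in snapshots.values() for f in fs}
    files.difference_update(expired_files - still_referenced)


def readable(table_files: List[str], files: Set[str]) -> bool:
    """A table is readable only if every file it references still exists."""
    return all(f in files for f in table_files)
```

Calling `expire_and_clean(original_snapshots, keep=2, files=storage)` removes `data/a.parquet`, after which `readable(side_table_files, storage)` is false even though the side table itself was never modified.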