Allow snapshotting iceberg table (create new table based on certain Iceberg snapshot)

szehon-ho commented 3 years ago

Had a use case to experiment on the side with a certain table snapshot (do some modifications), but didn't want to alter the table's history.

I think a 'snapshot' command would be very useful here. We can quickly generate separate table metadata pointing to the snapshot, instead of copying all the data into a side table for the experiments.

If an Iceberg table is the source, the snapshot procedure can use the latest table snapshot, or potentially take a snapshotId as an argument.

Was chatting with @RussellSpitzer, and the only potential problem is that if you expire the original table's snapshot and remove orphan files, the new table can no longer be read. But it is the same problem as snapshotting a Hive table (dropping some files on the original table will corrupt the new table).
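To make the failure mode concrete, here is a minimal toy sketch. It assumes a table is just a mapping from snapshot ID to a set of data file paths; the function names (`snapshot_table`, `expire_snapshot`, `remove_orphan_files`, `is_readable`) are hypothetical and are not Iceberg APIs.

```python
# Toy model only: a table is a dict of snapshot id -> set of data file paths.
# None of these helpers are Iceberg APIs; they are hypothetical illustrations.

def snapshot_table(source, snapshot_id):
    """Create a side table whose single snapshot reuses the source's files."""
    return {snapshot_id: set(source[snapshot_id])}

def expire_snapshot(table, snapshot_id):
    """Drop one snapshot from a table's own history."""
    del table[snapshot_id]

def remove_orphan_files(table, storage):
    """Delete every stored file that this one table can no longer reach."""
    reachable = set().union(*table.values()) if table else set()
    storage.intersection_update(reachable)

def is_readable(table, storage):
    """A table is readable only if every file it references still exists."""
    return all(files <= storage for files in table.values())

storage = {"a.parquet", "b.parquet", "c.parquet"}
original = {1: {"a.parquet", "b.parquet"}, 2: {"b.parquet", "c.parquet"}}

side = snapshot_table(original, 1)
readable_before_cleanup = is_readable(side, storage)

# Cleanup on the original only considers the original's own snapshots,
# so a.parquet looks like an orphan even though the side table needs it.
expire_snapshot(original, 1)
remove_orphan_files(original, storage)
readable_after_cleanup = is_readable(side, storage)
```

In this sketch the cleanup is correct from the original table's point of view, which is exactly why the side table breaks: nothing tells the original that another table still references its files.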

szehon-ho commented 3 years ago

cc @aokolnychyi for any thoughts, potential issues

szehon-ho commented 3 years ago

Actually, looking through the code, this use case can probably be solved by the SnapshotUpdate.stageOnly() flag. Taking a look.

aokolnychyi commented 3 years ago

I remember @rdblue @danielcweeks @Parth-Brahmbhatt mentioning similar attempts and that there were some issues but I don't recall any details.

pvary commented 3 years ago

We were considering enhancing HiveCatalog to point to a specific snapshot of an Iceberg table. This could be useful if we want to share a specific version of the table but still want to continue adding more data to the table, but a snapshot table might also solve this problem.

szehon-ho commented 3 years ago

Yeah, snapshotting an Iceberg table would be a great usability feature for sharing a certain snapshot of a table with others.

The stageOnly() API probably works for this, but it is not well known and is hard to expose through Spark/Hive in a way that can be shared with other users.

One risk would be a user running snapshot on an Iceberg table, then dropping it thinking they are only dropping the snapshot.
If I'm not mistaken, dropping the table through the SparkCatalog (purge=true) will drop all the current data of the original table (a general problem of the snapshot command).

jackye1995 commented 3 years ago

Yeah this is a common case that I also see many people trying to achieve through Iceberg.

For exposing this to Spark and Hive, would a view help? We can create a "snapshot view" for people to query, for example:

CREATE VIEW my_table_2020 AS SELECT * FROM my_table@1234567890;

Then dropping the view would not affect anything of the underlying snapshot.

jzhuge commented 3 years ago

If Iceberg's TableCatalog also supports ViewCatalog, we may be able to create this "snapshot" view on the fly.

pvary commented 3 years ago

@jackye1995: In Hive currently there is no way to parse my_table@1234567890. I suspect that in Spark this would mean something like my_table with snapshotId=1234567890. Am I right?

@jzhuge: What is ViewCatalog? Is it a Spark interface we should implement?

jackye1995 commented 3 years ago

@pvary yeah, sorry, I am making up this syntax because something similar exists in Delta Lake. So it is definitely doable in Spark, but for Hive, yes, it is going to be hard to add extensions like this.

rdblue commented 3 years ago

For this use case, we may want to consider adding git-like branches and tags instead. I think it would be cleaner to branch and then update the branch. Then you'd be able to stay within a table and reuse data files more cleanly. Sharing files across independent tables has a lot of problems.

That would also have a cleaner syntax: catalog.db.table.branch. That would prevent us from using some branch names, like files but I think overall it would be okay.
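A rough sketch of why a `catalog.db.table.branch` syntax reserves some branch names: the last identifier part has to be classified as either a metadata table or a branch. The resolver below is hypothetical, and `METADATA_TABLES` is only an illustrative subset of names, not a complete or authoritative list.

```python
# Hypothetical resolver, not an Iceberg API. The reserved set is an
# illustrative subset of metadata table names.
METADATA_TABLES = {"files", "snapshots", "history", "manifests"}

def create_branch(name, branches):
    """Register a branch, rejecting names that would shadow a metadata table."""
    if name in METADATA_TABLES:
        raise ValueError(f"branch name is reserved: {name}")
    branches.add(name)

def resolve(identifier, branches):
    """Split 'catalog.db.table[.suffix]' and classify the optional suffix."""
    parts = identifier.split(".")
    if len(parts) == 3:
        return {"table": identifier, "kind": "table", "ref": None}
    if len(parts) != 4:
        raise ValueError(f"unsupported identifier: {identifier}")
    table, suffix = ".".join(parts[:3]), parts[3]
    if suffix in METADATA_TABLES:
        return {"table": table, "kind": "metadata", "ref": suffix}
    if suffix in branches:
        return {"table": table, "kind": "branch", "ref": suffix}
    raise ValueError(f"unknown branch or metadata table: {suffix}")

branches = set()
create_branch("experiment", branches)

as_branch = resolve("catalog.db.table.experiment", branches)
as_metadata = resolve("catalog.db.table.files", branches)

try:
    create_branch("files", branches)
    reserved_rejected = False
except ValueError:
    reserved_rejected = True
```

Checking the reserved names first keeps existing metadata table queries working unchanged, at the cost of a few unusable branch names.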

jackye1995 commented 3 years ago

> we may want to consider adding git-like branches and tags instead

@rdblue yes that would be ideal, but Nessie is trying to achieve this git-like experience. Currently it seems like people do want to continue using their catalog and also have that experience. Was there any discussion about this conflict of interest?

rdblue commented 3 years ago

I've talked with @rymurr about the way that Nessie currently works and I think we generally agree that we would want to change it to use Iceberg-native branching and tagging.

The problem with Nessie's current model is that it keeps references to multiple metadata files instead of tracking everything in one place. That means:

- We have to coordinate across metadata file versions even though Iceberg assumes that you don't do that: for example, that breaks the file cleanup assumptions because we compare the files that are reachable from all snapshots.
- Changes that shouldn't be part of transactions may change between branches. For example, if you add a column in a branch and write data, you will have assigned a new ID and used it in a data file. If you did that in two branches in parallel, you'd use the same ID for two different columns. It may appear safe to merge the metadata trees, but it actually isn't because that would mix column data together.
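The column ID collision can be illustrated with a small toy model. It assumes a schema is a mapping from column name to field ID, that a data file stores values keyed by field ID, and that each branch assigns the next free ID independently; `add_column` and `read` are hypothetical helpers, not Iceberg code.

```python
# Toy model only: schemas map column name -> field ID, and data files
# store values keyed by field ID. These helpers are hypothetical.

def add_column(schema, name):
    """Assign the next free field ID, as each branch would do on its own."""
    new_id = max(schema.values()) + 1
    return {**schema, name: new_id}

def read(schema, data_file, column):
    """Look a column up by its field ID, ignoring the column name."""
    return data_file.get(schema[column])

base = {"id": 1, "ts": 2}

# Two branches evolve the same base schema in parallel.
branch_a = add_column(base, "price")
branch_b = add_column(base, "country")

# Each branch writes a data file using its own ID assignment.
file_a = {1: 10, 2: "2021-01-01", 3: 9.99}
file_b = {1: 11, 2: "2021-01-02", 3: "DE"}

# After a naive merge, branch A's schema is used to read branch B's file,
# and the 'price' column picks up country data.
price_from_a = read(branch_a, file_a, "price")
price_from_b = read(branch_a, file_b, "price")
```

Both branches hand out field ID 3, so the files look compatible at the metadata level while holding different columns under the same ID.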

We can fix those issues with Iceberg-native branching and tagging. I think that's the right option for use cases where you want to branch from current tables for testing purposes.

szehon-ho commented 3 years ago

> That would also have a cleaner syntax: catalog.db.table.branch. That would prevent us from using some branch names, like files but I think overall it would be okay.

Seems like a nice way to expose staged-snapshots. We could expose metadata of this branch as well, like catalog.db.table.branch.files?

And I suppose Hive can support this easily as well then, if it's already able to parse metadata tables.

pvary commented 3 years ago

I also like the idea of branching / tagging tables.

> And I suppose Hive can support this easily as well then, if it's already able to parse metadata tables.

AFAIK we do not have a way to expose table metadata ATM. I am still not sure what would be the best way to allow searching for snapshots etc.

If I remember correctly then Expedia had a way to create specific Hive tables where the schema was the SNAPSHOT_SCHEMA and the content was the list of the snapshots, but that required the user to create a second table just to query the metadata. @massdosage might know more (but this could be a different topic)

jackye1995 commented 3 years ago

> I think we generally agree that we would want to change it to use Iceberg-native branching and tagging

That would be great! I also have a few requests on my side regarding this feature.

> AFAIK we do not have a way to expose table metadata ATM

+1, creating an overlay for metadata table is a feasible workaround but is quite inconvenient. I am also interested in knowing if there is any good way to achieve that in Hive, but so far I don't see a way to do it without adding more hooks in Hive.

rymurr commented 3 years ago

Hey all, sorry for being late to the party.

I've asked @rdblue if we can discuss this in more detail at the next sync up; hopefully everyone can join for a brainstorm session. Would be great to make branches and tags first class citizens in Iceberg!

One thing I would like to solve is how to efficiently sync branches/tags across multiple tables. The git-like model in Nessie makes this trivial, as all tables in the catalog are included in a Nessie 'commit'. I am not sure how we can do this directly in Iceberg efficiently, but I think it is important.
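As a rough illustration of why a catalog-level commit makes cross-table tags cheap, here is a toy model. It assumes each commit stores a full mapping of table name to snapshot ID; `ToyCatalog` and its methods are hypothetical and do not reflect the Nessie or Iceberg APIs.

```python
# Toy model only; not the Nessie or Iceberg API.
class ToyCatalog:
    def __init__(self):
        self.commits = [{}]   # each commit maps table name -> snapshot id
        self.tags = {}        # tag name -> index into self.commits

    def commit(self, table, snapshot_id):
        """Record a new catalog state with one table pointer advanced."""
        state = dict(self.commits[-1])
        state[table] = snapshot_id
        self.commits.append(state)

    def tag(self, name):
        """Pin every table at once by remembering the current commit."""
        self.tags[name] = len(self.commits) - 1

    def snapshot_of(self, table, tag):
        """Return the snapshot id a table had when the tag was created."""
        return self.commits[self.tags[tag]][table]

catalog = ToyCatalog()
catalog.commit("db.orders", 101)
catalog.commit("db.customers", 201)
catalog.tag("release-1")
catalog.commit("db.orders", 102)   # later write; the tag is unaffected

pinned = (
    catalog.snapshot_of("db.orders", "release-1"),
    catalog.snapshot_of("db.customers", "release-1"),
)
latest_orders = catalog.commits[-1]["db.orders"]
```

With per-table references instead, the same tag would need one coordinated update per table, which is the synchronization problem raised above.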

I've proposed an interface in #2304 which adds branch/tag support to catalogs, and I am currently experimenting w/ aligning Nessie commits closer to Iceberg snapshots to deal w/ the issue @rdblue described above.

YehorKrivokon commented 10 months ago

Hi all, I'm trying to use Iceberg with Spark and SparkCatalog and I've a repro of this issue. Is there any workaround to work with snapshots?

Thank you.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 4 months ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

apache / iceberg

Allow snapshotting iceberg table (create new table based on certain Iceberg snapshot) #2481