
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Create Branches / TAGS between 2 snapshots #9281

Open fanaticjo opened 9 months ago

fanaticjo commented 9 months ago

Feature Request / Improvement

Is there a way to create a branch / tag based on 2 snapshot IDs, or only on the latest data load?

We have a use case where we write a monthly generated report to an Iceberg table, and for every month we want to tag / branch that month's data for audit purposes.

Currently, creating a branch / tag from a snapshot includes all data from that snapshot back to the first snapshot.

If this is possible, please let us know; we would also be happy to contribute it.

Query engine

Spark

nastra commented 9 months ago

@fanaticjo can you elaborate on what you mean by "2 snapshot ids or only latest data"?

In short, branches/tags support the auditing use case, and you might want to take a look at the docs at https://iceberg.apache.org/docs/latest/branching/. I think what you're looking for is `ALTER TABLE prod.db.sample CREATE TAG historical-tag AS OF VERSION <snapshot_id>`. The snapshot_id in this case doesn't have to be the latest snapshot.
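
For example, roughly like this (table, tag name, and snapshot id are placeholders, and this assumes a SparkSession with an Iceberg catalog and the Iceberg SQL extensions enabled):

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg catalog and the Iceberg Spark SQL extensions are configured.
spark = SparkSession.builder.getOrCreate()

# Tag a specific snapshot (not necessarily the latest one) for auditing.
# prod.db.sample and the snapshot id are placeholders.
spark.sql("""
    ALTER TABLE prod.db.sample
    CREATE TAG `historical-tag`
    AS OF VERSION 1234567890123456789
""")

# Time travel by tag name returns the table state as of that snapshot.
spark.sql("SELECT * FROM prod.db.sample VERSION AS OF 'historical-tag'").show()
```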

fanaticjo commented 9 months ago

I want to create a branch / tag only for the latest data load, while AS OF VERSION includes the previous data as well as the latest load.

For example

insert 1, 2, 3 --- snapshot id 1

if I create a branch AS OF VERSION 1, the branch will have 1, 2, 3

next load: insert 4, 5, 6 --- snapshot id 2

if I create a branch AS OF VERSION 2, the branch will have 1, 2, 3, 4, 5, 6

What I want is a way to create a branch containing only 4, 5, 6 (see the sketch below for the current behaviour).
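
A rough sketch of what we see today, assuming Spark SQL with the Iceberg SQL extensions enabled (catalog, table, and branch names are placeholders, and each branch is created right after its load, which is equivalent to AS OF VERSION with that load's snapshot id):

```python
from pyspark.sql import SparkSession

# Assumes a Spark session configured with an Iceberg catalog named glue_catalog
# and the Iceberg Spark SQL extensions enabled.
spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE TABLE glue_catalog.db.report (id INT) USING iceberg")

spark.sql("INSERT INTO glue_catalog.db.report VALUES (1), (2), (3)")   # snapshot 1
spark.sql("ALTER TABLE glue_catalog.db.report CREATE BRANCH load_1")   # branch at snapshot 1

spark.sql("INSERT INTO glue_catalog.db.report VALUES (4), (5), (6)")   # snapshot 2
spark.sql("ALTER TABLE glue_catalog.db.report CREATE BRANCH load_2")   # branch at snapshot 2

# Reading load_2 returns 1..6 (the full table state at snapshot 2),
# but what we want is only the rows added by the second load: 4, 5, 6.
spark.sql("SELECT * FROM glue_catalog.db.report VERSION AS OF 'load_2'").show()
```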

nastra commented 9 months ago

Are you saying you want to create a branch/tag that refers to a snapshot without its history? I don't think this is possible today. What would be the use case for not keeping the ancestor history, or is there a particular concern with keeping it?

fanaticjo commented 9 months ago

We just want to use a tag / branch to pull out only the data written in that period. I saw there is an incremental read available through the DataFrame API:

```python
df = spark.read \
    .format("iceberg") \
    .option("start-snapshot-id", "360041659320668788") \
    .option("end-snapshot-id", "9170237062650942416") \
    .load("glue_catalog.playground.cash_report_iceberg")
```

If this can also be done through Spark SQL, that would solve our requirement as well.
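
One workaround we could probably live with is registering the incremental read as a temporary view and then querying it with Spark SQL; a rough sketch (the view name is arbitrary, and this assumes a SparkSession already configured with the glue_catalog Iceberg catalog):

```python
from pyspark.sql import SparkSession

# Assumes a Spark session configured with the glue_catalog Iceberg catalog.
spark = SparkSession.builder.getOrCreate()

# Incremental read between the two snapshots, as in the DataFrame example above.
incremental_df = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "360041659320668788")
    .option("end-snapshot-id", "9170237062650942416")
    .load("glue_catalog.playground.cash_report_iceberg")
)

# Expose the incremental rows to Spark SQL through a temporary view.
incremental_df.createOrReplaceTempView("cash_report_incremental")

spark.sql("SELECT * FROM cash_report_incremental").show()
```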