apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
Extend Iceberg table stats API to allow publishing data and stats atomically #6442

Open findepi opened 1 year ago

findepi commented 1 year ago

Feature Request / Improvement

Currently, UpdateStatistics (org.apache.iceberg.Transaction#updateStatistics) only allows adding statistics for an existing snapshot. As a result, it is not possible to publish a snapshot together with statistics that have already been collected.

Collecting statistics for existing data is definitely an important use case (like Trino's ANALYZE), but some query engines (like Trino) can collect stats on the fly while writing to a table (INSERT, CREATE TABLE AS ...).

It's not difficult to

however this has some drawbacks

We should make it possible to publish a data change together with its new stats. This may require API changes. It may also require spec changes if we want to use the "inherit snapshot ID" model. (Maybe we don't have to, since stats are in metadata?)

Query engine

None

findepi commented 1 year ago

cc @rdblue

ajantha-bhat commented 1 year ago

but some query engines (like Trino) can collect stats on the fly, when writing to a table (INSERT, CREATE TABLE AS ...).

I think we have discussed this for partition stats too. @rdblue mentioned that we cannot have writers write stats on the fly (with INSERT, CTAS, UPDATE), because that requires bumping the Iceberg spec to V3: some writers would write stats and some would not, which can cause inconsistency.

We agreed on using ANALYZE syntax or a CALL procedure for generating stats until the V3 format is ready.

RussellSpitzer commented 1 year ago

@findepi shouldn't you be able to just change any write-only commit into a transaction that both performs the append and updates the statistics?

Like

AppendFiles(A, B, C)

becomes

Transaction Begin
  AppendFiles(A, B, C)
  UpdateStatistics(A, B, C)
Transaction End
Commit Transaction // Creates one snapshot that both appends the files and updates the statistics

Then it's up to the framework to build those transactions when required. This would be similar to the mergeSchema functions in Spark.

findepi commented 1 year ago

The UpdateStatistics API requires passing a snapshot ID. @RussellSpitzer, is the snapshot ID known before the transaction commits?

RussellSpitzer commented 1 year ago

Hmm, that is probably not possible, but I guess that's where we should modify the API? We do know the snapshot ID before we actually do the commit, so we should be able to just fill it in.
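To make the idea concrete, here is a minimal toy sketch in Python. It is not the Iceberg API: `ToyTable`, `ToyTransaction`, `append_files`, `update_statistics`, and the `row_count` stat are all hypothetical names used only for illustration. The one assumption it models is the point above: the snapshot ID is reserved before the commit, so the staged statistics can be attached to the same snapshot that publishes the data files.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Snapshot:
    snapshot_id: int
    data_files: tuple
    statistics: dict  # stat name -> value


class ToyTable:
    """Hypothetical in-memory stand-in for a table; not the Iceberg API."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.snapshots = []

    def new_transaction(self):
        # Reserve the snapshot ID up front so staged stats can refer to it.
        return ToyTransaction(self, next(self._ids))


class ToyTransaction:
    def __init__(self, table, snapshot_id):
        self._table = table
        self.snapshot_id = snapshot_id
        self._files = []
        self._stats = {}

    def append_files(self, *files):
        # Stage data files; nothing is visible in the table yet.
        self._files.extend(files)
        return self

    def update_statistics(self, stats):
        # Stage stats for the reserved snapshot ID; the caller never passes an ID.
        self._stats.update(stats)
        return self

    def commit(self):
        # Publish data files and stats together as a single snapshot.
        snapshot = Snapshot(self.snapshot_id, tuple(self._files), dict(self._stats))
        self._table.snapshots.append(snapshot)
        return snapshot


table = ToyTable()
txn = table.new_transaction()
txn.append_files("A", "B", "C").update_statistics({"row_count": 300})
snapshot = txn.commit()
```

In this sketch, readers never observe a snapshot that has the new files but lacks their statistics, because both become visible in the same `commit()` call.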

findepi commented 1 year ago

Good idea!

cc @rdblue @ajantha-bhat

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

findepi commented 1 year ago

It remains needed by Trino.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

findepi commented 2 months ago

It remains needed by Trino.

cc @alexjo2144 @findinpath