Extends Iceberg table stats API to allow publish data and stats atomically

findepi commented 1 year ago

Feature Request / Improvement

Currently, UpdateStatistics (org.apache.iceberg.Transaction#updateStatistics) allows adding statistics only for an existing snapshot. As a result, it is not possible to publish a snapshot with statistics already collected.

Collecting statistics for existing data is definitely an important use case (like Trino's ANALYZE), but some query engines (like Trino) can collect stats on the fly when writing to a table (INSERT, CREATE TABLE AS ...).

It's not difficult to

1. publish a data change snapshot (adding new files)
2. take note of the new snapshot ID
3. add statistics for that snapshot

However, this has some drawbacks:

- new data is published without stats, so other queries can be planned sub-optimally, leading to e.g. improper use of cluster resources, or even unexpected query failures (if the data changed significantly)
- someone may run ANALYZE on the new snapshot (unknowingly or intentionally), so two different threads end up trying to add stats to it, which is wasted work
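
To make the first drawback concrete, here is a minimal toy model of the three-step flow. It is plain Java and does not use the Iceberg API: `ToyTable`, `commitAppend`, `setStatistics` and `statisticsFor` are hypothetical stand-ins invented for illustration. It only shows that a reader can observe the new snapshot before any stats are attached.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class NonAtomicStatsDemo {

    // Hypothetical in-memory stand-in for table metadata; NOT the Iceberg API.
    public static final class ToyTable {
        private final List<Long> snapshotIds = new ArrayList<>();
        private final Map<Long, String> statsBySnapshot = new HashMap<>();
        private long nextSnapshotId = 1;

        // Step 1: publish a data change; the new snapshot is visible immediately.
        public long commitAppend() {
            long id = nextSnapshotId++;
            snapshotIds.add(id);
            return id;
        }

        // Step 3: attach statistics to an already published snapshot.
        public void setStatistics(long snapshotId, String statsFile) {
            if (!snapshotIds.contains(snapshotId)) {
                throw new IllegalArgumentException("Unknown snapshot: " + snapshotId);
            }
            statsBySnapshot.put(snapshotId, statsFile);
        }

        // What a concurrent reader would find when planning against a snapshot.
        public Optional<String> statisticsFor(long snapshotId) {
            return Optional.ofNullable(statsBySnapshot.get(snapshotId));
        }
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();

        // Steps 1 and 2: publish the data change and note the snapshot ID.
        long snapshotId = table.commitAppend();

        // A query planned at this point sees the snapshot but no stats.
        System.out.println("stats present before step 3: "
                + table.statisticsFor(snapshotId).isPresent());

        // Step 3: stats arrive in a separate, later update.
        table.setStatistics(snapshotId, "stats-file-1");
        System.out.println("stats present after step 3: "
                + table.statisticsFor(snapshotId).isPresent());
    }
}
```

The gap between `commitAppend` and `setStatistics` is the window in which other queries, or a concurrent ANALYZE, can see the snapshot without stats.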

We should make it possible to publish a data change together with new stats. This may require API changes. It may also require spec changes, if we want to use the "inherit snapshot ID" model. (Maybe we don't have to, since stats are in metadata?)

Query engine

None

findepi commented 1 year ago

cc @rdblue

ajantha-bhat commented 1 year ago

> but some query engines (like Trino) can collect stats on the fly, when writing to a table (INSERT, CREATE TABLE AS ...).

I think we have discussed this for partition stats too. @rdblue mentioned that we cannot have writers write stats on the fly (with INSERT, CTAS, UPDATE), because it requires bumping the Iceberg spec to V3: some writers will write stats and some will not, which can cause inconsistency.

We agreed on using the ANALYZE syntax or a CALL procedure for generating stats until the V3 format is ready.

RussellSpitzer commented 1 year ago

@findepi shouldn't you be able to just change any write-only commit into a transaction that both performs the append and updates the statistics?

Like

AppendFiles(A, B, C)

becomes

Transaction Begin
  AppendFiles(A, B, C)
  UpdateStatistics(A, B, C)
Transaction End
Commit Transaction // Creates one Snapshot which both appends files and updates statistics

Then it's up to the framework to build those transactions when required. This would be similar to the mergeSchema functions in Spark.

findepi commented 1 year ago

The UpdateStatistics API requires passing a snapshot ID. @RussellSpitzer, is the snapshot ID known before the transaction commits?

RussellSpitzer commented 1 year ago

Hmm, that is probably not possible, but I guess that's where we should modify the API? We do know the snapshot ID before we actually do the commit, so we should be able to just fill it in.
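
A rough sketch of that idea, again as a toy in-memory model rather than real Iceberg classes: `ToyTable`, `ToyTransaction`, `stageStatistics` and `pendingSnapshotId` are all hypothetical names. The assumption illustrated is only that the transaction already knows the snapshot ID it will produce, so stats staged without an explicit ID can be bound to it at commit time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class AtomicStatsSketch {

    // Hypothetical in-memory table; NOT the Iceberg API.
    public static final class ToyTable {
        private final Map<Long, String> statsBySnapshot = new HashMap<>();
        private Long currentSnapshotId = null;
        private long nextSnapshotId = 1;

        public ToyTransaction newTransaction() {
            return new ToyTransaction(this);
        }

        public Optional<Long> currentSnapshotId() {
            return Optional.ofNullable(currentSnapshotId);
        }

        public Optional<String> statisticsFor(long snapshotId) {
            return Optional.ofNullable(statsBySnapshot.get(snapshotId));
        }
    }

    // Hypothetical transaction that reserves its snapshot ID up front.
    public static final class ToyTransaction {
        private final ToyTable table;
        private final long pendingSnapshotId;
        private String pendingStatsFile;

        ToyTransaction(ToyTable table) {
            this.table = table;
            this.pendingSnapshotId = table.nextSnapshotId++;
        }

        public long pendingSnapshotId() {
            return pendingSnapshotId;
        }

        // The caller passes no snapshot ID; it is filled in at commit.
        public ToyTransaction stageStatistics(String statsFile) {
            this.pendingStatsFile = statsFile;
            return this;
        }

        // Publishes the snapshot and its stats in one metadata update.
        public void commit() {
            if (pendingStatsFile != null) {
                table.statsBySnapshot.put(pendingSnapshotId, pendingStatsFile);
            }
            table.currentSnapshotId = pendingSnapshotId;
        }
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();
        ToyTransaction txn = table.newTransaction();
        txn.stageStatistics("stats-file-1");

        // Nothing is visible to readers until commit.
        System.out.println("snapshot visible before commit: "
                + table.currentSnapshotId().isPresent());

        txn.commit();
        long id = table.currentSnapshotId().get();
        System.out.println("stats present at first visibility: "
                + table.statisticsFor(id).isPresent());
    }
}
```

In this model a reader never observes the new snapshot without its stats, which is the behavior the issue asks for. How a real API would expose this is an open design question.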

findepi commented 1 year ago

Good idea!

cc @rdblue @ajantha-bhat

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

findepi commented 1 year ago

It remains needed by Trino.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

findepi commented 2 months ago

It remains needed by Trino.

cc @alexjo2144 @findinpath

apache / iceberg

Extends Iceberg table stats API to allow publish data and stats atomically #6442