Open findepi opened 1 year ago
cc @rdblue
but some query engines (like Trino) can collect stats on the fly, when writing to a table (INSERT, CREATE TABLE AS ...).
I think we have discussed this for partition stats too. @rdblue mentioned that we cannot have writers write stats on the fly (with INSERT, CTAS, UPDATE), because that requires bumping the Iceberg spec to V3: some writers would write stats and some would not, which can cause inconsistency.
We agreed on using ANALYZE syntax or a CALL procedure for generating stats until the V3 format is ready.
@findepi shouldn't you be able to just change any write-only commit into a transaction that both performs the append and updates the statistics?
Like
AppendFiles(A, B, C)
becomes
Transaction Begin
AppendFiles(A, B, C)
UpdateStatistics(A, B, C)
Transaction End
Commit Transaction // Creates one Snapshot which both appends files and updates statistics
Then it's up to the framework to build those transactions when required. This would be similar to the mergeSchema functions in Spark.
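To make the idea above concrete, here is a minimal toy model in Python. It is not the Iceberg API: Table, Transaction, Snapshot and their methods are hypothetical names made up for illustration, and the statistics are plain dictionaries. It only shows a transaction staging an append and its statistics and publishing both as one snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    snapshot_id: int
    added_files: tuple
    statistics: dict  # stats collected by the writer while producing the files


@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def new_transaction(self):
        return Transaction(self)


class Transaction:
    """Hypothetical transaction: stages changes, publishes them together."""

    def __init__(self, table):
        self.table = table
        self.pending_files = []
        self.pending_stats = {}

    def append_files(self, *files):
        # Staged only; nothing is visible in the table yet.
        self.pending_files.extend(files)
        return self

    def update_statistics(self, stats):
        # Staged alongside the files they describe.
        self.pending_stats.update(stats)
        return self

    def commit(self):
        # One commit creates one snapshot carrying both the files and the stats.
        snapshot = Snapshot(
            snapshot_id=len(self.table.snapshots) + 1,
            added_files=tuple(self.pending_files),
            statistics=dict(self.pending_stats),
        )
        self.table.snapshots.append(snapshot)
        return snapshot


table = Table()
snapshot = (
    table.new_transaction()
    .append_files("A", "B", "C")
    .update_statistics({"A": {"rows": 10}, "B": {"rows": 20}, "C": {"rows": 5}})
    .commit()
)
```

In this model a reader never sees files A, B, C without their statistics, because both become visible in the same commit.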
The Update Statistics API requires passing a snapshot ID. @RussellSpitzer Is the snapshot ID known before the transaction commits?
Hmm, that is probably not possible, but I guess that's where we should modify the API? We do know the snapshot ID before we actually do the commit, so we should be able to just fill it in.
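A rough sketch of what "just fill it in" could look like, again as a Python toy model rather than the real Iceberg API. Table, PendingAppend, new_append and with_statistics are hypothetical names. The assumption being illustrated is that the snapshot ID is reserved when the operation is staged, so the statistics can reference it before anything is committed.

```python
import itertools


class PendingAppend:
    """Hypothetical staged append whose snapshot ID is known before commit."""

    def __init__(self, table, snapshot_id, files):
        self.table = table
        self.snapshot_id = snapshot_id
        self.files = files
        self.stats = None

    def with_statistics(self, stats):
        # The staged operation fills in the snapshot ID on the caller's behalf.
        self.stats = {"snapshot_id": self.snapshot_id, **stats}
        return self

    def commit(self):
        # Files and statistics are published under the same, pre-assigned ID.
        self.table.snapshots[self.snapshot_id] = self.files
        if self.stats is not None:
            self.table.statistics[self.snapshot_id] = self.stats
        return self.snapshot_id


class Table:
    def __init__(self):
        self.snapshots = {}   # snapshot_id -> list of files
        self.statistics = {}  # snapshot_id -> statistics
        self._ids = itertools.count(1)

    def new_append(self, files):
        # Reserve the ID at staging time, before anything is published.
        return PendingAppend(self, next(self._ids), list(files))


table = Table()
pending = table.new_append(["A", "B", "C"])
reserved_id = pending.snapshot_id
pending.with_statistics({"row_count": 35})
visible_before_commit = dict(table.snapshots)  # still empty at this point
committed_id = pending.commit()
```

The caller never passes a snapshot ID explicitly. Whether the real API should expose the reserved ID or hide it entirely is the open design question from the comments above.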
Good idea!
cc @rdblue @ajantha-bhat
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
It remains needed by Trino.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
It remains needed by Trino.
cc @alexjo2144 @findinpath
Feature Request / Improvement
Currently UpdateStatistics (org.apache.iceberg.Transaction#updateStatistics) allows adding statistics only for an existing snapshot. As a result, it is not possible to publish a snapshot with statistics already collected.
Collecting statistics for existing data is definitely an important use case (like Trino's ANALYZE), but some query engines (like Trino) can collect stats on the fly, when writing to a table (INSERT, CREATE TABLE AS ...).
It's not difficult to
however this has some drawbacks
We should make it possible to publish a data change together with new stats. This may require API changes. It may also require spec changes, if we want to use the "inherit snapshot ID" model. (Maybe we don't have to, since stats are in metadata?)
Query engine
None