apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Store min/max stats per column per partition #11083

Open sopel39 opened 1 week ago

sopel39 commented 1 week ago

Proposed Change

At the moment https://iceberg.apache.org/spec/#partition-statistics doesn't contain min/max stats per column. Because of that, engines (e.g. https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L158) need to read manifest files to compute min/max stats per column. Keeping min/max stats at the partition level would save the time spent enumerating manifest files during planning. This is especially important for highly concurrent query workloads and for large-scale tables.
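To make the planning-time cost concrete, here is a minimal sketch of the kind of aggregation an engine has to do today: fold per-file column bounds read from manifests into per-partition min/max. This is not the Trino or Iceberg code; `FileStats`, its fields, and `partition_min_max` are hypothetical names used only for illustration, and the values are assumed to be directly comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    # Hypothetical stand-in for one data-file entry read from a manifest.
    partition: str
    lower_bounds: dict  # column name -> min value in this file
    upper_bounds: dict  # column name -> max value in this file


def partition_min_max(files):
    """Fold per-file bounds into {partition: {column: (min, max)}}."""
    result = {}
    for f in files:
        cols = result.setdefault(f.partition, {})
        for col, lo in f.lower_bounds.items():
            hi = f.upper_bounds.get(col)
            if hi is None:
                # Simplification: a file without both bounds is ignored here.
                # A real reader would have to treat the column's range as unknown.
                continue
            if col in cols:
                cur_lo, cur_hi = cols[col]
                cols[col] = (min(cur_lo, lo), max(cur_hi, hi))
            else:
                cols[col] = (lo, hi)
    return result


files = [
    FileStats("2024-01-01", {"price": 5}, {"price": 40}),
    FileStats("2024-01-01", {"price": 2}, {"price": 17}),
    FileStats("2024-01-02", {"price": 9}, {"price": 11}),
]
stats = partition_min_max(files)
```

The loop touches every data-file entry, which is the work that precomputed per-partition min/max would let the planner skip.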

Proposal document

No response

Specifications

sopel39 commented 1 week ago

This probably should be part of https://github.com/apache/iceberg/issues/8450

cc @raunaqmorarka @lxynov

guykhazma commented 1 week ago

@sopel39 adding to the above, we can also store null_counts. See the more detailed discussion here. Null counts stored in the partition stats can be scaled at run time (otherwise, on-the-fly collection can be used).

lxynov commented 1 week ago

+1 on this. Min/max values are needed by the CBO (cost-based optimizer) to estimate the selectivity of range filters.
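As a rough illustration of why min/max matter here, the sketch below estimates range-filter selectivity under a uniform-distribution assumption. This is a common textbook heuristic, not necessarily what any particular engine implements; `range_selectivity` and its parameters are hypothetical names.

```python
def range_selectivity(col_min, col_max, lo=None, hi=None):
    """Estimate the fraction of rows with lo <= value <= hi.

    Assumes values are uniformly distributed between col_min and col_max.
    A bound of None means the filter is open on that side.
    """
    if col_min > col_max:
        raise ValueError("col_min must not exceed col_max")
    # Clamp the filter range to the column's known range.
    lo = col_min if lo is None else max(lo, col_min)
    hi = col_max if hi is None else min(hi, col_max)
    if lo > hi:
        return 0.0  # filter range does not overlap the column range
    if col_max == col_min:
        return 1.0  # single-valued column that the filter overlaps
    return (hi - lo) / (col_max - col_min)


half = range_selectivity(0, 100, lo=25, hi=75)
none = range_selectivity(0, 100, lo=200)
```

Without min/max, the optimizer has no range to clamp against and must fall back to a fixed guess for such filters.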