Open sopel39 opened 2 months ago
This probably should be part of https://github.com/apache/iceberg/issues/8450
cc @raunaqmorarka @lxynov
@sopel39 adding to the above, we can also store null_counts. See the more detailed discussion here. Null counts stored in the partition stats can be scaled at run time (otherwise, on-the-fly collection can be used).
+1 on this. Min/max values are needed by CBO to estimate the selectivity of range filters.
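To make the CBO point concrete, here is a minimal sketch of how a range filter's selectivity could be estimated from column min/max. It uses the textbook uniform-distribution heuristic, not Trino's actual cost model, and the function name `range_selectivity` is hypothetical.

```python
def range_selectivity(col_min, col_max, lo, hi):
    """Estimate the fraction of rows with lo <= value <= hi.

    Assumption: values are uniformly distributed between col_min and
    col_max. This is an illustrative heuristic, not any engine's real
    estimator.
    """
    if col_max < col_min:
        raise ValueError("col_max must be >= col_min")
    if col_max == col_min:
        # Single-valued column: the filter either matches everything or nothing.
        return 1.0 if lo <= col_min <= hi else 0.0
    # Width of the intersection between the filter range and the column range.
    overlap = min(hi, col_max) - max(lo, col_min)
    if overlap <= 0:
        return 0.0
    return overlap / (col_max - col_min)


# Filter covers the middle half of the column's value range.
estimate = range_selectivity(0, 100, 25, 75)
```

Without min/max, the optimizer has no value range to compare the filter against and has to fall back to a default guess.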
+1
Proposed Change
At the moment, https://iceberg.apache.org/spec/#partition-statistics doesn't contain min/max stats per column. Because of that, engines (e.g. https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L158) need to read manifest files to compute min/max stats per column. Keeping min/max stats at the partition level would save the time spent enumerating manifest files during planning. This is especially important for highly concurrent query workloads and for large tables.
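As a rough illustration of the work involved, the sketch below rolls per-file column bounds up into per-partition min/max, which is roughly what an engine has to do today by scanning manifests and what precomputed partition statistics would avoid. `FileStats` and `rollup_partition_bounds` are hypothetical names, not Iceberg or Trino APIs. Real manifests store bounds as serialized bytes keyed by field ID; plain Python values and column names are used here for simplicity, and every file is assumed to carry both bounds for every column.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical stand-in for one data-file entry read from a manifest."""
    partition: str
    lower_bounds: dict  # column name -> minimum value in this file
    upper_bounds: dict  # column name -> maximum value in this file


def rollup_partition_bounds(files):
    """Merge per-file bounds into {partition: {column: (min, max)}}.

    Assumption: each file has a lower and an upper bound for every column.
    A real implementation would have to mark a column's bounds as unknown
    when any file in the partition lacks them.
    """
    result = {}
    for f in files:
        cols = result.setdefault(f.partition, {})
        for col, lo in f.lower_bounds.items():
            hi = f.upper_bounds[col]
            if col in cols:
                cur_lo, cur_hi = cols[col]
                cols[col] = (min(cur_lo, lo), max(cur_hi, hi))
            else:
                cols[col] = (lo, hi)
    return result


files = [
    FileStats("2024-01-01", {"price": 5}, {"price": 40}),
    FileStats("2024-01-01", {"price": 2}, {"price": 30}),
    FileStats("2024-01-02", {"price": 7}, {"price": 9}),
]
bounds = rollup_partition_bounds(files)
```

The loop touches every data-file entry, so its cost grows with the number of files; with min/max stored per partition, planning would only need one lookup per partition.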
Proposal document
No response
Specifications