apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Compute column stats incrementally #11475

Closed EremenkoValentin closed 1 week ago

EremenkoValentin commented 2 weeks ago

Query engine

Iceberg API

Question

Does Iceberg support incremental statistics calculation? How can this be done for columns? How do you calculate changes between two snapshots?

Hello everyone. I want to collect column statistics without reading the whole table every time. After examining the manifest files, I found that they only store statistics (value count, null count, NaN count, upper and lower bounds) for the changes made to a partition.
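To make the idea concrete, here is a minimal sketch of the kind of incremental merge in question, assuming the per-file column metrics from the manifests have already been read into plain objects. `ColumnStats` and `merge` are hypothetical names for illustration, not part of the Iceberg API, and numeric bounds are assumed for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical container for one column's metrics (not an Iceberg class)."""
    value_count: int = 0
    null_count: int = 0
    nan_count: int = 0
    lower: Optional[float] = None  # None means "no bound recorded"
    upper: Optional[float] = None


def merge(running: ColumnStats, added: ColumnStats) -> ColumnStats:
    """Fold the metrics of newly added data into the running table-level stats."""
    lowers = [b for b in (running.lower, added.lower) if b is not None]
    uppers = [b for b in (running.upper, added.upper) if b is not None]
    return ColumnStats(
        value_count=running.value_count + added.value_count,
        null_count=running.null_count + added.null_count,
        nan_count=running.nan_count + added.nan_count,
        lower=min(lowers) if lowers else None,
        upper=max(uppers) if uppers else None,
    )


# Stats accumulated up to the previous snapshot (made-up numbers).
table_stats = ColumnStats(value_count=100, null_count=5, nan_count=0, lower=1.0, upper=50.0)

# Metrics of the files added by the new snapshot (made-up numbers).
added_files = [
    ColumnStats(value_count=10, null_count=1, nan_count=0, lower=-3.0, upper=20.0),
    ColumnStats(value_count=20, null_count=0, nan_count=2, lower=7.0, upper=90.0),
]

for file_stats in added_files:
    table_stats = merge(table_stats, file_stats)
```

This only works as written for append-only changes. When files are removed, the counts can be subtracted, but the bounds cannot be recovered without looking at the remaining files, and NDV is not additive at all, which is why it needs a mergeable sketch rather than a plain number.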

As far as I understand, Puffin files allow storing NDV, but I couldn’t find information on how to use them. Can someone provide guidance or a link to documentation that contains the answers? Thanks all.

RussellSpitzer commented 1 week ago

Not yet