Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0
472 stars 59 forks source link

Snapshot-Level Metrics and Statistics #102

Open omervk opened 5 years ago

omervk commented 5 years ago

Assume a table with the following field:

id int

There are n data files, each of which has the statistics min(id) and max(id). ids are positive integers.

Querying by id < 0 would require an O(n) run on all data files in the manifest, querying whether min(id) < 0 < max(id).

Aggregating metrics and/or statistics to the Snapshot level would reduce such scans from O(n) (n being the number of data files) to O(1).