
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Track uncompressed data size for column metrics #9966

Open ZacBlanco opened 6 months ago

ZacBlanco commented 6 months ago

Feature Request / Improvement

I have been working on the optimizer in Presto and was auditing the statistics returned by Iceberg for column sizes. Our table statistics implementation uses manifest file entries to calculate the min, max, null count, data size, etc. When I tested it, I found that no matter how we sum the data sizes from the Iceberg manifests, they come out 3-4x lower than expected. Below are some queries against a TPC-H table showing the sizes from our SHOW STATS statement and from sum_data_size_for_stats, an aggregation function used when we call ANALYZE on tables (a rough sketch of the manifest summation follows the output).

presto:tpch> SHOW STATS FOR (select comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  167815.0 | NULL                  |            0.0 | NULL      | NULL      | NULL
 NULL        | NULL      | NULL                  | NULL           |   15000.0 | NULL      | NULL
(2 rows)
presto:tpch> select sum_data_size_for_stats(comment) from orders;
 _col0
--------
 727364
(1 row)
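For context, the data_size above comes from summing the per-column sizes recorded in each data file's manifest entry. It is roughly equivalent to the following sketch against Iceberg's Java API (not Presto's actual code; the Table handle and column name are placeholders):

```java
import java.io.IOException;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

// Sum the bytes recorded for one column across all live data files.
// DataFile.columnSizes() is populated from the file format's metadata,
// so today this reflects the on-disk (compressed) size.
static long sumColumnSize(Table table, String columnName) throws IOException {
  int fieldId = table.schema().findField(columnName).fieldId();
  long totalBytes = 0L;
  try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
    for (FileScanTask task : tasks) {
      DataFile file = task.file();
      Long bytes = file.columnSizes() == null ? null : file.columnSizes().get(fieldId);
      if (bytes != null) {
        totalBytes += bytes;
      }
    }
  }
  return totalBytes;
}
```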

The data size reported by the Iceberg metadata severely underestimates the true value. After reading the spec more closely, I realized that Iceberg records the on-disk size of the data, i.e. after compression. In my opinion, we should also be able to store the "true" (uncompressed) data size in the manifest.

Most analytics systems (to my knowledge) won't spend CPU time compressing data in flight, because it is usually more time-efficient to ship the data directly over the wire while the query is executing. In Presto, at least, the SQL query optimizer takes data sizes into account when determining the physical plan, for example when deciding whether data should be exchanged between nodes or which physical join implementation to use. I think the uncompressed size would be a more useful value to store in the data file manifest for the query optimizer.

I tested swapping the current Parquet metrics implementation to use the Parquet footer's getTotalUncompressedSize method and found that it gives much more accurate results, at least for Presto (see the sketch after the output below).

presto:tpch> SHOW STATS FOR (select comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  745598.0 | NULL                  |            0.0 | NULL      | NULL      | NULL
 NULL        | NULL      | NULL                  | NULL           |   15000.0 | NULL      | NULL
(2 rows)

------

presto:tpch> select sum_data_size_for_stats(comment) from orders;
 _col0
--------
 727364
(1 row)
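For reference, the Parquet footer already carries both values per column chunk, so the difference is easy to inspect directly. Here is a sketch using parquet-mr (the file path is a placeholder) that compares getTotalSize, which is what the manifest metrics reflect today, against getTotalUncompressedSize:

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Print compressed vs. uncompressed bytes per column for a single Parquet file.
static void printColumnSizes(String path) throws IOException {
  Configuration conf = new Configuration();
  try (ParquetFileReader reader =
      ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path), conf))) {
    ParquetMetadata footer = reader.getFooter();
    Map<String, long[]> sizes = new TreeMap<>(); // column -> {compressed, uncompressed}
    for (BlockMetaData block : footer.getBlocks()) {
      for (ColumnChunkMetaData chunk : block.getColumns()) {
        long[] s = sizes.computeIfAbsent(chunk.getPath().toDotString(), k -> new long[2]);
        s[0] += chunk.getTotalSize();             // on-disk size, what the metrics use today
        s[1] += chunk.getTotalUncompressedSize(); // what this issue proposes to track
      }
    }
    sizes.forEach((col, s) ->
        System.out.printf("%s: compressed=%d uncompressed=%d%n", col, s[0], s[1]));
  }
}
```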

Changing this would probably require a change to the spec. Also, while Parquet has footer support for the uncompressed size, I'm not sure whether other file formats such as ORC do. I am curious what the community thinks about the use cases for tracking compressed versus uncompressed sizes.

Query engine

Presto

ZacBlanco commented 6 months ago

This is related to #8274. We could achieve correct average-size calculations by summing all data sizes and row counts from the manifests and dividing the sum of bytes by the sum of rows.
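Extending the earlier sketch with record counts, that average could be computed purely from manifest metadata, along the lines of:

```java
// Average bytes per row for one column, derived only from manifest metadata.
// Reuses the imports and placeholders from the earlier sketch.
static double averageColumnSize(Table table, String columnName) throws IOException {
  int fieldId = table.schema().findField(columnName).fieldId();
  long totalBytes = 0L;
  long totalRows = 0L;
  try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
    for (FileScanTask task : tasks) {
      DataFile file = task.file();
      Long bytes = file.columnSizes() == null ? null : file.columnSizes().get(fieldId);
      if (bytes != null) {
        totalBytes += bytes;
        totalRows += file.recordCount();
      }
    }
  }
  return totalRows == 0 ? 0.0 : (double) totalBytes / totalRows;
}
```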