apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
5.86k stars 2.06k forks source link

Spark procedure to compute partition stats. #10106

Open ajantha-bhat opened 2 months ago

ajantha-bhat commented 2 months ago

Feature Request / Improvement

Based on the experiments from https://github.com/apache/iceberg/pull/9437, spark action is not effective as the serialization cost of each partition stats entry is expensive. Need a table API in the core module to compute stats in a distributed way.

But we still need a SQL way to compute the partition stats. Hence, we will be calling the core API via SQL call procedure.

Query engine

None

ShyamalaGowri commented 2 months ago

Hi @ajantha-bhat this is a widely required feature as it greatly affects the performance when spark executes queries on large scale data. Do we have any work in progress to support this feature ?

ajantha-bhat commented 2 months ago

Hi @ShyamalaGowri: The progress of the feature can be tracked from https://github.com/apache/iceberg/issues/8450.

Spark query planner has to adopt these stats. I am not sure if the spark CBO is matured enough to use this. Dremio, Trino engines (which extensively uses CBO) will definitely make use of this feature.