Open ajantha-bhat opened 2 months ago
Hi @ajantha-bhat this is a widely required feature as it greatly affects the performance when spark executes queries on large scale data. Do we have any work in progress to support this feature ?
Hi @ShyamalaGowri: The progress of the feature can be tracked from https://github.com/apache/iceberg/issues/8450.
Spark query planner has to adopt these stats. I am not sure if the spark CBO is matured enough to use this. Dremio, Trino engines (which extensively uses CBO) will definitely make use of this feature.
Feature Request / Improvement
Based on the experiments from https://github.com/apache/iceberg/pull/9437, spark action is not effective as the serialization cost of each partition stats entry is expensive. Need a table API in the core module to compute stats in a distributed way.
But we still need a SQL way to compute the partition stats. Hence, we will be calling the core API via SQL call procedure.
Query engine
None