ajantha-bhat opened 7 months ago
Hi @ajantha-bhat, this is a widely requested feature, as it greatly affects performance when Spark executes queries on large-scale data. Is there any work in progress to support this feature?
Hi @ShyamalaGowri: The progress of this feature can be tracked at https://github.com/apache/iceberg/issues/8450.
The Spark query planner has to adopt these stats, and I am not sure the Spark CBO is mature enough to use them. Engines like Dremio and Trino (which use CBO extensively) will definitely make use of this feature.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Feature Request / Improvement
Based on the experiments in https://github.com/apache/iceberg/pull/9437, the Spark action is not effective because the serialization cost of each partition stats entry is expensive. We need a table API in the core module to compute the stats in a distributed way.
But we still need a SQL way to compute partition stats, so we will call the core API via a SQL call procedure.
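As a sketch of what the SQL entry point could look like, the call below follows the syntax of Iceberg's existing Spark system procedures (e.g. `rewrite_data_files`). The procedure name `compute_partition_stats` and its arguments are hypothetical, since defining that procedure is exactly what this issue proposes:

```sql
-- Hypothetical call procedure, modeled on Iceberg's existing
-- system procedures; the name compute_partition_stats and the
-- argument list are assumptions, not a finalized API.
CALL my_catalog.system.compute_partition_stats(
  table => 'db.sample'
);
```

Internally, such a procedure would delegate to the core-module table API mentioned above, so that engines without Spark actions can trigger the same distributed stats computation.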
Query engine
None