Spark procedure to compute partition stats.

apache / iceberg

Apache Iceberg

https://iceberg.apache.org/

Apache License 2.0

Spark procedure to compute partition stats. #10106

Open ajantha-bhat opened 7 months ago

ajantha-bhat commented 7 months ago

Feature Request / Improvement

Based on the experiments in https://github.com/apache/iceberg/pull/9437, a Spark action is not effective because serializing each partition stats entry is expensive. We need a table API in the core module to compute the stats in a distributed way.
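To make the computation concrete, here is a minimal, engine-agnostic sketch of the aggregation that computing partition stats implies: data file entries are grouped by partition and their counts and sizes are summed. The `DataFile` and `PartitionStats` types, their field names, and `compute_partition_stats` are hypothetical and chosen only for illustration; this is not the core API proposed above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data file entry in a manifest."""
    partition: Tuple[str, ...]
    record_count: int
    size_bytes: int


@dataclass
class PartitionStats:
    """Hypothetical per-partition totals (field names are assumptions)."""
    record_count: int = 0
    file_count: int = 0
    total_size_bytes: int = 0


def compute_partition_stats(files: Iterable[DataFile]) -> Dict[Tuple[str, ...], PartitionStats]:
    # Group file entries by partition value and accumulate the totals.
    stats: Dict[Tuple[str, ...], PartitionStats] = defaultdict(PartitionStats)
    for f in files:
        entry = stats[f.partition]
        entry.record_count += f.record_count
        entry.file_count += 1
        entry.total_size_bytes += f.size_bytes
    return dict(stats)


files = [
    DataFile(("2024-01-01",), record_count=10, size_bytes=100),
    DataFile(("2024-01-01",), record_count=5, size_bytes=50),
    DataFile(("2024-01-02",), record_count=7, size_bytes=70),
]
result = compute_partition_stats(files)
```

Because each total is a plain sum, partial results computed over subsets of the manifests can be merged by adding the fields, which is what makes a distributed computation possible in principle.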

However, we still need a way to compute the partition stats from SQL. Hence, the core API will be invoked through a Spark SQL call procedure.
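As a rough illustration of what such a call could look like, the sketch below builds a statement in the `CALL <catalog>.system.<procedure>(arg => value)` form that Iceberg's existing Spark procedures use. The procedure name `compute_partition_stats`, the helper `build_compute_stats_call`, and the catalog and table names are assumptions for illustration, not something this issue defines.

```python
def build_compute_stats_call(
    catalog: str,
    table: str,
    procedure: str = "compute_partition_stats",  # hypothetical procedure name
) -> str:
    """Build a Spark SQL CALL statement for a stats procedure (illustrative only)."""
    # Escape single quotes so the table identifier stays inside the SQL string literal.
    safe_table = table.replace("'", "''")
    return f"CALL {catalog}.system.{procedure}(table => '{safe_table}')"


statement = build_compute_stats_call("my_catalog", "db.events")

# With a SparkSession that has an Iceberg catalog configured (not set up here),
# the statement would be submitted as:
#   spark.sql(statement)
```

Going through a procedure keeps the heavy lifting in the core API while still giving SQL users a single statement to run.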

Query engine

None

ShyamalaGowri commented 7 months ago

Hi @ajantha-bhat, this is a widely needed feature, as it greatly affects performance when Spark executes queries on large-scale data. Is there any work in progress to support it?

ajantha-bhat commented 7 months ago

Hi @ShyamalaGowri, progress on this feature can be tracked in https://github.com/apache/iceberg/issues/8450.

The Spark query planner has to adopt these stats. I am not sure whether Spark's CBO is mature enough to use them. Engines such as Dremio and Trino, which use CBO extensively, will definitely make use of this feature.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.