delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Question][Uniform] Missing Column stats in Manifest File generated in Iceberg when Uniform is enabled #2258

Closed munendrasn closed 4 months ago

munendrasn commented 8 months ago

Question

Note: I'm not sure whether this is a bug or a feature gap, hence raising it as a question.

While trying out UniForm, I found that the manifest file created for Iceberg doesn't contain column stats such as lower_bound, upper_bound, and null_counts. This affects query latency, since column stats are used for file pruning. Is converting the stats from the Delta files and adding them to the Iceberg manifest file in the works?
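
To illustrate why the missing bounds matter, here is a minimal sketch of min/max file pruning. It is not Iceberg's actual reader code: `ManifestEntry` and `files_to_scan` are hypothetical names, and bounds are keyed by column name as plain integers for readability (real Iceberg manifests key them by field ID and store them in serialized form).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ManifestEntry:
    """Hypothetical, simplified stand-in for one data-file entry in a manifest."""
    file_path: str
    # Per-column bounds; empty dicts model a manifest written without stats.
    lower_bounds: Dict[str, int] = field(default_factory=dict)
    upper_bounds: Dict[str, int] = field(default_factory=dict)


def files_to_scan(entries: List[ManifestEntry], column: str, value: int) -> List[str]:
    """Return paths of files that may contain rows where column == value."""
    selected = []
    for entry in entries:
        lo = entry.lower_bounds.get(column)
        hi = entry.upper_bounds.get(column)
        if lo is None or hi is None:
            # No stats for this column: the file cannot be skipped safely.
            selected.append(entry.file_path)
        elif lo <= value <= hi:
            # The value falls inside the file's range, so it must be read.
            selected.append(entry.file_path)
    return selected


with_stats = [
    ManifestEntry("part-0.parquet", {"id": 0}, {"id": 99}),
    ManifestEntry("part-1.parquet", {"id": 100}, {"id": 199}),
    ManifestEntry("part-2.parquet", {"id": 200}, {"id": 299}),
]
# Same files, but with the stats dropped, as in the manifests described above.
without_stats = [ManifestEntry(e.file_path) for e in with_stats]

print(files_to_scan(with_stats, "id", 150))
print(files_to_scan(without_stats, "id", 150))
```

With bounds present, a lookup for `id = 150` only needs one of the three files; without them, every file has to be scanned.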

vkorukanti commented 8 months ago

@lzlfred @harperjiang @LukasRupprecht Could you please take a look at this?

scottsand-db commented 8 months ago

Hi @munendrasn, thanks for your question. In order to parse the stats for UniForm and Iceberg, we needed to make a change to Apache Spark (https://github.com/apache/spark/pull/42083). This was merged into master back in August, but we can only use it once it is actually released, in Apache Spark 3.6 or 4.0.
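
As a rough illustration of what "parsing the stats" involves, the sketch below maps the per-file `stats` JSON from a Delta log `add` action (`numRecords`, `minValues`, `maxValues`, `nullCount`) onto Iceberg-style metric maps keyed by field ID. This is not UniForm's implementation: the helper `delta_stats_to_iceberg_metrics` and the `field_ids` mapping are assumptions made for the example, only top-level columns are handled, and values are left as plain Python objects rather than Iceberg's serialized bounds.

```python
import json
from typing import Any, Dict


def delta_stats_to_iceberg_metrics(stats_json: str, field_ids: Dict[str, int]) -> Dict[str, Any]:
    """Convert a Delta per-file stats JSON string into Iceberg-style metric maps.

    field_ids is a hypothetical column-name -> Iceberg field-ID mapping.
    """
    stats = json.loads(stats_json)

    def by_field_id(values: Dict[str, Any]) -> Dict[int, Any]:
        # Re-key column names to field IDs, dropping columns without a known ID.
        return {field_ids[name]: v for name, v in values.items() if name in field_ids}

    return {
        "record_count": stats.get("numRecords"),
        "lower_bounds": by_field_id(stats.get("minValues", {})),
        "upper_bounds": by_field_id(stats.get("maxValues", {})),
        "null_value_counts": by_field_id(stats.get("nullCount", {})),
    }


# Example stats string in the shape Delta writes for a file with one column.
example_stats = '{"numRecords": 3, "minValues": {"id": 1}, "maxValues": {"id": 3}, "nullCount": {"id": 0}}'
metrics = delta_stats_to_iceberg_metrics(example_stats, {"id": 1})
print(metrics)
```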

munendrasn commented 4 months ago

Closing the issue, as stats support has been added in this commit: https://github.com/delta-io/delta/commit/3b3d729e931772339d58d200ef130d05cd39466d