apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Spark: Derive Stats From Manifest on the Fly #11615

Open saitharun15 opened 4 days ago

saitharun15 commented 4 days ago

This PR derives min, max, and numOfNulls statistics on the fly from manifest files and reports them back to Spark.

Currently only NDV is calculated and reported back to the Spark engine, which leads to inaccurate plans on the Spark side because min, max, and nullCount are returned as NULL.
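
To illustrate the idea, here is a minimal sketch of folding per-file column metrics into table-level min, max, and null count. It is not the code in this PR: `FileColumnMetrics`, `ColumnStats`, and `derive_column_stats` are hypothetical names, and the bounds are assumed to be already decoded into comparable integers (in real manifests they are serialized, and string bounds may be truncated, so the result is a bound rather than an exact value).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class FileColumnMetrics:
    """Hypothetical stand-in for the per-file metrics of one manifest entry."""
    lower_bounds: Dict[int, int]       # field id -> decoded lower bound
    upper_bounds: Dict[int, int]       # field id -> decoded upper bound
    null_value_counts: Dict[int, int]  # field id -> number of nulls


@dataclass
class ColumnStats:
    min_value: Optional[int] = None
    max_value: Optional[int] = None
    null_count: Optional[int] = 0      # None means "unknown"


def derive_column_stats(files: Iterable[FileColumnMetrics], field_id: int) -> ColumnStats:
    """Fold per-file metrics for one column into a single ColumnStats."""
    stats = ColumnStats()
    for f in files:
        lo = f.lower_bounds.get(field_id)
        hi = f.upper_bounds.get(field_id)
        # Assumption: a missing bound means the file holds no non-null values
        # for this column, so it is skipped rather than invalidating the result.
        if lo is not None:
            stats.min_value = lo if stats.min_value is None else min(stats.min_value, lo)
        if hi is not None:
            stats.max_value = hi if stats.max_value is None else max(stats.max_value, hi)
        nulls = f.null_value_counts.get(field_id)
        if nulls is None:
            # One file without a null count makes the total unknown.
            stats.null_count = None
        elif stats.null_count is not None:
            stats.null_count += nulls
    return stats
```

For example, two files with bounds `[5, 20]` and `[-3, 10]` for the same column would fold into a min of `-3` and a max of `20`, with the null counts summed.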

There is an ongoing discussion about whether to store stats at the partition level or the table level. Whichever way they are calculated, there would still be an issue, as described in this comment in discussion #10791.

These changes enable on-the-fly collection of the stats through a table property or a session conf (disabled by default).

cc @guykhazma @jeesou

saitharun15 commented 4 days ago

Hi @huaxingao @karuppayya @aokolnychyi @RussellSpitzer, can you help review this PR?

saitharun15 commented 3 days ago

@RussellSpitzer, thanks for the review comments, I will address them soon. As per @huaxingao's implementation here, aggregate pushdown is skipped when row-level deletes are detected; I have applied a similar change here as well.
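
A rough sketch of that guard, for illustration only: manifest metrics describe data files as they were written, so rows removed by delete files may still be reflected in the bounds and null counts. `ScanTask` and `can_derive_stats_from_manifests` are hypothetical names, not the ones used in this PR or in the aggregate pushdown code.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ScanTask:
    """Hypothetical stand-in for one planned data file and its delete files."""
    data_file: str
    delete_files: List[str] = field(default_factory=list)


def can_derive_stats_from_manifests(tasks: Iterable[ScanTask]) -> bool:
    """Return True only when no planned file has row-level deletes attached."""
    return all(len(task.delete_files) == 0 for task in tasks)
```

When the check returns `False`, the caller would fall back to reporting no min, max, or null count instead of values that may no longer match the live rows.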