delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Bug] Output parsed stats for delta lake tables #2000

Open sclmn opened 1 year ago

sclmn commented 1 year ago

Bug

Currently, if delta.checkpoint.writeStatsAsStruct is set to true, the checkpoint output contains parsed partition values but does not include parsed stats.

As far as I can tell, the code currently writes only the parsed partition values; there is no support for parsed stats.

Would it be possible to add stats_parsed?

Motivation

The protocol states:

stats_parsed: The stats can be stored in their original format. This field needs to be written when statistics are available and the table property: delta.checkpoint.writeStatsAsStruct is set to true. When this property is set to false (which is the default), this field should be omitted from the checkpoint.
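To make the expected behaviour concrete, here is a minimal Python sketch of what the protocol text above implies for a single add action. It is not the delta-spark implementation: the helper name checkpoint_add_row and the simplified dictionary row are hypothetical, and the sample path and statistics are made up for illustration. The only assumption about the data is that stats is stored as a JSON string, as in the per-file statistics format.

```python
import json
from typing import Any, Dict


def checkpoint_add_row(
    add_action: Dict[str, Any], write_stats_as_struct: bool = False
) -> Dict[str, Any]:
    """Build a simplified checkpoint row for an `add` action.

    Hypothetical helper for illustration only; real checkpoints are
    Parquet files with a typed schema, not Python dictionaries.
    """
    row = dict(add_action)  # copy path, partitionValues, stats, ...
    stats_json = add_action.get("stats")
    if write_stats_as_struct and stats_json is not None:
        # Same statistics as `stats`, but as a nested struct instead of
        # a JSON string, written only when the property is true.
        row["stats_parsed"] = json.loads(stats_json)
    return row


# Made-up add action with statistics stored as a JSON string.
add = {
    "path": "part-00000.snappy.parquet",
    "partitionValues": {"date": "2023-01-01"},
    "stats": json.dumps(
        {
            "numRecords": 3,
            "minValues": {"id": 1},
            "maxValues": {"id": 3},
            "nullCount": {"id": 0},
        }
    ),
}

with_struct = checkpoint_add_row(add, write_stats_as_struct=True)
without_struct = checkpoint_add_row(add)  # default: property is false

print(with_struct["stats_parsed"]["numRecords"])  # 3
print("stats_parsed" in without_struct)           # False
```

With the property set to true, the row carries both the original stats string and the parsed struct; with the default of false, stats_parsed is omitted, which matches the wording quoted from the protocol.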

scottsand-db commented 1 year ago

@sclmn are you saying that when delta.checkpoint.writeStatsAsStruct is true, delta-spark is not writing out the stats_parsed field in the delta checkpoint? That seems like a bug. Thanks for pointing this out!

scottsand-db commented 1 year ago

@prakharjain09 can you take a look?

sclmn commented 1 year ago

Hi, I just wanted to check whether there is any update on this.

prakharjain09 commented 1 year ago

I checked this, and it does seem to be a bug.

felipepessoto commented 2 months ago

@scottsand-db, @prakharjain09, @sclmn, do you have any updates on this?

felipepessoto commented 2 weeks ago

Related: #1719

Tom-Newton commented 1 day ago

+1 from me. I think writing stats_parsed would be very useful.