delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[PROTOCOL] Per-file statistics documentation in protocol is ambiguous #3287

Open TaylorHodan opened 5 months ago

TaylorHodan commented 5 months ago

Bug

Describe the problem

The Per-file Statistics section of the protocol specification is ambiguous: it does not say which statistics are available for columns of specific data types, or how those statistics are formatted. For instance, it does not specify whether min and max statistics should be provided for columns of array type, so whether they appear seems to be at the discretion of the engine being used. Using DBR 15.2 with Spark 3.5.0 and Scala 2.12, no max or min statistics are written for arrays (nested or otherwise), only the nullCount. Further, with the same settings, nested arrays get only the nullCount of the nested array itself, not of any of its fields. The _delta_log was generated by creating a Spark DataFrame and then calling df.write.format("delta").mode("append").save(storagePathway).
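To make the observation above easier to reproduce, here is a minimal Python sketch that reports which kinds of statistics each column carries in a commit file. It relies only on the fact that the `stats` field of an `add` action is a JSON-encoded string with `minValues`, `maxValues`, and `nullCount` maps. The helper name `stats_coverage`, the file name, and the sample values are hypothetical and are not taken from a real table.

```python
import json


def stats_coverage(commit_lines):
    """Map each column path to the set of statistic kinds recorded for it."""
    coverage = {}

    def walk(kind, node, prefix=""):
        # Struct columns appear as nested JSON objects, so recurse into them.
        for name, value in node.items():
            path = f"{prefix}.{name}" if prefix else name
            if isinstance(value, dict):
                walk(kind, value, path)
            else:
                coverage.setdefault(path, set()).add(kind)

    for line in commit_lines:
        action = json.loads(line)
        add = action.get("add")
        if not add or not add.get("stats"):
            continue  # not an add action, or no statistics were collected
        stats = json.loads(add["stats"])  # stats is itself a JSON string
        for kind in ("minValues", "maxValues", "nullCount"):
            walk(kind, stats.get(kind, {}))
    return coverage


# Hypothetical commit line: "id" is an integer column, "tags" an array column.
sample_line = json.dumps({
    "add": {
        "path": "part-00000.snappy.parquet",
        "stats": json.dumps({
            "numRecords": 3,
            "minValues": {"id": 1},
            "maxValues": {"id": 3},
            "nullCount": {"id": 0, "tags": 0},
        }),
    }
})

coverage = stats_coverage([sample_line])
for column, kinds in sorted(coverage.items()):
    print(column, sorted(kinds))
```

Against a real table, the lines of a file under `_delta_log` (one JSON action per line) could be passed in place of `sample_line`. Column names that themselves contain dots would make the joined path ambiguous, which this sketch does not handle.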

Moreover, the format of min and max statistics for the DateTime type also seems to be left to the discretion of the engine: for example, whether HH:MM:SS is included, or whether the values are truncated to their "short" form of YYYY-MM-DD. Using DBR 15.2 with Spark 3.5.0 and Scala 2.12, the statistics in the _delta_log were truncated to YYYY-MM-DD.
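One way to check which form an engine wrote is to classify the raw min and max strings. The sketch below is illustrative only: the two regular expressions and the function name `classify_temporal_stat` are assumptions made for this example, not formats defined by the protocol, and the sample values are made up.

```python
import re

# Assumed shapes for illustration; the protocol text does not define these.
DATE_ONLY = re.compile(r"^\d{4}-\d{2}-\d{2}$")
DATE_TIME = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")


def classify_temporal_stat(value):
    """Label a min/max statistic as date-only, date-time, or something else."""
    if not isinstance(value, str):
        return "not-a-string"
    if DATE_ONLY.match(value):
        return "date-only"
    if DATE_TIME.match(value):
        return "date-time"
    return "other"


# Hypothetical statistic values, one in each form.
for sample in ("2024-06-01", "2024-06-01T12:30:45.000Z", 42):
    print(repr(sample), "->", classify_temporal_stat(sample))
```

Running this over the `minValues` and `maxValues` entries of tables written by different engines would show whether they agree on one form.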

The only example in this section of the protocol specification seems to show how the min and max values in stats would be formatted, with no further details on how these stats should be formatted for each data type.

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

vkorukanti commented 5 months ago

Thanks for reporting this. Some of this information is in the Delta docs and in the code around the configs that influence stats collection, but some is still missing. We will add it to the spec soon.

LukasRupprecht commented 5 months ago

@vkorukanti I would be interested in covering this. If that's ok, please assign the ticket to me. Thanks!