delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[BUG] commitInfo field formats are not treated as optional #1548

Open rtyler opened 1 year ago

rtyler commented 1 year ago

Bug

Describe the problem

Based on this comment from delta-io/delta-rs#1017, it appears that the delta-spark implementation does not adhere to the protocol document with regard to the commitInfo field:

Implementations are free to store any valid JSON-formatted data via the commitInfo action.

Steps to reproduce

See linked comment above

Observed results

Expected results

I would expect a commitInfo containing any valid JSON data to be read correctly.
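To make the expectation concrete, here is a minimal sketch of a tolerant reader, assuming each line of a commit file is a single JSON action object. The helper name `read_commit_info` and the sample log lines are hypothetical and only for illustration; this is not how delta-spark or delta-rs actually parse the log.

```python
import json
from typing import Any, Optional


def read_commit_info(log_line: str) -> Optional[Any]:
    """Return the commitInfo payload of one log line, or None if absent or null.

    No schema is imposed on the payload, so every field inside it is
    effectively optional. (Hypothetical helper, for illustration only.)
    """
    action = json.loads(log_line)
    if not isinstance(action, dict):
        return None
    return action.get("commitInfo")


# Illustrative log lines, not taken from a real table.
lines = [
    '{"commitInfo": {"timestamp": 1670000000000, "operation": "WRITE"}}',
    '{"commitInfo": {"clientVersion": "example-writer", "custom": {"a": [1, 2]}}}',
    '{"add": {"path": "part-0000.parquet", "size": 1, "dataChange": true}}',
]

# One entry per line: a dict for commitInfo actions, None for other actions.
infos = [read_commit_info(line) for line in lines]
```

Because the payload is kept as a plain dictionary, a writer that omits fields or adds its own does not cause the read to fail.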

Further details

Environment information

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

scottsand-db commented 1 year ago

cc @ryan-johnson-databricks

ryan-johnson-databricks commented 1 year ago

AFAICT this is a bug. The Delta spec is very clear on this point:

implementations are free to store any valid JSON-formatted data via the commitInfo action.

ryan-johnson-databricks commented 1 year ago

That said, we probably need to update the Delta spec to be more clear: The commit info, if present, should be a JSON object (rather than a primitive or an array). Today it just says "any valid JSON-formatted data" which is ambiguous.

That way, parsers like Spark know what to expect, since the six JSON data types each correspond to a different Spark type, and the Spark read schema must choose just one because they are mutually incompatible:

| JSON type | Spark type  |
|-----------|-------------|
| string    | StringType  |
| number    | DoubleType  |
| object    | MapType     |
| array     | ArrayType   |
| boolean   | BooleanType |
| null      | NullType    |
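A minimal sketch of the proposed rule, assuming a reader that first parses the action as generic JSON and then checks that `commitInfo` is an object. The function names are hypothetical, and the Spark type names are plain strings copied from the table above rather than imports from Spark.

```python
import json
from typing import Any

# Mapping copied from the table above; the values are labels, not Spark classes.
SPARK_TYPE_FOR_JSON_TYPE = {
    "string": "StringType",
    "number": "DoubleType",
    "object": "MapType",
    "array": "ArrayType",
    "boolean": "BooleanType",
    "null": "NullType",
}


def json_type_of(value: Any) -> str:
    """Name the JSON data type of a value produced by json.loads."""
    if value is None:
        return "null"
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def commit_info_is_object(log_line: str) -> bool:
    """True only if the line carries a commitInfo whose value is a JSON object."""
    action = json.loads(log_line)
    return (
        isinstance(action, dict)
        and json_type_of(action.get("commitInfo")) == "object"
    )
```

With such a check, a reader can commit to a single read schema for `commitInfo` and reject or skip payloads of any other JSON type.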