rtyler opened this issue 1 year ago
cc @ryan-johnson-databricks
AFAICT this is a bug. The Delta spec is very clear on this point:
> implementations are free to store any valid JSON-formatted data via the commitInfo action.
That said, we probably need to update the Delta spec to be more clear: The commit info, if present, should be a JSON object (rather than a primitive or an array). Today it just says "any valid JSON-formatted data" which is ambiguous.
That way, parsers like Spark know what to expect, since the six JSON data types all correspond to different Spark types, and the Spark read schema must choose just one because they are mutually incompatible:
| JSON type | Spark type |
|---|---|
| string | StringType |
| number | DoubleType |
| object | MapType |
| array | ArrayType |
| boolean | BooleanType |
| null | NullType |
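To illustrate the incompatibility, here is a minimal Python sketch (the function name and log entries are hypothetical, not taken from delta-spark) of a reader that fixes one schema for `commitInfo`, expecting a JSON object. Both inputs are "valid JSON-formatted data" under the current spec wording, but only the object parses:

```python
import json

# Two hypothetical _delta_log entries: both contain valid JSON under
# the current spec wording, but only one stores commitInfo as an object.
entry_object = '{"commitInfo": {"operation": "WRITE", "engineInfo": "delta-rs"}}'
entry_string = '{"commitInfo": "written by delta-rs"}'

def read_commit_info(line: str) -> dict:
    """Sketch of a reader that, like a fixed Spark read schema (MapType),
    accepts only a JSON object for commitInfo and rejects other JSON types."""
    commit_info = json.loads(line)["commitInfo"]
    if not isinstance(commit_info, dict):
        raise TypeError(f"expected JSON object, got {type(commit_info).__name__}")
    return commit_info

print(read_commit_info(entry_object))  # the object form parses fine
try:
    read_commit_info(entry_string)     # valid JSON, but a string primitive
except TypeError as e:
    print("rejected:", e)
```

A reader committed to any one row in the table above will necessarily reject the other five JSON types, which is why tightening the spec to "must be a JSON object" removes the ambiguity.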
Bug

#### Describe the problem
Based on this comment from delta-io/delta-rs#1017 it would appear that the delta-spark implementation does not adhere to the protocol document with regard to the `commitInfo` field.

#### Steps to reproduce
See linked comment above
#### Observed results

#### Expected results
I would expect `commitInfo` containing any valid JSON data to be read correctly.

#### Further details
#### Environment information

#### Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?