Hi @mattslack-db thanks for reporting this issue. Does a specification exist for the expected format of operationMetrics? Testing integration with Spark can be a bit annoying, so we would test against that instead.
@Blajda this is part of the commitInfo, which is free-format according to the protocol. It's what we discussed in the call last week.
@mattslack-db we are actually following the protocol, but spark-delta is not, since it assumes a dtype. We are looking at aligning this and getting some of these assumptions about the expected schema into the protocol.
@ion-elgreco agreed. According to the protocol, "Implementations are free to store any valid JSON-formatted data via the commitInfo action.", so although Databricks can make assumptions about the schema of commitInfo (in this case, that operationMetrics is a map<string, string>), it shouldn't fall over if that schema is not followed.
The commit info, or rather having arbitrarily nested data in the actions, is quite a nuisance. So while the protocol gives us this freedom (for now), I believe delta-rs should also avoid it and just dump JSON payloads as strings where applicable.
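For illustration, a minimal Python sketch of what "dump JSON payloads as strings" could look like: every non-string value in the metrics map is serialized with json.dumps, so operationMetrics ends up as a map<string, string>. The nested_metrics dict below is just sample data copied from the example in this issue, not delta-rs's actual writer code.

```python
import json

# Sample nested metrics as delta-rs currently writes them (taken from this issue).
nested_metrics = {
    "filesAdded": {"avg": 50424.6, "max": 50517, "min": 50335, "totalFiles": 5, "totalSize": 252123},
    "numFilesAdded": 5,
    "preserveInsertionOrder": True,
}

# Serialize every non-string value to a JSON string so operationMetrics becomes
# map<string, string>, which is what delta-spark / Databricks assume.
stringified_metrics = {
    key: value if isinstance(value, str) else json.dumps(value)
    for key, value in nested_metrics.items()
}

print(stringified_metrics)
# {'filesAdded': '{"avg": 50424.6, ...}', 'numFilesAdded': '5', 'preserveInsertionOrder': 'true'}
```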
I've been noticing something similar. After I run a compact() command over my Delta table stored in S3, I'm unable to run a Glue crawler on it to produce the metadata catalog (it gives an Internal Service Exception without any further debug info).
At first I thought it was a transient error with AWS, but now it seems to be associated with this bug.
Do we have any plans to solve this? Also, would using the Apache Spark OPTIMIZE avoid it?
The same thing happens with filesRemoved.
I even verified my last comment by manually editing the fields filesAdded and filesRemoved to be strings instead of JSON objects. AWS Glue was then able to crawl the table successfully.
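For anyone who needs an interim workaround, here is a rough Python sketch of that manual edit: it rewrites the operationMetrics values in the _delta_log commit files as JSON strings. The function name and path are hypothetical and this is not an official tool; back up the log before trying anything like it.

```python
import json
from pathlib import Path

def stringify_operation_metrics(log_dir: str) -> None:
    """Rewrite commitInfo.operationMetrics values as JSON strings in every
    commit file under _delta_log. Illustrative only; back up the log first."""
    for commit_file in sorted(Path(log_dir).glob("*.json")):
        rewritten = []
        for line in commit_file.read_text().splitlines():
            action = json.loads(line)
            metrics = action.get("commitInfo", {}).get("operationMetrics")
            if isinstance(metrics, dict):
                action["commitInfo"]["operationMetrics"] = {
                    k: v if isinstance(v, str) else json.dumps(v)
                    for k, v in metrics.items()
                }
            rewritten.append(json.dumps(action, separators=(",", ":")))
        commit_file.write_text("\n".join(rewritten) + "\n")

# stringify_operation_metrics("my_table/_delta_log")  # hypothetical local copy of the log
```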
It's not just limited to Glue and Databricks; a lot of other query engines I work with expect the same thing, and everything got resolved because of @mattslack-db's one comment. Thank you so much!
Please patch this :)
Environment
Delta-rs version: rust-v0.16.5, Python deltalake 0.15.1
Binding: Python binding, but I believe this is not specific to Python, as it is just a wrapper over this method
Environment:
Bug
What happened: Running dt.optimize.compact() creates an entry in the delta log which is not readable by Databricks Runtime. Example:
"commitInfo":{"timestamp":1705439516034,"operation":"OPTIMIZE","operationParameters":{"targetSize":"104857600"},"operationMetrics":{"filesAdded":{"avg":50424.6,"max":50517,"min":50335,"totalFiles":5,"totalSize":252123},"filesRemoved":{"avg":14010.447154471545,"max":15213,"min":13942,"totalFiles":123,"totalSize":1723285},"numBatches":123,"numFilesAdded":5,"numFilesRemoved":123,"partitionsOptimized":5,"preserveInsertionOrder":true,"totalConsideredFiles":123,"totalFilesSkipped":0},"clientVersion":"
What you expected to happen: Databricks Runtime (and Delta Sharing) expect filesAdded to be a string.
How to reproduce it: This can be reproduced in Databricks by running dt.optimize.compact() against a Delta table after some appends, and then running DESCRIBE HISTORY against the same table. More details:
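For reference, a minimal reproduction sketch using the Python deltalake package. The table path and data sizes below are hypothetical (the original report used a table on S3 read from Databricks):

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_uri = "/tmp/repro_table"  # hypothetical local path; the original report used S3

# A few small appends so compaction has something to rewrite.
for i in range(5):
    write_deltalake(table_uri, pd.DataFrame({"id": range(i * 10, (i + 1) * 10)}), mode="append")

dt = DeltaTable(table_uri)
print(dt.optimize.compact())  # writes the OPTIMIZE commit with nested operationMetrics

# Then, from Databricks / Spark SQL against the same table:
#   DESCRIBE HISTORY delta.`/tmp/repro_table`;
# which fails on the nested filesAdded / filesRemoved values.
```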