delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

After dt.optimize.compact(), commitInfo.operationMetrics.filesAdded is a JSON map when other readers (e.g. Databricks) expect a string #2087

Closed · mattslack-db closed this issue 2 months ago

mattslack-db commented 10 months ago

Environment

Delta-rs version: rust-v0.16.5 (Python deltalake 0.15.1)

Binding: Python, but I believe this is not specific to Python, as the binding is just a wrapper over the underlying Rust method

Bug

What happened: Running dt.optimize.compact() creates an entry in the delta log which is not readable by Databricks Runtime.

Example: "commitInfo":{"timestamp":1705439516034,"operation":"OPTIMIZE","operationParameters":{"targetSize":"104857600"},"operationMetrics":{"filesAdded":{"avg":50424.6,"max":50517,"min":50335,"totalFiles":5,"totalSize":252123},"filesRemoved":{"avg":14010.447154471545,"max":15213,"min":13942,"totalFiles":123,"totalSize":1723285},"numBatches":123,"numFilesAdded":5,"numFilesRemoved":123,"partitionsOptimized":5,"preserveInsertionOrder":true,"totalConsideredFiles":123,"totalFilesSkipped":0},"clientVersion":"

What you expected to happen: Databricks Runtime (and Delta Sharing) expects filesAdded to be a string.

How to reproduce it: This can be reproduced in Databricks by running dt.optimize.compact() against a Delta table after some appends and then running DESCRIBE HISTORY against the same table.
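A minimal sketch of the repro with the Python binding (the local path and row counts are illustrative; the same happens against S3):

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "/tmp/compact_repro"  # illustrative local path

# A few small appends so there is something to compact
for i in range(5):
    df = pd.DataFrame({"id": list(range(i * 10, (i + 1) * 10))})
    write_deltalake(path, df, mode="append")

dt = DeltaTable(path)
dt.optimize.compact()

# The OPTIMIZE entry's operationMetrics contains nested JSON maps
# (filesAdded / filesRemoved) rather than strings; this is the entry
# DESCRIBE HISTORY in Databricks then fails on.
print(dt.history(1))
```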


Blajda commented 10 months ago

Hi @mattslack-db, thanks for reporting this issue. Does a specification exist for the expected format of operationMetrics? Testing integration with Spark can be a bit annoying, so we would rather test against a spec.

ion-elgreco commented 10 months ago

@Blajda this is part of commitInfo, which is free-form according to the protocol. It's what we discussed in the call last week.

@mattslack-db we are actually following the protocol here; delta-spark is not, since it assumes a specific dtype. We are looking at aligning this and moving some of these assumptions about the expected schema into the protocol.

mattslack-db commented 10 months ago

@ion-elgreco agreed. According to the protocol, "Implementations are free to store any valid JSON-formatted data via the commitInfo action." So although Databricks can make assumptions about the schema of commitInfo (in this case, that operationMetrics is a map<string, string>), it shouldn't fall over when that schema is not followed.
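As a sketch, the tolerant-reader behaviour being argued for might look like this (a hypothetical helper, not actual Databricks code):

```python
import json

def metric_as_string(value):
    # Accept both shapes: the plain string delta-spark writes and the
    # nested map delta-rs currently writes.
    if isinstance(value, str):
        return value
    return json.dumps(value, separators=(",", ":"))

# Both forms normalize to a string:
print(metric_as_string("5"))
print(metric_as_string({"avg": 50424.6, "totalFiles": 5}))
```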

roeap commented 10 months ago

The commit info, or rather having arbitrarily nested data in the actions, is quite a nuisance. So while the protocol gives us that freedom (for now), I believe delta-rs should avoid it and just dump JSON payloads as strings where applicable.
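For illustration, a sketch of that suggestion using the filesAdded payload from the example above:

```python
import json

files_added = {"avg": 50424.6, "max": 50517, "min": 50335,
               "totalFiles": 5, "totalSize": 252123}

# What delta-rs writes today: a nested JSON map
nested = {"operationMetrics": {"filesAdded": files_added}}

# The suggested shape: the same payload dumped as a string, so that
# operationMetrics stays a map<string, string>
stringified = {
    "operationMetrics": {
        "filesAdded": json.dumps(files_added, separators=(",", ":"))
    }
}

print(json.dumps(nested))
print(json.dumps(stringified))
```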

mayanksingh2298 commented 9 months ago

I've been noticing something similar. After I run compact() over my Delta table stored in S3, I'm unable to run a Glue crawler on it to produce the metadata catalog (it fails with an Internal Service Exception, without any further debug info).

At first I thought it was a transient AWS error, but it now looks like it is caused by this bug.

Are there any plans to fix this? Also, would running OPTIMIZE via Apache Spark avoid the problem?

mayanksingh2298 commented 9 months ago

The same thing happens with filesRemoved.

I verified my last comment by manually rewriting the filesAdded and filesRemoved fields as strings instead of JSON maps; AWS Glue then crawled the table successfully.
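A rough sketch of that manual edit (hypothetical local path; assumes one JSON action per line, and should not be run against a table that other writers are using):

```python
import json
from pathlib import Path

log_dir = Path("/tmp/compact_repro/_delta_log")  # hypothetical path
commit = sorted(log_dir.glob("*.json"))[-1]      # the OPTIMIZE commit

patched = []
for line in commit.read_text().splitlines():
    action = json.loads(line)
    metrics = action.get("commitInfo", {}).get("operationMetrics")
    if metrics is not None:
        # Stringify each value so operationMetrics becomes map<string, string>
        action["commitInfo"]["operationMetrics"] = {
            k: v if isinstance(v, str) else json.dumps(v, separators=(",", ":"))
            for k, v in metrics.items()
        }
    patched.append(json.dumps(action, separators=(",", ":")))

commit.write_text("\n".join(patched) + "\n")
```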

It's not just limited to Glue and Databricks; a lot of other query engines I work with expect the same thing, and everything was resolved thanks to @mattslack-db's comment above. Thank you so much!

Please patch this :)