delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.21k stars, 394 forks

Concurrent checkpoint creation leads to corrupt delta table for delta-rs readers #2643

Open Werepyrex10 opened 3 months ago

Werepyrex10 commented 3 months ago

I have the following setup:

When both processes decide to create a checkpoint for the same version, neither write fails, because the notebook writes a multi-part checkpoint while the delta-rs process writes a single-part checkpoint. Here is a preview of the result of both operations, as seen in the Delta log: [image]

After this occurs, when trying to open the table with the delta-rs lib, we get the following error:

Failed to create delta ops: MetadataError("Number of checkpoint files '3' is not equal to number of checkpoint metadata parts 'None'")

This is because of the way the library counts the number of parts: https://github.com/delta-io/delta-rs/blob/rust-v0.17.3/crates/core/src/kernel/snapshot/log_segment.rs#L447-L452

Should the library ignore the multi-part files if the _last_checkpoint file does not have any parts specified?
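
A minimal Python sketch of the behaviour asked about above, for illustration only. It is not the delta-rs implementation (which is Rust), and `select_checkpoint_files` and the sample file names are hypothetical. It assumes the Delta protocol's checkpoint naming: `<version>.checkpoint.parquet` for a single-file checkpoint and `<version>.checkpoint.<part>.<total>.parquet` for a multi-part one. The idea is to let the `parts` value from _last_checkpoint decide which files to read, instead of counting every checkpoint file for that version.

```python
import re
from typing import List, Optional

# Single-file checkpoint: <20-digit version>.checkpoint.parquet
SINGLE = re.compile(r"^(\d{20})\.checkpoint\.parquet$")
# Multi-part checkpoint: <20-digit version>.checkpoint.<part>.<total>.parquet
MULTI = re.compile(r"^(\d{20})\.checkpoint\.(\d{10})\.(\d{10})\.parquet$")


def select_checkpoint_files(
    names: List[str], version: int, parts: Optional[int]
) -> List[str]:
    """Return the checkpoint files for `version` that match `parts`.

    `parts` is the value from _last_checkpoint: None means a single-file
    checkpoint, an integer means a multi-part checkpoint with that many files.
    """
    single = []
    multi = []
    for name in names:
        m = SINGLE.match(name)
        if m and int(m.group(1)) == version:
            single.append(name)
            continue
        m = MULTI.match(name)
        if m and int(m.group(1)) == version:
            # Keep the declared total so we can match it against `parts`.
            multi.append((int(m.group(3)), name))

    if parts is None:
        # Ignore any multi-part files written by another writer.
        if len(single) != 1:
            raise ValueError(f"expected 1 single-part checkpoint, found {len(single)}")
        return single

    # Zero-padded names sort in part order.
    selected = sorted(name for total, name in multi if total == parts)
    if len(selected) != parts:
        raise ValueError(f"expected {parts} checkpoint parts, found {len(selected)}")
    return selected


# Hypothetical log listing: both writers checkpointed version 10.
log = [
    "00000000000000000010.json",
    "00000000000000000010.checkpoint.parquet",
    "00000000000000000010.checkpoint.0000000001.0000000002.parquet",
    "00000000000000000010.checkpoint.0000000002.0000000002.parquet",
]
print(select_checkpoint_files(log, 10, None))
```

With `parts` unset, the sketch returns only the single-file checkpoint and skips the two multi-part files, rather than failing on the count mismatch.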

djouallah commented 3 months ago

I don't think using multiple Delta writers on the same table is a good idea. The whole ecosystem is not mature enough; just use one writer for everything.

ion-elgreco commented 3 months ago

@Werepyrex10 I suggest you disable checkpointing in either delta-spark or delta-rs for now.

rtyler commented 2 months ago

@Werepyrex10 What storage backend is this? If it's S3, is the Databricks cluster using the same S3DynamoDbLogStore configuration as the delta-rs process?

Werepyrex10 commented 2 months ago

Hey @rtyler, we are using Azure Blob Storage as the storage backend.

mriccardi89 commented 1 month ago

We had the same issue. Either you configure DynamoDB for the Delta log or you use only one writer. We ended up with the single-writer solution.