Aiden-Frost opened this issue 1 week ago
When the writer threw an error, I expected the write to fail. When multiple writers write to the same location, only one should succeed and the others should fail and return an error.
I think the information you are missing is that in a Delta table, a write happens in two stages: (1) write the data files (Parquet), then (2) commit the transaction to the log. The operation detects that the table already exists and fails at step (2). The files created in step (1) as part of the failed transaction can be cleaned up with the VACUUM operation.
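A minimal sketch of that two-stage failure, assuming the `deltalake` Python package (`DeltaError` from `deltalake.exceptions` is the base class for errors surfaced by the Rust core; depending on the version, the Python side may instead raise `FileExistsError` before stage (1)):

```python
import pandas as pd
from deltalake import write_deltalake
from deltalake.exceptions import DeltaError

df = pd.DataFrame({"id": [1, 2, 3]})

try:
    # Stage (1) writes the Parquet data files; stage (2) attempts the
    # commit to the transaction log. With mode="error", the commit fails
    # when the table already exists, but the stage-(1) files stay behind.
    write_deltalake("s3a://test-bucket/file-4", df, mode="error")
except (DeltaError, FileExistsError) as exc:
    print(f"write failed, data files from stage (1) are now orphaned: {exc}")
```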
Thank you for the clarification. I have gone through the transaction and vacuum documentation. Following the example, when I try to vacuum:
```python
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("s3a://test-bucket/file-4")
>>> dt.vacuum()
[]
```
It returns an empty list. Is there something I am missing to tell vacuum to clean up the files from the failed transaction?
You should read the documentation for the vacuum method, particularly the `retention_hours` and `enforce_retention_duration` parameters: https://delta-io.github.io/delta-rs/api/delta_table/#deltalake.DeltaTable.vacuum
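For reference, a call that disables the retention safety check might look like this (parameter names taken from the linked docs; note that `dry_run` defaults to `True`, so `vacuum()` only *lists* the files it would delete):

```python
# dry_run=True (the default) only returns the paths that would be removed;
# pass dry_run=False to actually delete them.
dt.vacuum(retention_hours=0, enforce_retention_duration=False, dry_run=True)
```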
After going through the documentation I set `retention_hours=0` and `enforce_retention_duration=False`, but even after this I still get an empty list from vacuum.
In total, the sample program generates 3 Parquet files at the path. I observed that the `00000000000000000000.json` log file lists one Parquet file under the `add` field, while the other two Parquet files are under `add` in the tmp commit.
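To make that observation concrete, here is a small sketch for listing the data files referenced by `add` actions in a committed log entry (assuming a local copy of the log; each line of a Delta log file is one JSON action):

```python
import json

# Print the data files that the first commit added to the table.
with open("_delta_log/00000000000000000000.json") as log:
    for line in log:
        action = json.loads(line)
        if "add" in action:
            print(action["add"]["path"])
```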
## Environment

**Delta-rs version:** 0.18.1

**Binding:** rust

**Environment:**

## Bug
**What happened:** I have writers running as Python processes, all writing to the same table location. Each writer creates a pandas DataFrame and writes it to that exact location. There are 4 different scenarios for these writes:
When a process throws error 2.1 or 2.2, I expect the write to fail, but when inspecting the table I observe that the writer's data was appended anyway.
For the reproducible example below, let's say process-1 and process-2 hit error 2.2. I expected only process-3's data to be present in the table, but the file contained data from the failed writers as well.

**What you expected to happen:** When the writer threw an error, I expected the write to fail. When multiple writers write to the same location, only one should succeed and the others should fail and return an error.
**How to reproduce it:**
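The original reproduction script is not included in this excerpt; a minimal sketch of the setup described above could look like the following (the process count, schema, and reuse of the `s3a://test-bucket/file-4` path are illustrative):

```python
import multiprocessing as mp

import pandas as pd
from deltalake import write_deltalake
from deltalake.exceptions import DeltaError

TABLE_URI = "s3a://test-bucket/file-4"  # path reused from the comments above

def writer(proc_id: int) -> None:
    df = pd.DataFrame({"proc": [proc_id], "value": [proc_id * 10]})
    try:
        # With mode="error", exactly one concurrent writer should succeed.
        write_deltalake(TABLE_URI, df, mode="error")
        print(f"process-{proc_id}: commit succeeded")
    except (DeltaError, FileExistsError) as exc:
        print(f"process-{proc_id}: write failed: {exc}")

if __name__ == "__main__":
    procs = [mp.Process(target=writer, args=(i,)) for i in (1, 2, 3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```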
**More details:**
Regarding the `Generic error: A Delta Lake table already exists at that location` error: I believe this is handled in this part of the code in `crates/core/src/logstore/mod.rs`, and that function is called from `crates/core/src/operations/create.rs`.

For the `Delta table already exists, write mode set to error` case, this is handled on the Python side in `python/deltalake/writer.py`.
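For context on that Python-side handling, here is a sketch of the user-facing behavior of each `write_deltalake` mode when a table already exists at the target (the URI and DataFrame are placeholders, and the calls are shown independently, since `mode="error"` raises):

```python
import pandas as pd
from deltalake import write_deltalake

TABLE_URI = "s3a://test-bucket/file-4"  # placeholder path
df = pd.DataFrame({"id": [1]})

# Behavior of each write mode when a table already exists at TABLE_URI:
write_deltalake(TABLE_URI, df, mode="error")      # raise (this is the default mode)
write_deltalake(TABLE_URI, df, mode="append")     # add the rows as new data files
write_deltalake(TABLE_URI, df, mode="overwrite")  # replace the table contents
write_deltalake(TABLE_URI, df, mode="ignore")     # silently skip the write
```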