delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

>0.17.0 _delta_log gets corrupted after overwrite (log files grow and grow, up to 350 MB per file) #3006

Closed TinoSM closed 31 minutes ago

TinoSM commented 4 days ago

Environment

Delta-rs version: 0.21.0, 0.20.0, 0.19.0. I can't test with 0.18.0. In 0.17.0 it works fine.

Binding: Python

Environment:


Bug

What happened: When overwriting a table, the whole schema gets rewritten (already reported here https://github.com/delta-io/delta-rs/pull/2923). In addition, I think because of how the JSON metadata is encoded/decoded, all \ characters get escaped again on every overwrite. These characters come from Spark comments/metadata, for example, or from my own comments.

The JSON log files of one of my "development" tables grew to 350 MB, and now delta can't scan them anymore (thrift buffer size limits :) )
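To illustrate the suspected mechanism, here is a minimal sketch using only Python's standard `json` module. It is not delta-rs code, and the `comment` value and the `reencode` helper are hypothetical; it only shows how serializing an already-serialized string again, without decoding it first, multiplies the backslashes each time.

```python
import json

# Hypothetical metadata comment containing a single literal backslash.
comment = "path C:\\data"

def reencode(text: str, rounds: int) -> str:
    """Serialize `text` as JSON `rounds` times without decoding in between."""
    for _ in range(rounds):
        text = json.dumps(text)
    return text

# Count backslashes after 0..4 rounds of re-encoding.
backslash_counts = [reencode(comment, n).count("\\") for n in range(5)]
print(backslash_counts)

# Decoding before encoding again is stable: the text does not grow.
once = json.dumps(comment)
assert json.dumps(json.loads(once)) == once
```

The count at least doubles with every round, which would be consistent with log files that keep growing on each overwrite.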

What you expected to happen:

When rewriting metadata, no extra escape characters should be added again

How to reproduce it:

I'm sorry but I can only test with polars :(

https://docs.pola.rs/api/python/stable/reference/api/polars.DataFrame.write_delta.html

import polars as pl

df = pl.DataFrame({
     "active": [1, 2, 3, 4, 5],
     "id": ["A", "B", "A", "B", "C"],
})

df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
# Passing delta_write_options={"engine": "pyarrow"} avoids the issue
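To watch the growth after running the reproduction, something like the following sketch could be used. It assumes the commit files are the `*.json` files directly under `./test_table/_delta_log`; the `commit_log_sizes` helper is hypothetical and only uses the standard library.

```python
from pathlib import Path

def commit_log_sizes(table_path: str) -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for each JSON commit file, in name order."""
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return []
    return [(p.name, p.stat().st_size) for p in sorted(log_dir.glob("*.json"))]

# Print one line per commit file; sizes that keep increasing between
# overwrites of identical data would point at the problem described here.
for name, size in commit_log_sizes("./test_table"):
    print(f"{name}: {size} bytes")
```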

More details: test_table.zip contains the empty delta table with the active and id columns. test_table_broken.zip contains the tables with many \\\ sequences.

Image: output of cat 00008.json and 0000.json, showing how the \\ grew.

test_table_broken.zip test_table.zip

TinoSM commented 31 minutes ago

Fixed in 0.22