delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Can corrupt delta table commit file min/max values with binary column #2629

Open · echai58 opened 5 months ago

echai58 commented 5 months ago

Environment

Delta-rs version: 0.18.1

Binding: python


Bug

What happened: I encountered a bizarre exception

Exception: Json error: Invalid UTF-16 surrogate pair 1406143

when trying to load one of my delta tables. After some investigation, I discovered some really strange behavior in how delta-rs encodes binary values in the commit file JSON, specifically for the min/maxValues.

Essentially, the way I understand it, a binary value gets encoded as a string as follows: any printable ASCII byte gets emitted as-is, and any other byte (value >= 128, or a control byte like 0x1C) gets emitted as a \uXXXX escape of its codepoint. e.g.

'hello world \xdd\x1c'

becomes

\"hello world \\u00dd\\u001c\"

However, when the binary happens to contain a byte sequence that looks like a \uXXXX escape in the UTF-16 surrogate range, the writer emits it into the JSON unescaped; when delta-rs later tries to parse the string, it sees the escape, tries to decode it as a UTF-16 surrogate pair, and fails, because it is not actually a valid pair.
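
For illustration, Python's own json module is lenient and will parse a lone surrogate escape, but the resulting string is not valid Unicode and cannot even be round-tripped through UTF-8; a stricter parser, like whatever delta-rs uses to read the commit file, rejects it outright:

import json

# Python accepts the lone surrogate escape and returns a string containing it
s = json.loads('"hello \\udd1c world"')

# ... but that string is not valid Unicode, so encoding it fails:
s.encode("utf-8")
# UnicodeEncodeError: 'utf-8' codec can't encode character '\udd1c'
# in position 6: surrogates not allowed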

For example:

from deltalake import DeltaTable, write_deltalake
import pandas as pd
import pyarrow as pa

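# note: \\uDd1c below is a literal backslash followed by the text "uDd1c";
# \xdd and \x1c are two raw bytes (0xDD, 0x1C)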
bad_binary = b'hello \\uDd1c world \xdd\x1c'

df = pd.DataFrame.from_dict(
    {
        "p": [1],
        "k": [1],
        "v": [bad_binary],
    }
)

write_deltalake(
    "test-bad-binary",
    pa.table(df, schema=pa.schema([("p", pa.int64()), ("k", pa.int64()), ("v", pa.binary())])),
    partition_by="p",
    mode="append",
)

The write goes through, but the DeltaTable is now corrupt and can no longer be loaded. Inspecting the commit, we see

"hello \udd1c world \u00dd\u001c"

and since \udd1c falls in the surrogate range, loading fails with the JSON parse error above.
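
For reference, this is how the corruption surfaces (the same exception as at the top of this report):

from deltalake import DeltaTable

# loading the table after the bad write raises:
# Exception: Json error: Invalid UTF-16 surrogate pair ...
dt = DeltaTable("test-bad-binary")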

Furthermore, the capital letter in the original \uDd1c gets written as lowercase \udd1c, which is just incorrect, though probably an unrelated issue.

What you expected to happen: When I write the same data using Spark, the commit file Spark generates does not include any min/max values for binary columns. I expect omitting these stats to be the correct resolution for this bug.
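
For anyone wanting to reproduce the comparison, a rough PySpark sketch of the equivalent write (assuming a Spark session configured with delta-spark; the table path is arbitrary):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# same data as the repro above; bytearray maps to Spark's BinaryType
sdf = spark.createDataFrame(
    [(1, 1, bytearray(b"hello \\uDd1c world \xdd\x1c"))], ["p", "k", "v"]
)
sdf.write.format("delta").partitionBy("p").mode("append").save("test-bad-binary-spark")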

abhiaagarwal commented 5 months ago

Can you set engine="rust" and see if the error replicates?

echai58 commented 5 months ago

> Can you set engine="rust" and see if the error replicates?

@abhiaagarwal The error does not replicate with engine="rust". The commit file has the following:

\"hello \\\\\\\\uDd1c world \\\\xdd\\\\x1c\"

which properly escapes the \ and maintains the casing correctly as well.
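
For completeness, the only change to the repro is the engine flag (available on write_deltalake in 0.18.x):

write_deltalake(
    "test-bad-binary",
    pa.table(df, schema=pa.schema([("p", pa.int64()), ("k", pa.int64()), ("v", pa.binary())])),
    partition_by="p",
    mode="append",
    engine="rust",  # use the Rust writer instead of the default pyarrow writer
)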