Open echai58 opened 5 months ago
Can you set engine="rust"
and see if the error replicates?
Can you set
engine="rust"
and see if the error replicates?
@abhiaagarwal The error does not replicate with engine="rust"
. The commit file has the following:
\"hello \\\\\\\\uDd1c world \\\\xdd\\\\x1c\"
which is properly escaping the \
and is mantaining the casing correctly as well.
Environment
Delta-rs version: 0.18.1
Binding: python
Bug
What happened: I encountered a bizarre exception
when trying to load one of my delta tables. After doing some investigation, I discovered some really weird behavior around how delta-rs encodes binary in the commit file json, specifically for the min/maxValues.
Essentially, the way I understand how a binary gets encoded as a string is as follows: any byte whose codepoint is < 128 gets encoded as-is, and any byte whose codepoint > 128 gets encoded as its unicode codepoint. e.g.
becomes
However, when there happens to be a sequence of bytes in the binary that falls in the UTF-16 surrogate range, when delta-rs tries to decode the string, it sees the surrogate and tries to decode it as UTF-16, but since it's not actually valid UTF-16 surrogate, it fails.
For example:
The write goes through, but now the DeltaTable is corrupt, and you cannot load the DeltaTable anymore. Inspecting the commit, we see
and since
\udd1c
falls in the surrogate range, we see a json parse error about that.Furthermore, the capitalized letter gets written as lowercase, which is just incorrect, but probably unrelated.
What you expected to happen: When I write the same data using spark, I see the commit file generated by spark does not write any min/max values for binary columns. I expect this to be the correct resolution for this bug.