Open cjolowicz opened 1 month ago
This seems to be an issue with how we compute the z-order key. Here's the output of df.show() during read_zorder when running the failing test above:
+-----------------+----------------------------------+
| b | __zorder_key |
+-----------------+----------------------------------+
| 000000000000002 | 023030303030303030ff303030303030 |
| 000000000000003 | 023030303030303030ff303030303030 |
| 000000000000001 | 023030303030303030ff303030303030 |
+-----------------+----------------------------------+
Every row gets the same z-order key even though they have distinct values in the z-order column b.
Update: This happens because we only look at the first 16 bytes of each column to compute the z-order key.
I've updated the issue description ("with identical prefixes of at least 14 characters").
This limitation is mentioned in the PR for the original z-order implementation: https://github.com/delta-io/delta-rs/pull/1429#issue-1739756523
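To make the collision concrete, here is a small sketch that splits the key from the df.show() output above by hand. The byte layout it assumes (one leading marker byte, an 8-byte block, one continuation byte, then the remaining 6 bytes) is my reading of the hex dump, not something taken from the delta-rs source. Under that assumption only 14 bytes of the string survive in the 16-byte key, which would explain the "at least 14 characters" threshold.

```python
# Key copied from the df.show() output above (identical for all three rows).
key = bytes.fromhex("023030303030303030ff303030303030")

# ASSUMED layout, inferred from the dump rather than from the implementation:
# 1 marker byte | 8 data bytes | 1 continuation byte | 6 data bytes
marker, block, cont, tail = key[:1], key[1:9], key[9:10], key[10:]
payload = (block + tail).decode("ascii")

# The three values of column b from the same output.
values = ["000000000000002", "000000000000003", "000000000000001"]

print(len(key), marker.hex(), cont.hex(), payload)
# The distinct prefixes that fit into the payload: only one, so the keys collide.
print({v[: len(payload)] for v in values})
```

The values only differ in their 15th character, which is past the 14 bytes that fit, so all three rows end up with the same key.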
@wjones127 The z-order design document recommends an implementation that would avoid the issue of dropping bytes for long strings (decision 1, option 3). It seems we ended up going with option 1 instead, which does have that issue. Do you think option 3 is still a viable approach for us? Any pointers for how to implement this?
Another, simpler option would be to make the number of significant bytes per z-order column configurable.
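A rough sketch of what a configurable width could look like, in plain Python rather than the actual Rust code. The function name zorder_key and the width parameter are hypothetical and only serve to illustrate the idea; the real implementation works on Arrow row-encoded columns, which this sketch does not attempt to reproduce.

```python
def zorder_key(values: list[bytes], width: int) -> bytes:
    """Interleave the bits of each value, truncated or zero-padded to `width` bytes.

    Hypothetical helper for illustration; not part of delta-rs.
    """
    padded = [v[:width].ljust(width, b"\x00") for v in values]
    n = len(padded)
    out = bytearray(width * n)
    for bit in range(width * 8):
        for col, value in enumerate(padded):
            # Read bit `bit` of column `col`, most significant bit first.
            b = (value[bit // 8] >> (7 - bit % 8)) & 1
            # Columns take turns: output bit `bit * n + col`.
            pos = bit * n + col
            out[pos // 8] |= b << (7 - pos % 8)
    return bytes(out)


rows = [b"000000000000002", b"000000000000003", b"000000000000001"]

# 14 significant bytes: every row maps to the same key.
short = {zorder_key([r], width=14) for r in rows}
# 16 significant bytes: the differing 15th byte is now part of the key.
longer = {zorder_key([r], width=16) for r in rows}

print(len(short), len(longer))
print(sorted(rows, key=lambda r: zorder_key([r], width=16)))
```

The trade-off is key size: the key grows with width times the number of z-order columns, so a larger default would cost memory during the sort even for tables that do not need it.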
Environment
Delta-rs version: 0.19.1
Binding: Python and Rust
Environment:
Bug
What happened:
Apply z-order to a Delta Table on a column that contains strings with identical prefixes of at least 14 characters. The records in the new Parquet files retain their original order.
I initially witnessed this when z-ordering a large partition on ISO 8601 timestamps using delta-rs in Rust. I've since reproduced this with Python bindings and a small data frame using strings containing zero-padded integers (see repro below).
What you expected to happen:
The resulting Parquet files are ordered by the column specified for z-ordering.
How to reproduce it:
Run this with uv:
Output:
More details:
N/A