delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.25k stars 400 forks source link

Z-order is no-op for strings with identical prefix of length >= 14 #2844

Open cjolowicz opened 1 month ago

cjolowicz commented 1 month ago

Environment

Delta-rs version:

0.19.1

Binding: Python and Rust

Environment:


Bug

What happened:

Apply z-order to a Delta Table on a column that contains strings with identical prefixes of at least 14 characters. The records in the new Parquet files retain their original order.

I initially witnessed this when z-ordering a large partition on ISO 8601 timestamps using delta-rs in Rust. I've since reproduced this with Python bindings and a small data frame using strings containing zero-padded integers (see repro below).

What you expected to happen:

The resulting Parquet files are ordered by the column specified for z-ordering.

How to reproduce it:

# test_zorder.py
import shutil
import pandas
from deltalake import write_deltalake, DeltaTable

def test_zorder() -> None:
    table = "a"
    field = "b"
    items = [f"{item:015}" for item in [2, 3, 1]]

    shutil.rmtree(table, ignore_errors=True)

    write_deltalake(table, pandas.DataFrame({field: items}))

    DeltaTable(table).optimize.z_order([field])

    sorted_items = DeltaTable(table).to_pyarrow_table().to_pydict()[field]

    assert sorted(items) == sorted_items

Run this with uv:

# caveat: this removes a directory named `a` from the current directory
uvx --with deltalake --with pandas pytest -vv test_zorder.py

Output:

========================= test session starts ==========================
platform linux -- Python 3.12.5, pytest-8.3.2, pluggy-1.5.0 -- /home/claudio/.cache/uv/archive-v0/A-uQ68p-4BWRUFltJ5Mv2/bin/python
cachedir: .pytest_cache
rootdir: ...
collected 1 item                                                       

test_zorder.py::test_zorder FAILED                               [100%]

=============================== FAILURES ===============================
_____________________________ test_zorder ______________________________

...

>       assert sorted(items) == sorted_items
E       AssertionError: assert ['000000000000001', '000000000000002', '000000000000003'] == ['000000000000002', '000000000000003', '000000000000001']
E         
E         At index 0 diff: '000000000000001' != '000000000000002'
E         
E         Full diff:
E           [
E         +     '000000000000001',
E               '000000000000002',
E               '000000000000003',
E         -     '000000000000001',
E           ]

test_zorder.py:20: AssertionError
======================= short test summary info ========================
FAILED test_zorder.py::test_zorder - AssertionError: assert ['000000000000001', '000000000000002', '000000000000003'] == ['000000000000002', '000000000000003', '000000000000001']

  At index 0 diff: '000000000000001' != '000000000000002'

  Full diff:
    [
  +     '000000000000001',
        '000000000000002',
        '000000000000003',
  -     '000000000000001',
    ]
========================== 1 failed in 0.36s ===========================

More details:

N/A

cjolowicz commented 1 month ago

This seems to be an issue with how we compute the z-order key. Here's the output of df.show() during read_zorder when running the failing test above:

+-----------------+----------------------------------+
| b               | __zorder_key                     |
+-----------------+----------------------------------+
| 000000000000002 | 023030303030303030ff303030303030 |
| 000000000000003 | 023030303030303030ff303030303030 |
| 000000000000001 | 023030303030303030ff303030303030 |
+-----------------+----------------------------------+

Every row gets the same z-order key even though they have distinct values in the z-order column b.

Update: This happens because we only look at the first 16 bytes of each column to compute the z-order key.

https://github.com/delta-io/delta-rs/blob/de11e6bd4a64d6ccaba01b16394d47d667e34797/crates/core/src/operations/optimize.rs#L1426-L1431

I've updated the issue description ("with identical prefixes of at least 14 characters").

This limitation is mentioned in the PR for the original z-order implementation: https://github.com/delta-io/delta-rs/pull/1429#issue-1739756523

cjolowicz commented 1 month ago

@wjones127 The z-order design document recommends an implementation that would avoid the issue of dropping bytes for long strings (decision 1, option 3). It seems we ended up going with option 1 instead, which does have that issue. Do you think option 3 is still a viable approach for us? Any pointers for how to implement this?

Another, simpler option would be to make the number of significant bytes per z-order column configurable.