delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Write monotonic sequence, but read is non monotonic #2659

Closed mikeburkat closed 3 weeks ago

mikeburkat commented 1 month ago

Environment

Delta-rs version: 0.18.2

Binding: python

Environment:


Bug

What happened: I wrote a monotonically incrementing sequence into a deltalake table using the pyarrow engine. When reading this deltalake table, the data is no longer monotonically incrementing.

What you expected to happen: I expect the data to be monotonically incrementing. The rust engine works as expected; the pyarrow engine, however, appears to reorder the data.

How to reproduce it: Minimal example which reproduces the bug consistently on my laptop.

# out_of_order.py

import argparse
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

from deltalake import write_deltalake, DeltaTable

def write_data_file(file, schema, length, batch_size):
    with pq.ParquetWriter(file,
                          schema=schema,
                          compression='gzip',
                          compression_level=6) as writer:

        for i in range(0, length, batch_size):
            rows = min(i + batch_size, length) - i
            df = pd.DataFrame(range(i, i + rows, 1), columns=['increment'])
            batch = pa.record_batch(schema=schema, data=df)
            writer.write_batch(batch)

    df = pd.read_parquet(file)
    assert df['increment'].is_monotonic_increasing, 'data file not monotonic'

def write_delta(engine, uri, schema, file, batch_size):
    with pq.ParquetFile(file) as data:
        write_deltalake(table_or_uri=uri,
                        data=data.iter_batches(batch_size=batch_size),
                        schema=schema,
                        mode='overwrite',
                        engine=engine)

def assert_monotonic(engine, uri):
    dt = DeltaTable(uri)
    assert dt.to_pandas()['increment'].is_monotonic_increasing, f'{engine} not monotonic'

if __name__ == '__main__':
    parser = argparse.ArgumentParser(usage=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--path', required=True, help='Deltalake table path')
    parser.add_argument('--length', default=62914561, type=int, help='Dataset length')
    parser.add_argument('--batch-size', default=100_000, type=int, help='Batch size')
    args = parser.parse_args()

    schema = pa.schema([
        pa.field('increment', pa.int64(), nullable=False),
    ])

    os.makedirs(args.path, exist_ok=True)
    file = args.path + '/monotonic'
    write_data_file(file, schema, args.length, args.batch_size)

    uri = args.path + '/rust'
    write_delta('rust', uri, schema, file, args.batch_size)
    assert_monotonic('rust', uri)

    uri = args.path + '/pyarrow'
    write_delta('pyarrow', uri, schema, file, args.batch_size)
    assert_monotonic('pyarrow', uri)

Run using the command:

python out_of_order.py --path $PWD/out_of_order

The following exception is raised:

Traceback (most recent call last):
  File ".../out_of_order.py", line 64, in <module>
    assert_monotonic('pyarrow', uri)
  File ".../out_of_order.py", line 40, in assert_monotonic
    assert dt.to_pandas()['increment'].is_monotonic_increasing, f'{engine} not monotonic'
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: pyarrow not monotonic

More details: The file which appears to be out of order on my machine is part-5 ($PWD/out_of_order/pyarrow/0-6e748f6b-f69e-47dd-8857-dc652c73cfef-5.parquet). This reproduces across multiple runs, suggesting it is somewhat deterministic; however, a different "row group" is out of order each time, since inspecting the file shows monotonicity breaking at a different increment value on each run.

mikeburkat commented 1 month ago

Looks like this is mainly due to a known issue in pyarrow: https://github.com/apache/arrow/issues/39030
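As a defensive workaround (my own suggestion, not something from the linked thread): since the values in each file are themselves ordered and only the read order is scrambled, sorting after the read restores the sequence. A minimal self-contained sketch of that, simulating two "files" read back in the wrong order:

```python
import pandas as pd

# Simulate an out-of-order read: two ordered chunks come back swapped,
# the same symptom as the pyarrow-engine table in this issue.
df = pd.concat(
    [pd.DataFrame({"increment": range(100, 200)}),
     pd.DataFrame({"increment": range(0, 100)})],
    ignore_index=True,
)
assert not df["increment"].is_monotonic_increasing

# Impose the intended order explicitly instead of relying on file /
# row-group read order.
df = df.sort_values("increment", ignore_index=True)
assert df["increment"].is_monotonic_increasing
```

The same `sort_values` call can be applied to the frame returned by `DeltaTable.to_pandas()` before asserting monotonicity.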

However, I did find that a _delta_log transaction with multiple add actions can list its part files in unsorted order within the transaction. This also contributes to the problem: if the add actions' files are read in "transaction order", the data can appear unsorted even though the data in each individual file is ordered.
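To illustrate the log-order point with a minimal sketch (the commit-file names and paths below are hypothetical, but the line-delimited JSON shape matches the Delta log format): each commit is a JSON-lines file, and the add actions in it need not appear in part-file order.

```python
import json

# Hypothetical contents of a _delta_log commit file (one JSON action per
# line). The add actions are listed in an order that does not match the
# part-file numbering.
commit_lines = [
    '{"add": {"path": "0-abc-5.parquet"}}',
    '{"add": {"path": "0-abc-2.parquet"}}',
    '{"add": {"path": "0-abc-0.parquet"}}',
]

# A reader that walks add actions in transaction order sees the files in
# this sequence, not in sorted order.
paths = [json.loads(line)["add"]["path"] for line in commit_lines]
print(paths != sorted(paths))  # prints True: transaction order != sorted order
```

So even with each part file internally sorted, concatenating files in transaction order can yield a non-monotonic result.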

ion-elgreco commented 3 weeks ago

This is something we cannot guarantee in the first place, and the PyArrow engine will be deprecated as of v0.19, so I am closing this issue.