delta-io / delta-rs

Z-Order with larger dataset resulting in memory error #2284

Closed pyjads closed 3 months ago

pyjads commented 4 months ago

Environment

Windows (8 GB RAM)

Delta-rs version: 0.16.0


Bug

What happened:

from datetime import timedelta

from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # placeholder path to the partitioned Delta table

# Commit progress periodically (at least 60 s apart) during the long-running optimize
delta = timedelta(seconds=60)

dt.optimize.z_order(
    ["user_id", "product"],
    max_spill_size=4194304000,  # ~4 GB spill budget
    min_commit_interval=delta,
    max_concurrent_tasks=1,
)

I am trying to execute z-order on partitioned data. There are 65 partitions, and each partition contains approx. 900 MB of data spread across approx. 16 Parquet files of approx. 55 MB each. It results in the following error:

DeltaError: Failed to parse parquet: Parquet error: Z-order failed while scanning data: ResourcesExhausted("Failed to allocate additional 403718240 bytes for ExternalSorter[2] with 0 bytes already allocated - maximum available is 381425355").

I am new to deltalake and don't have much knowledge of how z_order works. Is it due to the large amount of data? I am trying to run it on my local laptop with limited resources.

ion-elgreco commented 3 months ago

@pyjads since you have a partitioned table, you can run optimize.z_order on each partition separately. You can use the partition_filters parameter for that; see the sketch below.
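A minimal sketch of that approach (untested; the table path is a placeholder, and it assumes DeltaTable.partitions() is available to enumerate the distinct partition values):

from deltalake import DeltaTable

dt = DeltaTable("path/to/table")

# Z-order one partition at a time so only that partition's
# data has to fit in memory.
for partition in dt.partitions():
    filters = [(col, "=", value) for col, value in partition.items()]
    dt.optimize.z_order(
        ["user_id", "product"],
        partition_filters=filters,
        max_concurrent_tasks=1,
    )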

adriangb commented 1 month ago

Shouldn't delta-rs automatically be doing the z-order within partitions anyway, since you can't z-order across partitions? And if a partition is too big to fit in memory, shouldn't it spill to disk?

Anecdotally, spilling to disk does not seem to work: unless I set max_spill_size to a very large value and spill to swap, even a medium-sized table can't be z-ordered.