delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.27k stars 402 forks source link

When z-order optimizing, keep partition in only one row_group (if possible) #2769

Open deanm0000 opened 2 months ago

deanm0000 commented 2 months ago

Description

I'd like to see z-order optimization be extended to the row_group by having a row_group end if it can't fit the next partition in it rather than splitting that partition. As it is now, if I query the table for that partition value, the reader has to read 2 row_groups instead of just 1.

I don't have an MRE for the data setup but here's a demo with my own data behind it.

from deltalake import DeltaTable, WriterProperties
import pyarrow.parquet as pq
import fsspec
dt_path=...
abfs= fsspec(...)
dt=DeltaTable(dt_path)
dt.optimize.z_order(
    ['node_id'], 
    writer_properties=WriterProperties(compression="ZSTD")
    )
dtfile=dt.files()[0]
with abfs.open(f"{dt_path}/{dtfile}", "rb") as ff:
    pqfile=pq.ParquetFile(ff)
stats=[]
for rg in range(pqfile.metadata.num_row_groups):
    stats.append({'rg':rg, 
                    'min_node':pqfile.metadata.row_group(rg).column(3).statistics.min,
                    'max_node':pqfile.metadata.row_group(rg).column(3).statistics.max,
                    'num_values':pqfile.metadata.row_group(rg).column(3).statistics.num_values
                    })
stats[0:5] # Just first 5 for brevity
[{'rg': 0, 'min_node': 1, 'max_node': 49202, 'num_values': 1048576},
 {'rg': 1, 'min_node': 49202, 'max_node': 49636, 'num_values': 1048576},
 {'rg': 2, 'min_node': 49636, 'max_node': 50496, 'num_values': 1048576},
 {'rg': 3, 'min_node': 50496, 'max_node': 52458, 'num_values': 1048576},
 {'rg': 4, 'min_node': 52458, 'max_node': 1048072, 'num_values': 1048576}]

Notice how the max_node in each row group is the min_node for the next row_group which means values of that node span two row_groups so if I query that node then it has to download 2 row groups instead of just 1.

It'd be better if the first row group stopped at 49201 (or whatever came before 49202) and then 49202 was solely in the second rg.

Use Case Faster, more efficient queries of nodes that would otherwise be straddling 2 row_groups.

Related Issue(s) unknown, maybe page index?

deanm0000 commented 1 month ago

I made this which does the above with pyarrow although just one partition at a time. It's on pypi