I'd like to see z-order optimization extended down to the row-group level: a row group should end early if the next partition value can't fit in it, rather than splitting that value across row groups. As it is now, if I query the table for that value, the reader has to read 2 row groups instead of just 1.
I don't have an MRE for the data setup, but here's a demo against my own data.
from deltalake import DeltaTable, WriterProperties
import pyarrow.parquet as pq
import fsspec

dt_path = ...
abfs = fsspec.filesystem("abfs", ...)  # fsspec itself isn't callable

dt = DeltaTable(dt_path)
dt.optimize.z_order(
    ['node_id'],
    writer_properties=WriterProperties(compression="ZSTD"),
)

dtfile = dt.files()[0]
with abfs.open(f"{dt_path}/{dtfile}", "rb") as ff:
    pqfile = pq.ParquetFile(ff)
    stats = []
    for rg in range(pqfile.metadata.num_row_groups):
        col = pqfile.metadata.row_group(rg).column(3).statistics  # node_id column
        stats.append({
            'rg': rg,
            'min_node': col.min,
            'max_node': col.max,
            'num_values': col.num_values,
        })

stats[0:5]  # just the first 5 for brevity
[{'rg': 0, 'min_node': 1, 'max_node': 49202, 'num_values': 1048576},
{'rg': 1, 'min_node': 49202, 'max_node': 49636, 'num_values': 1048576},
{'rg': 2, 'min_node': 49636, 'max_node': 50496, 'num_values': 1048576},
{'rg': 3, 'min_node': 50496, 'max_node': 52458, 'num_values': 1048576},
{'rg': 4, 'min_node': 52458, 'max_node': 1048072, 'num_values': 1048576}]
Notice how the max_node of each row group equals the min_node of the next row group. That means values of that node span two row groups, so a query for that node has to download 2 row groups instead of just 1.
It'd be better if the first row group stopped at 49201 (or whatever value precedes 49202) and 49202 lived solely in the second row group.
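The writer-side change I'm asking for could look something like this (a hypothetical sketch of the splitting rule, not delta-rs code or API): when a row group reaches its target size mid-run, back the cut off to the last key change so no key value straddles the boundary.

```python
def split_at_value_boundary(keys, target_rows):
    """Return row-group sizes that respect the target size without
    splitting a run of equal z-order keys across two groups."""
    groups, start = [], 0
    while start < len(keys):
        end = min(start + target_rows, len(keys))
        if end < len(keys):
            # back off to the start of the run straddling the boundary
            cut = end
            while cut > start and keys[cut] == keys[cut - 1]:
                cut -= 1
            if cut > start:  # only shrink if the group stays non-empty
                end = cut
        groups.append(end - start)
        start = end
    return groups

# keys already sorted by z-order; the run of 5s would straddle a 4-row group
print(split_at_value_boundary([1, 2, 5, 5, 5, 7], 4))  # [2, 4]
```

Here the first group ends early at 2 rows so the run of 5s moves wholly into the second group; a run longer than the target is left as an unavoidable split.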
Use Case
Faster, more efficient queries of nodes that would otherwise straddle 2 row groups.
Related Issue(s)
Unknown, maybe page index?