delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

File skipping according to documentation #2427

Closed (t1g0rz closed this issue 2 months ago)

t1g0rz commented 3 months ago

Environment

Delta-rs version: 0.16.4

Binding: python

Description

The documentation here says:

Ensure the transaction log stores metadata stats for all the columns that benefit from file skipping.

And here it guides us to explicitly specify the columns used for file skipping:

It takes some time to compute column statistics when writing files, and it isn’t worth the effort if you cannot use the column for file skipping. Suppose you have a table column containing a long string of arbitrary text. It’s unlikely that this column would ever provide any data-skipping benefits. So, you can just avoid the overhead of collecting the statistics for this particular column.

I couldn't find a way to exclude some columns from statistics collection for file skipping. In my case, I have quite wide tables (> 200 columns), and 195 of those columns will never be used for file skipping.
An unintended consequence of including all these columns in the statistics calculation is explosive growth of the delta log, because it writes very long strings of min/max/null-count values for every file. This indirectly creates the conditions for issue #2301 and for this report from the Delta Slack.
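
To put a number on the log growth described above, here is a minimal sketch that sums the size of the per-file `stats` strings in each JSON commit of a table's `_delta_log` directory. It assumes the standard Delta layout, where each commit file holds one JSON action per line and `add` actions carry a `stats` string; the function name `stats_bytes_per_commit` is made up for illustration.

```python
import json
from pathlib import Path


def stats_bytes_per_commit(log_dir):
    """Map each JSON commit file name to the total UTF-8 size of the
    `stats` strings found in its `add` actions."""
    sizes = {}
    for commit in sorted(Path(log_dir).glob("*.json")):
        total = 0
        for line in commit.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue  # tolerate blank lines
            action = json.loads(line)
            add = action.get("add")
            # Only `add` actions carry per-file column statistics.
            if add and add.get("stats"):
                total += len(add["stats"].encode("utf-8"))
        sizes[commit.name] = total
    return sizes
```

Calling `stats_bytes_per_commit("path/to/table/_delta_log")` (the path is a placeholder) and comparing the totals with the commit file sizes should show how much of the log is taken up by statistics.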

I found the delta.dataSkippingNumIndexedCols setting here, and I wonder if it's possible to explicitly specify column names instead. Or should I reorder the table so that the skipping columns come first and then set delta.dataSkippingNumIndexedCols to their count?
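
The reordering idea can be sketched as follows: move the skipping columns to the front and set delta.dataSkippingNumIndexedCols to how many there are. This is only an illustration of the workaround being asked about. The helper name `plan_stats_layout` is hypothetical, and whether delta-rs 0.16.4 honors the property when writing is not established here.

```python
def plan_stats_layout(all_columns, skipping_columns):
    """Return (reordered column list, table configuration) such that the
    skipping columns come first and only they fall within the indexed range."""
    missing = [c for c in skipping_columns if c not in all_columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    # Deduplicate while keeping the caller's order.
    first = list(dict.fromkeys(skipping_columns))
    first_set = set(first)
    rest = [c for c in all_columns if c not in first_set]
    # Table property values are stored as strings.
    configuration = {"delta.dataSkippingNumIndexedCols": str(len(first))}
    return first + rest, configuration
```

The returned column order would be applied to the data before writing (for example by selecting the columns in that order), and the dictionary passed as the table configuration.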

ion-elgreco commented 3 months ago

@t1g0rz I'm not sure if we are properly respecting these configurations yet.

To select only specific columns, this should be the one: delta.dataSkippingStatsColumns.
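
For illustration, a small sketch of building that property as a comma-separated list of column names. The helper `stats_columns_config` is hypothetical, and it simply rejects names containing commas rather than guessing at an escaping rule. It only builds the configuration mapping and does not show that the writer acts on it.

```python
def stats_columns_config(columns):
    """Build a table configuration that names the columns to collect stats for."""
    names = list(dict.fromkeys(columns))  # drop duplicates, keep order
    if not names:
        raise ValueError("at least one column is required")
    bad = [n for n in names if "," in n]
    if bad:
        # Escaping of special characters is out of scope for this sketch.
        raise ValueError(f"column names containing commas are not handled: {bad}")
    return {"delta.dataSkippingStatsColumns": ",".join(names)}
```

A mapping of this shape is what the `configuration` argument of `write_deltalake` in the Python binding takes, assuming the table is being created by that call.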

But again, I'm not sure if we use this correctly yet. I'll take a deeper look into this tomorrow.

ion-elgreco commented 3 months ago

@t1g0rz I checked, but we don't respect it yet 😄

Buttt I am working on a fix 🕺