Closed by t1g0rz 2 months ago
@t1g0rz I'm not sure if we are properly respecting these configurations yet.
To collect statistics only for specific columns, this should be the property to use: delta.dataSkippingStatsColumns.
But again, I'm not sure we handle it correctly yet. I'll take a deeper look into this tomorrow.
@t1g0rz I checked, but we don't respect it yet 😄
Buttt I am working on a fix 🕺
Environment
Delta-rs version: 0.16.4
Binding: python
Description
The documentation here says:
And this guide tells us to explicitly specify columns for file skipping:
I couldn't find a way to exclude some columns from the statistics collected for file skipping. In my case, I have quite wide tables (> 200 columns), and 195 of those columns will never be used for file skipping.
An unintended consequence of including all these columns in the statistics calculation is explosive growth of the delta log, because every file entry carries very long strings of min/max/null-count values. This indirectly creates the conditions for issue #2301 and for this thread from the Delta Slack.
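To make the log-growth concern concrete, here is a rough sketch that builds a per-file statistics string shaped like the stats JSON in a Delta log entry (numRecords, minValues, maxValues, nullCount) and compares its size for 200 columns against 5. The column names and values are made up for illustration, and real stats vary with data types and column name lengths, so treat the output only as an order-of-magnitude indication.

```python
import json


def stats_json(columns):
    """Build a stats string shaped like the per-file stats in a Delta log entry.

    The values are placeholders; only the overall shape matters here.
    """
    stats = {
        "numRecords": 1000,
        "minValues": {c: 0.0 for c in columns},
        "maxValues": {c: 1_000_000.0 for c in columns},
        "nullCount": {c: 0 for c in columns},
    }
    return json.dumps(stats)


# Hypothetical wide table: 200 columns, only the first 5 used for skipping.
all_columns = [f"feature_{i:03d}" for i in range(200)]
skipping_columns = all_columns[:5]

wide = len(stats_json(all_columns))
narrow = len(stats_json(skipping_columns))

print(f"stats for {len(all_columns)} columns: {wide} characters per file")
print(f"stats for {len(skipping_columns)} columns: {narrow} characters per file")
```

This overhead is paid once per data file in every commit that adds files, which is why limiting statistics to the few columns that are actually filtered on can shrink the log considerably.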
I found the delta.dataSkippingNumIndexedCols setting here, and I wonder if it's possible to explicitly specify column names. Or should I reorder the table to have the skipping columns first and then set delta.dataSkippingNumIndexedCols?
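The reordering approach asked about above could be sketched as follows. The helper name reorder_for_skipping and the column names are hypothetical. The commented-out write call assumes that write_deltalake accepts a configuration mapping of table properties, and whether the writer actually honors the property depends on the delta-rs version (see the maintainer replies in this thread).

```python
def reorder_for_skipping(all_columns, skipping_columns):
    """Return (ordered_columns, configuration) with skipping columns first.

    Hypothetical helper: moves the columns used for file skipping to the
    front and sets delta.dataSkippingNumIndexedCols to their count.
    """
    missing = [c for c in skipping_columns if c not in all_columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")

    skip = set(skipping_columns)
    rest = [c for c in all_columns if c not in skip]
    ordered = list(skipping_columns) + rest

    # Delta table properties are passed as strings.
    configuration = {
        "delta.dataSkippingNumIndexedCols": str(len(skipping_columns)),
    }
    return ordered, configuration


# Made-up schema for illustration.
columns = ["payload_a", "event_date", "payload_b", "customer_id", "payload_c"]
ordered, configuration = reorder_for_skipping(columns, ["event_date", "customer_id"])
print(ordered)
print(configuration)

# Assumed usage with a pyarrow Table named `table` (not executed here):
# from deltalake import write_deltalake
# write_deltalake("path/to/table", table.select(ordered), configuration=configuration)
```

If the writer supports delta.dataSkippingStatsColumns, naming the columns directly would avoid the reordering step altogether.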