Setting `delta.dataSkippingStatsColumns` to a struct column fails if any column inside the struct does not support gathering stats (binary, arrays, maps, etc.). This should be more lenient and skip the unsupported columns rather than throwing an exception.
Motivation
Our use case is a single top-level struct that holds all the important fields we care about, and we want to gather stats on all of them. But this struct has a mix of column types, including arrays and maps, which means we currently can't use `delta.dataSkippingStatsColumns`. The columns in this struct also differ across tables, so listing each supported field individually would be extremely difficult. This is also inconsistent with `dataSkippingNumIndexedCols`, which allows these unsupported types (and in fact still gathers null counts on them, even though it can't gather min/max stats).
While it may make sense to raise an exception if the column named directly in `dataSkippingStatsColumns` is unsupported, it should be possible to specify a struct that has a mix of types within it.
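To make the requested behavior concrete, here is a minimal sketch in plain Python. It is not Delta Lake's implementation: the `Field` type, the `UNSUPPORTED` set, and `resolve_stats_columns` are hypothetical names invented for illustration, and the list of unsupported types is taken only from the examples above (binary, arrays, maps). The sketch raises when the configured column itself is unsupported, but silently skips unsupported leaves nested inside a configured struct.

```python
from dataclasses import dataclass

# Hypothetical, simplified schema model; not Delta Lake's actual classes.
@dataclass(frozen=True)
class Field:
    name: str
    dtype: object  # a type name such as "long", or a tuple of Field for a struct

# Assumed set of types that cannot carry min/max stats (from the issue text).
UNSUPPORTED = {"binary", "array", "map"}

def _lookup(fields, parts):
    """Find the field addressed by a dotted path, descending through structs."""
    for f in fields:
        if f.name == parts[0]:
            if len(parts) == 1:
                return f
            if not isinstance(f.dtype, tuple):
                break  # the path continues, but this field is not a struct
            return _lookup(f.dtype, parts[1:])
    raise KeyError(".".join(parts))

def _supported_leaves(fields, prefix):
    """Collect leaf paths under a struct, skipping unsupported types."""
    leaves = []
    for f in fields:
        full = f"{prefix}.{f.name}"
        if isinstance(f.dtype, tuple):
            leaves.extend(_supported_leaves(f.dtype, full))
        elif f.dtype not in UNSUPPORTED:
            leaves.append(full)
    return leaves

def resolve_stats_columns(schema, path):
    """Return the leaf columns to gather stats on for one configured column."""
    field = _lookup(schema, path.split("."))
    if isinstance(field.dtype, tuple):
        # Lenient case: a struct may mix supported and unsupported types.
        return _supported_leaves(field.dtype, path)
    if field.dtype in UNSUPPORTED:
        # Strict case: the directly named column cannot have stats.
        raise ValueError(f"stats are not supported for column {path!r}")
    return [path]

# Example table: one top-level struct with a mix of types.
schema = (
    Field("id", "long"),
    Field("payload", (
        Field("user_id", "long"),
        Field("tags", "array"),
        Field("attrs", "map"),
        Field("geo", (Field("lat", "double"), Field("raw", "binary"))),
    )),
    Field("blob", "binary"),
)

print(resolve_stats_columns(schema, "payload"))
```

With this schema, configuring `payload` resolves to `payload.user_id` and `payload.geo.lat`, while configuring `blob` directly still raises. Whether skipped leaves should still get null counts, as `dataSkippingNumIndexedCols` does, is left out of the sketch.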
Further details
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
[x] Yes. I can contribute this feature independently.
[ ] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
[ ] No. I cannot contribute this feature at this time.