delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Feature Request] Make `delta.dataSkippingStatsColumns` more lenient for nested columns #2822

Open Kimahriman opened 6 months ago

Kimahriman commented 6 months ago

Feature request

Overview

Setting `delta.dataSkippingStatsColumns` to a struct column fails if any field inside the struct has a type that does not support stats collection (binary, arrays, maps, etc.). The property should be more lenient and skip the unsupported fields rather than throw an exception.

Motivation

Our use case is a single top-level struct that holds all the important fields we care about, and we want to gather stats on all of them. This struct has a mix of column types, including arrays and maps, which means we currently can't use `delta.dataSkippingStatsColumns`. The struct also contains different columns in different tables, so listing each supported field individually would be extremely difficult. The current behavior is also inconsistent with `dataSkippingNumIndexedCols`, which allows these unsupported types (and in fact still gathers null counts on them, even though it can't gather min/max stats).

While it may make sense to raise an exception if the column named directly in `dataSkippingStatsColumns` is unsupported, it should be possible to specify a struct that has a mix of types within it.
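To make the proposal concrete, here is a rough sketch of the resolution logic I have in mind. It is plain Python over a made-up schema model, not Delta's actual code: `Field`, `UNSUPPORTED`, `leaf_paths`, and `resolve_stats_columns` are all hypothetical names, the unsupported set is just the types mentioned above, and only top-level column names are handled for brevity.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, simplified schema model (not Spark's or Delta's classes).
@dataclass
class Field:
    name: str
    dtype: str  # e.g. "long", "string", "binary", "array", "map", "struct"
    children: List["Field"] = field(default_factory=list)

# Assumption: the types named in this issue as not supporting stats.
UNSUPPORTED = {"binary", "array", "map"}

def leaf_paths(f: Field, prefix: str = "") -> List[str]:
    """Collect dotted paths of supported leaf fields, skipping unsupported ones."""
    path = f"{prefix}.{f.name}" if prefix else f.name
    if f.dtype == "struct":
        paths: List[str] = []
        for child in f.children:
            paths.extend(leaf_paths(child, path))
        return paths
    return [] if f.dtype in UNSUPPORTED else [path]

def resolve_stats_columns(schema: List[Field], requested: List[str]) -> List[str]:
    """Strict for a directly named column, lenient for fields nested in a struct."""
    resolved: List[str] = []
    for name in requested:
        match = next((f for f in schema if f.name == name), None)
        if match is None:
            raise ValueError(f"column not found: {name}")
        if match.dtype in UNSUPPORTED:
            # The column was named explicitly, so failing loudly still makes sense.
            raise ValueError(f"stats not supported for column: {name}")
        resolved.extend(leaf_paths(match))
    return resolved

# Example table: one struct with mixed types, plus a top-level binary column.
schema = [
    Field("id", "long"),
    Field("data", "struct", [
        Field("ts", "timestamp"),
        Field("tags", "array"),
        Field("attrs", "map"),
        Field("inner", "struct", [Field("x", "int"), Field("blob", "binary")]),
    ]),
    Field("payload", "binary"),
]

print(resolve_stats_columns(schema, ["data"]))  # ['data.ts', 'data.inner.x']
```

With this shape, asking for `data` yields only the supported leaves, while asking for `payload` directly still raises, which keeps the explicit-column case strict.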

kamcheungting-db commented 1 month ago

This is not nice behavior. If we don't throw an error, then we implicitly swallow the unsupported-type exception.