delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Kernel][Data skipping] Support data skipping for <=> and NOT(<=>) using `IS NOT DISTINCT FROM` expression #2538

Open allisonport-db opened 10 months ago

allisonport-db commented 10 months ago

Feature request

Which Delta project/connector is this regarding?

Overview

Use file statistics to prune files based on the `IS NOT DISTINCT FROM` expression.

Motivation

Better file pruning.

Further details

This means we should:
1) Add NullSafeEquals to the Kernel Predicate and support it in the kernel-defaults project.
2) Generate a data skipping filter according to the same rules we use in delta-spark.

Note: In Spark, the NullPropagation rule transforms a <=> Null into IsNull(a) and Not(a <=> Null) into IsNotNull(a); we can do the same.
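To make the intended pruning behaviour concrete, here is a minimal, language-neutral sketch in Python. It is not the Kernel or delta-spark implementation: `ColumnStats` and both `may_match_*` helpers are hypothetical names, and the rules assume per-file min, max, null-count and record-count statistics are available for the column. A file is skipped only when the function returns `False`; when statistics are missing the sketch conservatively keeps the file.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ColumnStats:
    """Hypothetical per-file statistics for a single column."""
    min_value: Optional[Any]   # None when the file has no non-null values
    max_value: Optional[Any]
    null_count: int
    num_records: int


def null_safe_equals(a: Any, b: Any) -> bool:
    """Row-level semantics of a <=> b: always True or False, never NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def may_match_null_safe_equals(stats: ColumnStats, literal: Any) -> bool:
    """Can any row in the file satisfy `col <=> literal`?"""
    if literal is None:
        # col <=> NULL behaves like IsNull(col).
        return stats.null_count > 0
    if stats.null_count >= stats.num_records:
        # Every row is null, so none can equal a non-null literal.
        return False
    if stats.min_value is None or stats.max_value is None:
        return True  # stats missing: cannot prune
    # Behaves like IsNotNull(col) AND col = literal.
    return stats.min_value <= literal <= stats.max_value


def may_match_not_null_safe_equals(stats: ColumnStats, literal: Any) -> bool:
    """Can any row in the file satisfy `NOT(col <=> literal)`?"""
    if literal is None:
        # NOT(col <=> NULL) behaves like IsNotNull(col).
        return stats.null_count < stats.num_records
    if stats.null_count > 0:
        # A null row is distinct from any non-null literal.
        return True
    if stats.min_value is None or stats.max_value is None:
        return True  # stats missing: cannot prune
    # Skippable only if every row equals the literal (min == literal == max).
    return not (stats.min_value == literal and stats.max_value == literal)
```

For example, a file whose column holds only the value 5 (min = max = 5, no nulls) can be skipped for `NOT(col <=> 5)`, while a file that contains any null cannot, because a null row is distinct from 5.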

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

zzl-7 commented 8 months ago

If it's OK, I would like to work on this :) I should have NullSafeEquals ready in a couple of days; I will need to do a bit of digging on the data skipping filter.

zzl-7 commented 7 months ago

Hi @allisonport-db, I added support for <=> in PR https://github.com/delta-io/delta/pull/2830. If you have time, could you take a look? My next step is to look at the data skipping filter in another PR. Thank you!