delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Feature Request][Kernel] Implement file skipping using file stats #2229

Closed vkorukanti closed 5 months ago

vkorukanti commented 8 months ago

Feature request

Overview

Currently, Kernel supports only partition pruning for a given predicate when reading Delta tables. This issue tracks adding support for file skipping using the per-file statistics stored in the Delta log.

Motivation

File skipping improves the performance of read queries by not reading files that cannot possibly contain records satisfying the given query predicate.
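To illustrate the idea, here is a minimal Python sketch of min/max-based file skipping. It is not the Kernel implementation or its API: the `FileStats` class and the `can_skip` and `files_to_read` helpers are hypothetical names made up for this example, and only a single-column comparison against a literal is handled. The stats fields mirror the per-file statistics recorded in the Delta log (`numRecords`, `minValues`, `maxValues`, `nullCount`). A file is skipped only when its stats prove that no row can match; when stats are missing, the file is kept.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FileStats:
    """Hypothetical container for the per-file stats of one data file."""
    path: str
    num_records: int
    min_values: Dict[str, Any] = field(default_factory=dict)
    max_values: Dict[str, Any] = field(default_factory=dict)
    null_count: Dict[str, int] = field(default_factory=dict)


def can_skip(stats: FileStats, column: str, op: str, value: Any) -> bool:
    """Return True only if the stats prove no row satisfies `column op value`."""
    lo = stats.min_values.get(column)
    hi = stats.max_values.get(column)
    if lo is None or hi is None:
        # No min/max for this column: nothing can be proven, so keep the file.
        return False
    if op == "=":
        return value < lo or value > hi
    if op == "<":
        return lo >= value
    if op == "<=":
        return lo > value
    if op == ">":
        return hi <= value
    if op == ">=":
        return hi < value
    raise ValueError(f"unsupported operator: {op}")


def files_to_read(files: List[FileStats], column: str, op: str, value: Any) -> List[str]:
    """Return the paths of files that may contain matching rows."""
    return [f.path for f in files if not can_skip(f, column, op, value)]
```

For example, with one file covering `id` in [1, 50], a second covering [51, 100], and a third with no stats, the predicate `id > 60` keeps only the second and third files: the first is provably out of range, while the third cannot be ruled out.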

Further details

Design doc (including project plan and task list): https://docs.google.com/document/d/1cgB002DQcxio4nGOrUIQwA11A5Uve6ryZc2p6sw4y9Q/edit?usp=sharing

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

zzl-7 commented 6 months ago

Can I ask if part of this task is open to public contribution?

allisonport-db commented 5 months ago

Initial support has been merged; follow-up issues will be filed for the remaining work.