delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

Support for Dynamic File Pruning in opensource DeltaLake #1323

Closed lkrishna-cs closed 2 years ago

lkrishna-cs commented 2 years ago

Feature request

Overview

We have many use cases that require fast query access to Delta Lake, and read performance does not seem to scale well for them. Going by the Databricks blog post below, the dynamic file pruning feature looks like what we want to use. However, it isn't available in open-source Delta Lake.

https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-pruning.html

Motivation

This feature would let us rely on native Delta Lake capabilities for low-latency reads instead of having to look at other options.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

tdas commented 2 years ago

This is unfortunately an Apache Spark limitation, not a Delta Lake limitation. Apache Spark currently supports only Dynamic Partition Pruning, not Dynamic File Pruning. This kind of query optimization is the responsibility of the processing engine, not the data format.
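To make the distinction concrete, here is a minimal, engine-agnostic Python sketch of the idea behind file-level pruning. It is not Spark or Delta Lake code: `FileStats`, `prune_files`, and the file names are hypothetical, and it assumes each data file carries min/max statistics for the join column. Partition pruning can only discard whole partitions, whereas the engine-side step sketched here takes the join keys collected at runtime from the filtered side of the join and skips individual files whose value range cannot contain any of them.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file min/max statistics for the join column."""
    path: str
    min_key: int
    max_key: int


def prune_files(files: Iterable[FileStats], join_keys: Iterable[int]) -> List[str]:
    """Return paths of files whose [min_key, max_key] range holds at least one join key."""
    keys = sorted(set(join_keys))
    kept = []
    for f in files:
        # Find the smallest join key that is >= the file's minimum value.
        i = bisect_left(keys, f.min_key)
        # The file can match only if that key is also <= the file's maximum value.
        if i < len(keys) and keys[i] <= f.max_key:
            kept.append(f.path)
    return kept


# Illustrative data: three files of a fact table with non-overlapping key ranges.
files = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# Join keys that survive the filter on the other side of the join, known only at runtime.
dim_keys = [150, 155, 250]

selected = prune_files(files, dim_keys)
print(selected)
```

In this example only `part-1.parquet` and `part-2.parquet` would need to be read. The point of the sketch is that the decision depends on values known only while the query runs, which is why it has to live in the query engine's planner or executor rather than in the table format.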

akshaythakur1112 commented 1 year ago

@tdas in that case, this issue should be filed against Spark, if I am not wrong. I also feel that this is an important enhancement in terms of performance. Please correct me if I am wrong. Since Databricks has this, is it still proprietary to them today?