Open rishitesh-snt opened 10 months ago
Someone please correct me if I am mistaken, but it looks like there is a plan to do this via this issue: https://github.com/delta-io/delta/issues/2229
@zzl-7 This issue is specifically about reading from Spark, in line with the PR https://github.com/delta-io/delta/pull/1525. It is just one of the data skipping mechanisms and could be included in the kernel as well, so that connectors can benefit from it. IMO https://github.com/delta-io/delta/issues/2229 deals with the overall design of the data skipping mechanism for the kernel. Happy to get more feedback on this so that it can proceed in the right direction.
Feature request
Which Delta project/connector is this regarding?
Overview
A very typical use case during exploratory analysis is to check the latest records with some limit, mostly to understand data patterns and behaviour, e.g.
select * from table order by timestamp desc limit 10
Normally, Spark would read all the files to get the top 10 records. However, if the timestamp column forms mostly disjoint ranges across files, we can use just the per-file min/max and number of records to determine which files hold the top 10 records.
Even when the ranges are not disjoint, we can improve performance by reading only a subset of the files, at most as many as the number specified in the limit. In the above example, that would be 10 files.
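The pruning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Delta's or Spark's implementation: `FileStats` and `files_for_latest` are hypothetical names, the statistics are assumed to be exact, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    # Hypothetical stand-in for the per-file min/max/record-count statistics.
    path: str
    min_ts: int
    max_ts: int
    num_records: int


def files_for_latest(files, limit):
    """Files that may hold the `limit` largest timestamps (ORDER BY ts DESC)."""
    if limit <= 0:
        return []
    # Walk files from the highest minimum down until they hold `limit` rows.
    # Every row in those files is >= threshold, so at least `limit` rows are.
    threshold = None
    covered = 0
    for f in sorted(files, key=lambda f: f.min_ts, reverse=True):
        covered += f.num_records
        threshold = f.min_ts
        if covered >= limit:
            break
    if covered < limit:
        return list(files)  # fewer rows than the limit: everything is needed
    # A file whose maximum is below the threshold cannot contribute a row.
    return [f for f in files if f.max_ts >= threshold]


# Made-up statistics for three files with disjoint timestamp ranges.
disjoint = [
    FileStats("file1", 100, 199, 1000),
    FileStats("file2", 200, 299, 1000),
    FileStats("file3", 300, 399, 1000),
]
# Same files, but file1 now overlaps the newest range.
overlapping = [FileStats("file1", 100, 350, 1000)] + disjoint[1:]

latest_disjoint = [f.path for f in files_for_latest(disjoint, 10)]
latest_overlapping = [f.path for f in files_for_latest(overlapping, 10)]
```

With disjoint ranges only the newest file survives. With overlap, any file whose maximum reaches into the newest range has to be read as well.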
Motivation
Sorting the whole table can take several minutes for tables of 500 GB or more. Reading the metadata would give this information in seconds.
Further details
An example on disjoint sets
Query : select * from table order by timestamp desc limit 10
With the query being
select * from table order by timestamp desc limit 10
right now we need to read all the files. However, if we can make use of the metadata, we only need to read file number 3.
An example on non-disjoint sets
Query : select * from table order by timestamp asc limit 10
When working with non-disjoint sets of files, we can follow the algorithm below:
The same principle can be applied even after a partition filter is applied.
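The algorithm steps themselves are not reproduced in this text, so the following is one possible reading rather than the original proposal. It is a sketch of the ascending, non-disjoint case: apply the partition filter first, keep the `limit` files with the smallest minimum, then take the final top N from the rows of those files only. `FileEntry`, `earliest_rows`, and the sample data are hypothetical, and each file's `min_ts` is assumed to be the exact minimum of a non-empty file.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    # Hypothetical model of one data file: partition value, min statistic, rows.
    path: str
    partition: str
    min_ts: int
    rows: tuple  # timestamps in the file; stands in for the Parquet contents


def earliest_rows(files, limit, partition=None):
    """ORDER BY ts ASC LIMIT `limit`, reading at most `limit` files."""
    # Partition pruning first, then the same min/max reasoning on what remains.
    live = [f for f in files if partition is None or f.partition == partition]
    # The k smallest rows always fit in the k files with the smallest minimums:
    # a row in any other file has at least k smaller-or-equal rows ahead of it.
    candidates = heapq.nsmallest(limit, live, key=lambda f: f.min_ts)
    read = [f.path for f in candidates]
    # Only the candidate files are "read"; the final sort is over their rows.
    rows = heapq.nsmallest(limit, (ts for f in candidates for ts in f.rows))
    return rows, read


# Made-up files with overlapping timestamp ranges across two partitions.
files = [
    FileEntry("a", "2024-01", 1, (1, 50, 90)),
    FileEntry("b", "2024-01", 5, (5, 6, 70)),
    FileEntry("c", "2024-01", 40, (40, 41)),
    FileEntry("d", "2024-02", 2, (2, 3)),
]

all_rows, all_read = earliest_rows(files, 2)
jan_rows, jan_read = earliest_rows(files, 2, partition="2024-01")
```

The bound on the number of files holds however much the ranges overlap. When record counts are also available, the candidate set can often be narrowed further.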
Limitation: it would be applicable only in the case of a single order by clause.
Even though it applies to a very limited set of queries, the frequency of such queries is very high.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?