delta-io / delta-kernel-rs

A native Delta implementation for integration with any query engine
Apache License 2.0

Fail to skip files based on partitions alone #263

Open samansmink opened 5 months ago

samansmink commented 5 months ago

In the duckdb delta extension I'm not seeing file skipping based on partitions.

First of all, file skipping based on stats does seem to work correctly. For example, in the following case kernel correctly skips files based on the predicate, so only one of the two files is passed to duckdb for scanning:

FROM delta_scan('${DAT_PATH}/out/reader_tests/generated/basic_append/delta')
WHERE number > 4
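
For reference, the stats-based skipping that works here can be sketched roughly as follows. This is only an illustration of min/max pruning for a `number > 4` style predicate, not the kernel's actual code: the `FileStats` struct, both function names, and the statistics values used with them are hypothetical.

```rust
/// Hypothetical per-file column statistics (illustrative only, not a kernel type).
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    path: String,
    min: i64,
    max: i64,
}

/// Returns true when the file may contain rows satisfying `column > threshold`.
/// The file can be ruled out only when its maximum value is <= threshold.
fn may_match_greater_than(stats: &FileStats, threshold: i64) -> bool {
    stats.max > threshold
}

/// Keeps the paths of all files that the predicate cannot rule out.
fn files_to_scan(files: &[FileStats], threshold: i64) -> Vec<&str> {
    files
        .iter()
        .filter(|f| may_match_greater_than(f, threshold))
        .map(|f| f.path.as_str())
        .collect()
}
```

With two files whose (made-up) ranges are 1..=3 and 4..=6, `files_to_scan(&files, 4)` keeps only the second file, because the first file's maximum is not greater than 4.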

I have added some test data to duckdb delta that aims to test file skipping for all types that we can currently push down. To do so, I generate a few tables in the format /generated/test_file_skipping/{type}/delta_lake. See the line generating these tables here.

What I would expect is to be able to skip files in this table using:

FROM delta_scan('./data/generated/test_file_skipping/bigint/delta_lake')
WHERE part=0

However, when I instrument DuckDB to print the files kernel is passing me, I can see that even though the filter is pushed down, both files are passed:

 Pushing down filter part = 0
 Scanning path file:///Users/sam/Development/delta-kernel-testing/data/generated/test_file_skipping/bigint/delta_lake/part=0/0-00900a4a-99cf-4d43-993c-41950d6ed025-0.parquet
 Scanning path file:///Users/sam/Development/delta-kernel-testing/data/generated/test_file_skipping/bigint/delta_lake/part=1/0-00900a4a-99cf-4d43-993c-41950d6ed025-0.parquet

hntd187 commented 5 months ago

Hey Sam, I'm currently working on this. Right now data skipping doesn't take Hive-style partition paths like this into account. I have to upstream a few expression changes so that this is also compatible with delta-rs, but just so you're aware, it's on my radar.
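
To make the expected behavior concrete, here is a minimal sketch of partition-only pruning for a filter like `part = 0`. It is an assumption-laden illustration, not the kernel's implementation: the helper names `partition_value` and `prune_by_partition` are hypothetical, and the value is read from the Hive-style `column=value` path segment purely because that is what the log output above shows. In a Delta table the partition values are recorded in the `partitionValues` map of each add action in the log, so a real implementation would more likely evaluate the predicate against that map.

```rust
/// Extracts the value of `column` from Hive-style `column=value` path segments.
/// Returns None when no segment names that column.
fn partition_value<'a>(path: &'a str, column: &str) -> Option<&'a str> {
    path.split('/').find_map(|segment| {
        let (key, value) = segment.split_once('=')?;
        (key == column).then_some(value)
    })
}

/// Keeps only the paths whose partition value equals `wanted`.
/// Paths without a value for `column` are kept, since they cannot be ruled out.
fn prune_by_partition<'a>(paths: &[&'a str], column: &str, wanted: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|p| match partition_value(p, column) {
            Some(value) => value == wanted,
            None => true,
        })
        .collect()
}
```

Given one path under `part=0/` and one under `part=1/`, `prune_by_partition(&paths, "part", "0")` returns only the first, which is the single file I would expect to be handed to the engine for the query above. Comparing the values as strings is a simplification; typed comparison would be needed for the other column types in the test tables.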