Open samansmink opened 5 months ago
Hey Sam, I'm currently working on this. Right now data skipping doesn't take Hive-style partition paths like this into account. I have to upstream a few expression changes for this to also be compatible in delta-rs, but just so you're aware, it's on my radar.
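To make "taking Hive-style partition paths into account" concrete, here is a minimal sketch of partition-based file pruning. It is not how kernel or delta-rs implement it; the function names, the `part` column, and the file paths are all hypothetical, and it only handles equality predicates on string values. A real Delta reader would more likely use the `partitionValues` recorded in the transaction log than re-parse paths.

```python
from urllib.parse import unquote


def parse_partition_values(path: str) -> dict[str, str]:
    """Extract key=value directory segments from a Hive-style path (hypothetical helper)."""
    values = {}
    # Skip the last segment: it is the data file name, not a partition directory.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            # Partition values are percent-encoded in directory names.
            values[key] = unquote(value)
    return values


def prune_files(paths: list[str], column: str, wanted: str) -> list[str]:
    """Keep only files that could contain rows where column == wanted."""
    kept = []
    for path in paths:
        values = parse_partition_values(path)
        # If the column is not a partition column we cannot decide, so keep the file.
        if column not in values or values[column] == wanted:
            kept.append(path)
    return kept


# Hypothetical two-file table partitioned on a column named "part".
files = ["part=0/file-a.parquet", "part=1/file-b.parquet"]
print(prune_files(files, "part", "1"))
```

With pruning like this in place, a filter on the partition column would hand only the matching file to the scanner, which is the behaviour the issue expects.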
In the duckdb delta extension I'm not seeing file skipping based on partitions.
First of all, file skipping based on stats does seem to work correctly. For example, in the following case kernel correctly skips files based on the predicate, so only one of the two files is passed to duckdb for scanning:
Now I have added some test data in duckdb delta which aims to test file skipping for all types that we can push down now. To do so I generate a few tables in the format /generated/test_file_skipping/{type}/delta_lake. See the line generating these tables here.
Now what I would expect is to be able to skip files in this table using:
However, when I instrument DuckDB to print the files kernel is passing me, I can see that even though the filter is pushed down, both files are passed: