duckdb / duckdb_delta

DuckDB extension for Delta Lake

Shouldn't stop just because a step returned no files #27

Closed nicklan closed 3 weeks ago

nicklan commented 4 weeks ago

This was fun :). This fixes https://github.com/delta-incubator/delta-kernel-rs/issues/233

Basically, if you push down a predicate, you can have a situation where a batch that does include an Add file doesn't actually return any files to scan, because they are all filtered out. The kernel can't know this for sure ahead of time, because we don't introspect the data until the engine asks us to extract it. So in the case of running:

SELECT letter, number
FROM delta_scan('${DAT_PATH}/out/reader_tests/generated/basic_append/delta')
WHERE number < 2

the first batch included one file, but the predicate filtered it out, so nothing actually came out. That made resolved_files.size() == size_before true, so duckdb just stopped looking for more files. But there is one more file to scan, the one with the data we want! :)

The simple fix is to keep iterating until the kernel tells you there's definitely no more data.

There's a chance the kernel could optimize further and not return that first batch at all, but in general I think engines should keep iterating until scan_data_next returns false.
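
To make the difference between the two stopping conditions concrete, here is a rough Python model of the loop. It is not the extension's actual C++ code or the kernel's API. `make_scan`, `resolve_files_buggy`, `resolve_files_fixed`, and the file name are all made up for illustration, and returning `None` stands in for scan_data_next returning false.

```python
from typing import Callable, Iterator, List, Optional

# Hypothetical stand-in for one kernel batch: the data files that survive
# predicate filtering. An empty list means every file in the batch was pruned.
Batch = List[str]


def make_scan(batches: List[Batch]) -> Callable[[], Optional[Batch]]:
    """Build a hypothetical scan_next(): one batch per call, None once exhausted."""
    it: Iterator[Batch] = iter(batches)
    return lambda: next(it, None)


def resolve_files_buggy(scan_next: Callable[[], Optional[Batch]]) -> List[str]:
    """Old behaviour: stop as soon as a step adds no new files."""
    resolved: List[str] = []
    while True:
        size_before = len(resolved)
        batch = scan_next()
        if batch is not None:
            resolved.extend(batch)
        # An empty (fully filtered) batch looks the same as "no more data" here.
        if len(resolved) == size_before:
            break
    return resolved


def resolve_files_fixed(scan_next: Callable[[], Optional[Batch]]) -> List[str]:
    """Fixed behaviour: keep going until the scan itself reports exhaustion."""
    resolved: List[str] = []
    while (batch := scan_next()) is not None:
        resolved.extend(batch)  # an empty batch is fine, just keep iterating
    return resolved


# First batch is filtered down to nothing; the second holds the file we want.
# "part-00001.parquet" is a made-up name, not a file from the test table.
batches: List[Batch] = [[], ["part-00001.parquet"]]
buggy = resolve_files_buggy(make_scan(batches))
fixed = resolve_files_fixed(make_scan(batches))
print("buggy:", buggy)
print("fixed:", fixed)
```

In this model the buggy loop exits after the first empty batch and never sees the second one, while the fixed loop only exits on the explicit end-of-scan signal.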

samansmink commented 3 weeks ago

thanks, @nicklan!