Basically, if you push down a predicate, a batch that does contain an Add file can still yield no files to scan, because every file in it is filtered out. The kernel can't know this in advance because we don't introspect the data until the engine asks us to extract it. So in the case of running:
SELECT letter, number
FROM delta_scan('${DAT_PATH}/out/reader_tests/generated/basic_append/delta')
WHERE number < 2
the first batch included one file, but it was filtered out by the predicate, so nothing actually came out. resolved_files.size() == size_before was therefore true, and duckdb just stopped looking for more files. But there is one more file to scan, the one with the data we want! :)
The simple fix is to keep iterating until the kernel tells you there is definitely no more data.
There's a chance the kernel could optimize further and not return the first batch at all, but in general I think engines should assume they need to keep iterating until scan_data_next returns false.
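To make the difference concrete, here is a minimal self-contained sketch of the two stopping conditions. It does not use the real kernel FFI or DuckDB types: `MockScanIterator`, `ExpandUntilNoGrowth`, `ExpandUntilExhausted`, `ExampleScan`, and the file name are all hypothetical stand-ins, and `Next` only imitates the "returns false when exhausted" contract described above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the kernel's scan-data iterator (not the real FFI).
// Each batch lists the files that survive the pushed-down predicate, so a
// batch can legitimately be empty.
struct MockScanIterator {
    std::vector<std::vector<std::string>> batches;
    std::size_t pos = 0;

    // Returns false only when the iterator is exhausted, never merely
    // because the current batch contributed no files.
    bool Next(std::vector<std::string> &resolved_files) {
        if (pos >= batches.size()) {
            return false;
        }
        for (const auto &file : batches[pos]) {
            resolved_files.push_back(file);
        }
        ++pos;
        return true;
    }
};

// Old stopping condition: give up as soon as a batch adds no files.
std::vector<std::string> ExpandUntilNoGrowth(MockScanIterator it) {
    std::vector<std::string> resolved_files;
    while (true) {
        const std::size_t size_before = resolved_files.size();
        if (!it.Next(resolved_files) || resolved_files.size() == size_before) {
            break;
        }
    }
    return resolved_files;
}

// Fixed stopping condition: iterate until the iterator reports exhaustion.
std::vector<std::string> ExpandUntilExhausted(MockScanIterator it) {
    std::vector<std::string> resolved_files;
    while (it.Next(resolved_files)) {
        // Nothing to do here; Next appends any surviving files.
    }
    return resolved_files;
}

// Two batches, as in the query above: the first is fully filtered out,
// the second holds the file we want (placeholder name).
MockScanIterator ExampleScan() {
    MockScanIterator it;
    it.batches.push_back(std::vector<std::string>{});
    it.batches.push_back(std::vector<std::string>{"second-file.parquet"});
    return it;
}
```

With `ExampleScan()`, the old condition stops after the empty first batch and returns no files, while the fixed loop reaches the second batch and returns the one file that matches.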
This was fun :). This fixes https://github.com/delta-incubator/delta-kernel-rs/issues/233