Okay, I removed the row filtering logic in the readers, and the predicate pushdowns are now marked as Inexact; should be good to go.
This seems good to me as a starting point for predicate pushdowns. Sometime this week I hope to play around with some of my own data, which should yield some insights into how this works for that use case.
Great, I'm looking forward to running all this against real data. I'm sure we'll find some issues, or at least things to improve, that would be hard to figure out with only the dummy test data I've been generating. In theory, I think most zarr features are supported with what's there; the main one I'm not supporting yet is missing chunks + fill values.
Okay, so the relatively small amount of new code here doesn't reflect how long I spent on this haha; it's admittedly more complicated than I initially thought it would be. But I think this works in its current state.
It's a WIP because I want to revisit how I applied filter pushdowns in the reader. When I looked at arrow-rs, specifically at the parquet implementation, my understanding was that if you take in a predicate, you need to produce a record batch that completely satisfies it. However, looking into how it works in DataFusion, it seems that's not the case: you can mark a predicate as "inexact", in which case DataFusion takes the record batches that were only partially filtered and removes any remaining rows that don't satisfy the predicate.
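For anyone following along, the DataFusion hook for this is `TableProvider::supports_filters_pushdown`. Below is a fragment, not the actual code from this PR: `ZarrTableProvider` is a hypothetical name, the other required trait methods are elided, and the exact signature has shifted a bit across DataFusion versions.

```rust
use datafusion::datasource::TableProvider;
use datafusion::error::Result;
use datafusion::logical_expr::{Expr, TableProviderFilterPushDown};

impl TableProvider for ZarrTableProvider {
    // ...schema(), scan(), table_type(), etc. elided...

    fn supports_filters_pushdown(
        &self,
        filters: &[&Expr],
    ) -> Result<Vec<TableProviderFilterPushDown>> {
        // Claim every filter as Inexact: the reader uses them to prune
        // whole chunks, and DataFusion keeps a FilterExec on top that
        // re-applies them exactly, row by row.
        Ok(vec![TableProviderFilterPushDown::Inexact; filters.len()])
    }
}
```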
For parquet files, specifically when the data is not compressed, there's value in trying to skip rows while reading, but we can't do that for (compressed) zarr data, so my somewhat complicated "row filtering" implementation for zarr doesn't add any value. I want to simplify it and focus on what does add value: skipping whole chunks when no values in the chunk satisfy the predicate. I will then leave the "exact" filtering to DataFusion, which will make things much cleaner.
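As a rough sketch of the chunk-skipping idea (all names and data here are illustrative, not the reader code itself): decode only the predicate columns for a chunk, evaluate the predicate into a boolean mask, and drop the chunk outright when nothing matches.

```rust
use arrow::array::{BooleanArray, Float64Array};

fn main() {
    // Stand-in for one decoded predicate column from a zarr chunk.
    let lat = Float64Array::from(vec![10.0, 12.5, 9.8]);

    // Stand-in for evaluating the pushed-down predicate, e.g. lat > 15.0.
    let mask: BooleanArray = lat.iter().map(|v| v.map(|x| x > 15.0)).collect();

    // true_count() ignores nulls, and a null comparison result never
    // satisfies the predicate, so this is exactly the chunk-level test:
    // if no row matches, skip the chunk without decoding its remaining
    // (compressed) columns; otherwise emit the chunk unfiltered and let
    // DataFusion's FilterExec do the exact row-level filtering.
    if mask.true_count() == 0 {
        println!("skip this chunk");
    }
}
```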