Open andygrove opened 3 months ago
Related issue at arrow-rs: https://github.com/apache/arrow-rs/issues/3620
This paper may have useful information:
"Filter Representation in Vectorized Query Execution" https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf
What is the problem the feature request solves?
It is very common to have scan -> filter as inputs to a join. The copying of data in the filter can be expensive when the batch contains strings and complex types, and the result of the filter is discarded after the join.
I believe that it would be more efficient to have the join use a selection vector to read inputs from the scanned batch rather than perform a filter.
This issue is for tracking the work to create a small prototype to demonstrate. If succesful, then we can discuss making changes in upstream DataFusion to add support for a new
ColumnarValue::ArrayWithSelectionVector
and then add a specialization in SortMergeJoin to take advantage of this.Describe the potential solution
No response
Additional context
No response