apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
823 stars 163 forks source link

Evaluate use of selection vectors in scan-filter-join operations #745

Open andygrove opened 3 months ago

andygrove commented 3 months ago

What is the problem the feature request solves?

It is very common to have scan -> filter as inputs to a join. The copying of data in the filter can be expensive when the batch contains strings and complex types, and the result of the filter is discarded after the join.

I believe that it would be more efficient to have the join use a selection vector to read inputs from the scanned batch rather than perform a filter.

This issue is for tracking the work to create a small prototype to demonstrate. If succesful, then we can discuss making changes in upstream DataFusion to add support for a new ColumnarValue::ArrayWithSelectionVector and then add a specialization in SortMergeJoin to take advantage of this.

Describe the potential solution

No response

Additional context

No response

viirya commented 3 months ago

Related issue at arrow-rs: https://github.com/apache/arrow-rs/issues/3620

andygrove commented 3 months ago

This paper may have useful information:

"Filter Representation in Vectorized Query Execution" https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf