Open ZhangHuiGui opened 5 months ago
Hi, @felipecrv I'm working on this improvement recently and find there is an important issue that need to be discussed:
When we process arrays case by case, how to determine the offset of the selected-rows of the current case in the output array.
For example, we have 2 cases and 3 arrays:
case_when({cond1, cond2}, array1, array2, array3);
We use the bitmap of cas1's cond to get the selected-rows of array1, if we want to fill these rows into the output list-vew array in an unordered way directly, we need to know which offsets of the output-array the selected-rows of array1 should be filled into. However, each list-item of array1 and array2 may have different sizes, so the offset of the list-item when we fill it into the output-array in an unordered way cannot be determined.
Is there a more efficient way to solve this problem?
You can't implement this optimization before you have special forms in the expression system and selection vectors being passed to kernels that can take advantage of it:
https://github.com/apache/arrow/issues/41094#issuecomment-2087716483
Expanding. Let's say we have the following relational query formula that produces a list-view typed column from list-views a
, b
, and c
:
CASE
WHEN list_value_length(a) > 1 THEN list_slice(a, 1)
WHEN list_value_length(b) > 1 THEN list_slice(b, 1)
ELSE coalesce(c, [0])
END AS resulting_list
In a LISP-like intermediate representation, this can be written as:
(cond
(> (list_value_length a) 1) (list_slice a 1)
(> (list_value_length b) 1) (list_slice b 1)
else (coalesce c [0]))
Due to how expressions are currently represented in arrow::compute
we are forced to eagerly eval all the conditions and branches. Even when the condition is false for that branch. We need cond
to be a special-form [1] meaning that all expressions are passed "by name" [2].
The evaluation of the expressions for this example should happen in this order:
auto output = pre_allocate_list_view(batch.length);
// builds an array with the lengths of a
auto length_a = list_value_length(a);
// builds the first pair of sel. vectors (for the first condition)
auto sel0_true, sel0_false = greater_than(length_a, 1);
// evaluate the first expression only on positions in sel0_true.
// since `output` is a list-view, all these random positions in the
// output can be written now before the other branches write to
// their positions.
// (we pass sel0_true to the list_slice kernel as an optional parameter)
list_slice[sel0_true](a, 1, &output);
// builds an array with the lengths of b
// (note that now we pass the sel0_false selection vector because
// we only have to run `list_value_length` on the positions where the
// previous conditions were false)
auto length_b = list_value_length[sel0_false](b);
// builds the next pair of selection vectors
// (note that greater_than[sel0_false] only checks the relevant positions,
// consequently, sel1_true/false are going to be smaller than the first
// pair of selection vectors)
auto sel1_true, sel1_false = greater_than[sel0_false](length_a, 1);
// evaluate the second branch's expression. none of the positions
// in sel1_true were present in sel0_true, so we're only writing on
// values that weren't written before
list_slice[sel1_true](b, 1, &output);
// now evaluate the else case with the last negative selection vector
coalesce[sel1_false](c, [0])
As you can see, it's not about optimize the case_when
or if_then_else
functions, but it's about optimizing kernels that output list-views to consider an optional selection vector parameter passed by the evaluator when it's eval'ing a special form expression.
[1] https://github.com/apache/arrow/issues/41094#issuecomment-2087716483 [2] Call by name (easily and often confused as "lazy evaluation")
Describe the enhancement requested
Suggested at https://github.com/apache/arrow/pull/41419#discussion_r1583854772.
The list-view types should have their own specialization of
"case_when"
.Since list-views can have values appended to the child values array in any order with offset/size pairs being adjusted accordingly, cases can be processed one after the other.
It might be faster to source data case by case instead of by each item.
Component(s)
C++