apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.34k stars 3.48k forks source link

[C++] A more efficient "case_when" specialization for list-view types #41453

Open ZhangHuiGui opened 5 months ago

ZhangHuiGui commented 5 months ago

Describe the enhancement requested

Suggested at https://github.com/apache/arrow/pull/41419#discussion_r1583854772.

The list-view types should have their own specialization of "case_when".

Since list-views can have values appended to the child values array in any order with offset/size pairs being adjusted accordingly, cases can be processed one after the other.

It might be faster to source data case by case instead of by each item.

Component(s)

C++

ZhangHuiGui commented 3 months ago

Hi, @felipecrv I'm working on this improvement recently and find there is an important issue that need to be discussed:

When we process arrays case by case, how to determine the offset of the selected-rows of the current case in the output array.

For example, we have 2 cases and 3 arrays:

case_when({cond1, cond2}, array1, array2, array3);

We use the bitmap of cas1's cond to get the selected-rows of array1, if we want to fill these rows into the output list-vew array in an unordered way directly, we need to know which offsets of the output-array the selected-rows of array1 should be filled into. However, each list-item of array1 and array2 may have different sizes, so the offset of the list-item when we fill it into the output-array in an unordered way cannot be determined.

Is there a more efficient way to solve this problem?

felipecrv commented 3 months ago

You can't implement this optimization before you have special forms in the expression system and selection vectors being passed to kernels that can take advantage of it:

https://github.com/apache/arrow/issues/41094#issuecomment-2087716483

felipecrv commented 3 months ago

Expanding. Let's say we have the following relational query formula that produces a list-view typed column from list-views a, b, and c:

CASE
    WHEN list_value_length(a) > 1 THEN list_slice(a, 1)
    WHEN list_value_length(b) > 1 THEN list_slice(b, 1)
    ELSE coalesce(c, [0])
END AS resulting_list

In a LISP-like intermediate representation, this can be written as:

(cond
  (> (list_value_length a) 1) (list_slice a 1)
  (> (list_value_length b) 1) (list_slice b 1)
  else     (coalesce c [0]))

Due to how expressions are currently represented in arrow::compute we are forced to eagerly eval all the conditions and branches. Even when the condition is false for that branch. We need cond to be a special-form [1] meaning that all expressions are passed "by name" [2].

The evaluation of the expressions for this example should happen in this order:

auto output = pre_allocate_list_view(batch.length);

// builds an array with the lengths of a
auto length_a = list_value_length(a);  
// builds the first pair of sel. vectors (for the first condition)
auto sel0_true, sel0_false = greater_than(length_a, 1);

// evaluate the first expression only on positions in sel0_true.
// since `output` is a list-view, all these random positions in the
// output can be written now before the other branches write to
// their positions.
// (we pass sel0_true to the list_slice kernel as an optional parameter)
list_slice[sel0_true](a, 1, &output);

// builds an array with the lengths of b
// (note that now we pass the sel0_false selection vector because
// we only have to run `list_value_length` on the positions where the
// previous conditions were false)
auto length_b = list_value_length[sel0_false](b);
// builds the next pair of selection vectors
// (note that greater_than[sel0_false] only checks the relevant positions,
// consequently, sel1_true/false are going to be smaller than the first
// pair of selection vectors)
auto sel1_true, sel1_false = greater_than[sel0_false](length_a, 1);

// evaluate the second branch's expression. none of the positions
// in sel1_true were present in sel0_true, so we're only writing on
// values that weren't written before
list_slice[sel1_true](b, 1, &output);

// now evaluate the else case with the last negative selection vector
coalesce[sel1_false](c, [0])

As you can see, it's not about optimize the case_when or if_then_else functions, but it's about optimizing kernels that output list-views to consider an optional selection vector parameter passed by the evaluator when it's eval'ing a special form expression.

[1] https://github.com/apache/arrow/issues/41094#issuecomment-2087716483 [2] Call by name (easily and often confused as "lazy evaluation")