apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Improve performance of ExecuteScalarExpression #31546

Open asfimport opened 2 years ago

asfimport commented 2 years ago

One of the things we want to be able to do in the streaming execution engine is process data in small, L2-cache-sized batches. Based on the literature, we might like to use batches somewhere in the range of 1k to 16k rows. In ARROW-16014 we created a benchmark to measure the performance of ExecuteScalarExpression as the size of our batches got smaller. There are two things we observed:
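For reference, the shape of such a benchmark might look like the following sketch (the expression, column name, and size range are illustrative, not the actual ARROW-16014 benchmark; the expression is bound outside the timed loop so only execution cost is measured):

```cpp
#include <benchmark/benchmark.h>

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/compute/exec/expression.h>  // arrow/compute/expression.h in newer releases

#include <cstdint>
#include <cstdlib>
#include <numeric>
#include <vector>

namespace cp = arrow::compute;

static void BM_ExecuteScalarExpression(benchmark::State& state) {
  const int64_t rows = state.range(0);

  // Build a single int64 column of `rows` values.
  std::vector<int64_t> data(rows);
  std::iota(data.begin(), data.end(), 0);
  arrow::Int64Builder builder;
  if (!builder.AppendValues(data).ok()) std::abort();
  std::shared_ptr<arrow::Array> values = builder.Finish().ValueOrDie();

  auto schema = arrow::schema({arrow::field("x", arrow::int64())});
  auto batch = arrow::RecordBatch::Make(schema, rows, {values});

  // Bind once, outside the timed loop, so only execution is measured.
  cp::Expression expr =
      cp::call("add", {cp::field_ref("x"), cp::literal(std::int64_t{1})});
  expr = expr.Bind(*schema).ValueOrDie();

  cp::ExecBatch input(*batch);
  for (auto _ : state) {
    arrow::Datum out = cp::ExecuteScalarExpression(expr, input).ValueOrDie();
    benchmark::DoNotOptimize(out);
  }
  state.SetItemsProcessed(state.iterations() * rows);
}
// Sweep the 1k-16k row range discussed above.
BENCHMARK(BM_ExecuteScalarExpression)->RangeMultiplier(2)->Range(1 << 10, 1 << 14);
```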

Reporter: Weston Pace / @westonpace

Subtasks:

Note: This issue was originally created as ARROW-16138. Please see the migration documentation for further details.

asfimport commented 2 years ago

Weston Pace / @westonpace: Some suggestions I have heard:

asfimport commented 2 years ago

David Li / @lidavidm: Have we profiled to see where the overhead is? (Though I suppose it may not matter, if we just want to get rid of it all.)

We may need to do some work to enable more kernels to be able to take advantage of preallocated buffers. Not all currently do and it's not necessarily clear which are which (so even if you could preallocate the output array in ExecuteScalarExpression, the kernel might discard it anyways).

For the first suggestion: what is "dispatch" referring to here? Resolving the kernel? I thought binding an expression also resolved the kernel, but I may be wrong.

asfimport commented 2 years ago

Weston Pace / @westonpace:

> Have we profiled to see where the overhead is? (Though I suppose it may not matter, if we just want to get rid of it all.)

No, but I do think profiling would be a good idea. Even if we find the bottleneck is in some "dispatch" phase that we can get rid of, it would be good to prove that before we start throwing solutions at it. Mostly I was jotting these ideas down before I forget them. @zagto is planning on looking into this further.

> We may need to do some work to enable more kernels to be able to take advantage of preallocated buffers. Not all currently do and it's not necessarily clear which are which (so even if you could preallocate the output array in ExecuteScalarExpression, the kernel might discard it anyways).

Good point. I think some kernels will never support preallocation, either. For example, if we are dealing with any variable-length arrays like strings, we won't necessarily know a "max buffer size" even if we know a "max batch size".
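As a minimal illustration of that point (the helpers below are hypothetical, not part of Arrow's kernel API): for a fixed-width type, a known maximum batch size bounds the output data buffer exactly, while for a variable-width type like utf8 only the offsets buffer is bounded:

```cpp
#include <arrow/api.h>
#include <cstdint>

// Hypothetical helper: for a fixed-width output like int64, the max batch
// size directly bounds the output data buffer, so it can be preallocated.
arrow::Result<std::unique_ptr<arrow::Buffer>> PreallocateInt64Output(
    int64_t max_batch_size) {
  return arrow::AllocateBuffer(max_batch_size *
                               static_cast<int64_t>(sizeof(int64_t)));
}

// For a utf8 output, only the offsets buffer (length + 1 int32 offsets) is
// bounded by the batch size; the values buffer depends on string lengths we
// cannot know up front, so a kernel producing strings cannot fully
// preallocate its output.
arrow::Result<std::unique_ptr<arrow::Buffer>> PreallocateUtf8Offsets(
    int64_t max_batch_size) {
  return arrow::AllocateBuffer((max_batch_size + 1) *
                               static_cast<int64_t>(sizeof(int32_t)));
}
```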

> For the first suggestion: what is "dispatch" referring to here? Resolving the kernel? I thought binding an expression also resolved the kernel, but I may be wrong.

The benchmark was running a bound expression. However, I will admit that I have almost no idea how this process works :). It's possible that there is nothing wrong with the dispatch mechanism itself and the overhead is in the individual kernel execution. We did try several different expressions in the benchmark.

asfimport commented 2 years ago

Tobias Zagorni / @zagto: The thread contention at small batch sizes is largely caused by copying/destructing shared pointers to DataType. Different threads constantly changing the refcount of the Int64 DataType seems to cause a lot of inter-core synchronization.

Flamegraph.png
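As a standalone illustration of that pattern (plain C++, not Arrow code): every copy and destruction of a shared_ptr performs an atomic update of the control block's refcount, so threads that all copy pointers to the same singleton DataType contend on one cache line:

```cpp
#include <memory>
#include <thread>
#include <vector>

// Sketch of the contention the flamegraph points at: each shared_ptr copy
// does an atomic refcount increment, and each destruction does the matching
// decrement, all on a cache line shared by every thread.
int main() {
  // Stand-in for the process-wide singleton returned by arrow::int64().
  auto shared_type = std::make_shared<int>(64);

  std::vector<std::thread> workers;
  for (int t = 0; t < 8; ++t) {
    workers.emplace_back([&shared_type] {
      for (int i = 0; i < 1000000; ++i) {
        // Copy: atomic increment on the shared refcount; the destructor at
        // the end of each iteration does the atomic decrement.
        std::shared_ptr<int> copy = shared_type;
        (void)*copy;
      }
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}
```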

asfimport commented 2 years ago

Weston Pace / @westonpace: Ah, I suppose that makes sense. Might be a bit of an interesting one to fix up. I'll create a sub-task to address this issue (maybe it will be the only issue, who knows).

asfimport commented 2 years ago

Weston Pace / @westonpace: I've created ARROW-16161 to discuss the shared_ptr copy overhead issue.
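The general direction for such a fix, sketched hypothetically here (this is not the actual change proposed in ARROW-16161), is to avoid refcount traffic on hot paths by passing by const reference or raw pointer and copying the shared_ptr only where ownership is needed:

```cpp
#include <memory>
#include <utility>

struct DataTypeLike { int id; };  // stand-in for arrow::DataType

// Hot path: inspecting a type needs no ownership, so take a const
// reference and touch no refcount. (The id value here is illustrative.)
inline bool IsInt64Like(const DataTypeLike& type) { return type.id == 64; }

// Cold path: storing a type does need ownership; take the shared_ptr by
// value and move it, paying for exactly one refcount bump at the call site.
struct FieldLike {
  std::shared_ptr<DataTypeLike> type;
  void SetType(std::shared_ptr<DataTypeLike> t) { type = std::move(t); }
};
```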

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked on. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked on, or if you plan to start that work soon.