apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Improve performance of ExecuteScalarExpression #31546

Open asfimport opened 2 years ago

asfimport commented 2 years ago

One of the things we want to be able to do in the streaming execution engine is process data in small, L2-cache-sized batches. Based on the literature, we might like to use batches somewhere in the range of 1k to 16k rows. In ARROW-16014 we created a benchmark to measure the performance of ExecuteScalarExpression as the size of our batches got smaller. There are two things we observed:
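For reference, the shape of such a benchmark might look like the following sketch (the expression, column name, and size range are illustrative, not the actual ARROW-16014 benchmark; the expression is bound outside the timed loop so only execution cost is measured):

```cpp
#include <benchmark/benchmark.h>

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/compute/exec/expression.h>  // arrow/compute/expression.h in newer releases

#include <cstdint>
#include <cstdlib>
#include <numeric>
#include <vector>

namespace cp = arrow::compute;

static void BM_ExecuteScalarExpression(benchmark::State& state) {
  const int64_t rows = state.range(0);

  // Build a single int64 column of `rows` values.
  std::vector<int64_t> data(rows);
  std::iota(data.begin(), data.end(), 0);
  arrow::Int64Builder builder;
  if (!builder.AppendValues(data).ok()) std::abort();
  std::shared_ptr<arrow::Array> values = builder.Finish().ValueOrDie();

  auto schema = arrow::schema({arrow::field("x", arrow::int64())});
  auto batch = arrow::RecordBatch::Make(schema, rows, {values});

  // Bind once, outside the timed loop, so only execution is measured.
  cp::Expression expr =
      cp::call("add", {cp::field_ref("x"), cp::literal(std::int64_t{1})});
  expr = expr.Bind(*schema).ValueOrDie();

  cp::ExecBatch input(*batch);
  for (auto _ : state) {
    arrow::Datum out = cp::ExecuteScalarExpression(expr, input).ValueOrDie();
    benchmark::DoNotOptimize(out);
  }
  state.SetItemsProcessed(state.iterations() * rows);
}
// Sweep the 1k-16k row range discussed above.
BENCHMARK(BM_ExecuteScalarExpression)->RangeMultiplier(2)->Range(1 << 10, 1 << 14);
```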

Reporter: Weston Pace / @westonpace

Subtasks:

Note: This issue was originally created as ARROW-16138. Please see the migration documentation for further details.

asfimport commented 2 years ago

Weston Pace / @westonpace: Some suggestions I have heard:

asfimport commented 2 years ago

David Li / @lidavidm: Have we profiled to see where the overhead is? (Though I suppose it may not matter, if we just want to get rid of it all.)

We may need to do some work to enable more kernels to be able to take advantage of preallocated buffers. Not all currently do and it's not necessarily clear which are which (so even if you could preallocate the output array in ExecuteScalarExpression, the kernel might discard it anyways).

For the first suggestion: what is "dispatch" referring to here? Resolving the kernel? I thought binding an expression also resolved the kernel, but I may be wrong.

asfimport commented 2 years ago

Weston Pace / @westonpace:

> Have we profiled to see where the overhead is? (Though I suppose it may not matter, if we just want to get rid of it all.)

No, but I do think profiling would be a good idea. Even if we find the bottleneck is in some "dispatch" phase that we can get rid of, it would be good to prove that before we start throwing solutions at it. Mostly I was jotting these ideas down before I forget them. @zagto is planning on looking into this further.

> We may need to do some work to enable more kernels to be able to take advantage of preallocated buffers. Not all currently do and it's not necessarily clear which are which (so even if you could preallocate the output array in ExecuteScalarExpression, the kernel might discard it anyways).

Good point. I think some kernels will never support preallocation, either. For example, if we are dealing with any variable-length arrays like strings, we won't necessarily know a "max buffer size" even if we know a "max batch size".
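As a minimal illustration of that point (the helpers below are hypothetical, not part of Arrow's kernel API): for a fixed-width type, a known maximum batch size bounds the output data buffer exactly, while for a variable-width type like utf8 only the offsets buffer is bounded:

```cpp
#include <arrow/api.h>
#include <cstdint>

// Hypothetical helper: for a fixed-width output like int64, the max batch
// size directly bounds the output data buffer, so it can be preallocated.
arrow::Result<std::unique_ptr<arrow::Buffer>> PreallocateInt64Output(
    int64_t max_batch_size) {
  return arrow::AllocateBuffer(max_batch_size *
                               static_cast<int64_t>(sizeof(int64_t)));
}

// For a utf8 output, only the offsets buffer (length + 1 int32 offsets) is
// bounded by the batch size; the values buffer depends on string lengths we
// cannot know up front, so a kernel producing strings cannot fully
// preallocate its output.
arrow::Result<std::unique_ptr<arrow::Buffer>> PreallocateUtf8Offsets(
    int64_t max_batch_size) {
  return arrow::AllocateBuffer((max_batch_size + 1) *
                               static_cast<int64_t>(sizeof(int32_t)));
}
```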

> For the first suggestion: what is "dispatch" referring to here? Resolving the kernel? I thought binding an expression also resolved the kernel, but I may be wrong.

The benchmark was running a bound expression. However, I will admit that I have almost no idea how this process works :). It's possible that there is nothing wrong with the dispatch mechanism itself and the overhead is in the individual kernel execution. We did try several different expressions in the benchmark.

asfimport commented 2 years ago

Tobias Zagorni / @zagto: The thread contention at small batch sizes is largely caused by copying/destructing shared pointers to DataType. Different threads constantly changing the refcount of the Int64 DataType seems to cause a lot of inter-core synchronization.

Flamegraph.png
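As a standalone illustration of that pattern (plain C++, not Arrow code): every copy and destruction of a shared_ptr performs an atomic update of the control block's refcount, so threads that all copy pointers to the same singleton DataType contend on one cache line:

```cpp
#include <memory>
#include <thread>
#include <vector>

// Sketch of the contention the flamegraph points at: each shared_ptr copy
// does an atomic refcount increment, and each destruction does the matching
// decrement, all on a cache line shared by every thread.
int main() {
  // Stand-in for the process-wide singleton returned by arrow::int64().
  auto shared_type = std::make_shared<int>(64);

  std::vector<std::thread> workers;
  for (int t = 0; t < 8; ++t) {
    workers.emplace_back([&shared_type] {
      for (int i = 0; i < 1000000; ++i) {
        // Copy: atomic increment on the shared refcount; the destructor at
        // the end of each iteration does the atomic decrement.
        std::shared_ptr<int> copy = shared_type;
        (void)*copy;
      }
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}
```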

asfimport commented 2 years ago

Weston Pace / @westonpace: Ah, I suppose that makes sense. Might be a bit of an interesting one to fix up. I'll create a sub-task to address this issue (maybe it will be the only issue, who knows).

asfimport commented 2 years ago

Weston Pace / @westonpace: I've created ARROW-16161 to discuss the shared_ptr copy overhead issue.
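The general direction for such a fix, sketched hypothetically here (this is not the actual change proposed in ARROW-16161), is to avoid refcount traffic on hot paths by passing by const reference or raw pointer and copying the shared_ptr only where ownership is needed:

```cpp
#include <memory>
#include <utility>

struct DataTypeLike { int id; };  // stand-in for arrow::DataType

// Hot path: inspecting a type needs no ownership, so take a const
// reference and touch no refcount. (The id value here is illustrative.)
inline bool IsInt64Like(const DataTypeLike& type) { return type.id == 64; }

// Cold path: storing a type does need ownership; take the shared_ptr by
// value and move it, paying for exactly one refcount bump at the call site.
struct FieldLike {
  std::shared_ptr<DataTypeLike> type;
  void SetType(std::shared_ptr<DataTypeLike> t) { type = std::move(t); }
};
```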

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked on. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked on, or if you plan to start that work soon.