apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Implement work-stealing scheduler / multiple queues in ThreadPool #18301

Open asfimport opened 4 years ago

asfimport commented 4 years ago

This involves a change from a single task queue shared amongst all threads to a per-thread task queue and the ability for idle threads to take tasks from other threads' queues (work stealing).

As part of this, the task submission API would need to be evolved in some fashion to allow tasks related to a particular workload to end up in the same task queue.
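For illustration, here is a minimal sketch of the shape such a change could take. This is not Arrow's actual `ThreadPool` API; the class name, members, and locking strategy are all assumptions invented for this sketch, and a production version would use per-queue locks or lock-free deques rather than the single mutex used here for brevity.

```cpp
// Minimal work-stealing pool sketch: one task deque per worker thread.
// A worker pops from its own queue LIFO (hot cache) and steals the
// oldest task FIFO from another queue when its own is empty.
#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class WorkStealingPool {
 public:
  explicit WorkStealingPool(size_t num_threads) : queues_(num_threads) {
    for (size_t i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this, i] { WorkerLoop(i); });
    }
  }

  ~WorkStealingPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }

  // Round-robin submission. Evolving this API so that related tasks land
  // on the same queue is the open question raised in the issue.
  void Submit(std::function<void()> task) {
    size_t i = next_queue_++ % queues_.size();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queues_[i].push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void WorkerLoop(size_t me) {
    while (true) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return TryPop(me, &task) || done_; });
        if (!task) return;  // done_ set and every queue drained
      }
      task();
    }
  }

  // Must be called with mutex_ held.
  bool TryPop(size_t me, std::function<void()>* task) {
    if (!queues_[me].empty()) {
      *task = std::move(queues_[me].back());  // own queue: LIFO
      queues_[me].pop_back();
      return true;
    }
    for (size_t i = 0; i < queues_.size(); ++i) {  // others: steal FIFO
      if (i == me || queues_[i].empty()) continue;
      *task = std::move(queues_[i].front());
      queues_[i].pop_front();
      return true;
    }
    return false;
  }

  std::vector<std::deque<std::function<void()>>> queues_;
  std::vector<std::thread> workers_;
  std::mutex mutex_;
  std::condition_variable cv_;
  std::atomic<size_t> next_queue_{0};
  bool done_ = false;
};
```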

Reporter: Wes McKinney / @wesm


Note: This issue was originally created as ARROW-10117. Please see the migration documentation for further details.

asfimport commented 3 years ago

Weston Pace / @westonpace: I'm a little confused by the word "workload" in the second paragraph. Traditional work stealing attempts to keep tasks together per thread/core to preserve cache locality, which is what appears to be described in the first paragraph.


In the second paragraph, are you asking for the capability to also group tasks based on workload? If so, I'm not sure what the benefit would be. If not, I don't think we'll end up needing to modify the API: a task can keep a thread_local reference to its queue (see the sketch below).
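As a rough illustration of that last idea (all names here, `TaskQueue`, `current_queue`, `SpawnLocal`, are invented for the sketch, not Arrow API):

```cpp
#include <deque>
#include <functional>

using TaskQueue = std::deque<std::function<void()>>;

// Each worker publishes its own queue when it starts.
thread_local TaskQueue* current_queue = nullptr;

void WorkerEntry(TaskQueue* my_queue) {
  current_queue = my_queue;
  // ... worker loop runs tasks popped from *my_queue ...
}

// A running task can spawn follow-up work onto the queue it is already on,
// keeping related tasks together without changing the public submission API.
// (A real pool would take the queue's lock here.)
void SpawnLocal(std::function<void()> task) {
  current_queue->push_back(std::move(task));
}
```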

asfimport commented 3 years ago

Weston Pace / @westonpace: I'm willing and interested in taking a stab at this, but I'm wondering a bit about evaluation, and I'll warn up-front that I don't expect to see any benefit anywhere. Pretty much all existing parallelism can be broken down into two categories:

Task-per-batch - In this case we create a new task to handle a new file or block of data. For example, when scanning a dataset with multiple files, we create a task for each file. When doing asynchronous readahead, we create a task per record batch.

Task-per-column - In a few places we take this approach to divide expensive work across columns (both patterns are sketched below).
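To make the two categories concrete, here is a hedged sketch using the hypothetical `WorkStealingPool` from the sketch in the issue description above. `ProcessFile`, `ProcessColumn`, `files`, and `batch` are placeholders, not Arrow's scanner or compute API.

```cpp
#include <string>
#include <vector>

struct RecordBatchStub {  // placeholder for a real record batch
  int num_columns() const { return 4; }
};

void ProcessFile(const std::string& path) { /* read + decode one file */ }
void ProcessColumn(const RecordBatchStub& batch, int c) { /* kernel on column c */ }

// Assumes `files` and `batch` outlive the submitted tasks.
void Illustrate(WorkStealingPool& pool, const std::vector<std::string>& files,
                const RecordBatchStub& batch) {
  // Task-per-batch: one task per file (or per record batch during readahead).
  for (const auto& file : files) {
    pool.Submit([&file] { ProcessFile(file); });
  }
  // Task-per-column: split expensive work on one batch across its columns.
  for (int c = 0; c < batch.num_columns(); ++c) {
    pool.Submit([&batch, c] { ProcessColumn(batch, c); });
  }
}
```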


For future work (the execution engine), the latest discussion seems to be centered on a "morsel-driven execution model". If I am stripping away the academia properly, a "morsel-driven execution model" simply means "task-per-batch".


Task-per-batch execution will never really benefit from a work-stealing scheduler. Each task operates on a brand-new chunk of data, so there isn't any cache locality to benefit from.

Task-per-column might benefit in two situations:

  * If there is enough other work going on at the same time, then multiple queues would be a way to disable task-per-column.

  * If used within a task-per-batch parallel pipeline, this would disable task-per-column dynamically when there were sufficient batches (this is basically a special case of the above bullet).


There is one other special case of parallelism: the background readers. One could theorize that multiple queues synchronized across the I/O and CPU thread pools might allow you to match up your I/O core with your processing core. However, in practice I have found that if the pipeline is not I/O bound then the point is moot, and if the pipeline is I/O bound then the context-switch overhead is minimal.


TL;DR: I will create an artificial benchmark that should see obvious benefits from work stealing. I'll use that benchmark to verify I implemented the work-stealing thread pool correctly. I'll then turn it on, run conbench, and cross my fingers, but I'm not expecting to see any change (though I'm more than happy to be pleasantly surprised).
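For a sense of what such an artificial benchmark might look like, here is one hypothetical shape (not necessarily the benchmark actually used): many independent chains of tasks, each chain repeatedly touching its own buffer, so a scheduler that keeps a chain on one core should win on cache locality. It reuses the `WorkStealingPool` sketch from above.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

constexpr int kChains = 64;               // independent task chains
constexpr int kTasksPerChain = 10000;     // tasks per chain
constexpr size_t kBufferBytes = 1 << 20;  // per-chain working set

std::atomic<int> tasks_done{0};

// Each task touches its chain's buffer, then schedules the next link.
// A locality-aware scheduler should keep a chain on one core.
void RunChain(WorkStealingPool& pool, std::vector<uint8_t>& buf, int remaining) {
  if (remaining == 0) return;
  for (auto& b : buf) b += 1;  // touch the whole working set
  tasks_done.fetch_add(1);
  pool.Submit([&pool, &buf, remaining] { RunChain(pool, buf, remaining - 1); });
}

int main() {
  // Buffers are declared before the pool so queued tasks never outlive them.
  std::vector<std::vector<uint8_t>> buffers(kChains,
                                            std::vector<uint8_t>(kBufferBytes));
  WorkStealingPool pool(8);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kChains; ++i) {
    pool.Submit([&pool, &buffers, i] { RunChain(pool, buffers[i], kTasksPerChain); });
  }
  // Busy-wait until every counted task has run (fine for a sketch).
  while (tasks_done.load() < kChains * kTasksPerChain) std::this_thread::yield();
  auto secs = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - start).count();
  std::cout << (kChains * kTasksPerChain) / secs << " tasks/sec\n";
}
```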

asfimport commented 3 years ago

Weston Pace / @westonpace: One interesting thing to note as my work on this progresses is that how tasks are scheduled (ARROW-12903) is having a greater and greater impact on performance. As an example, in my latest benchmarks a tiny reference workload runs at ~2 million tasks per second. With 8 threads and poor scheduling we end up with 620k tasks per second (this is similar to the baseline performance). With 8 threads and ideal scheduling we end up with 7-8 million tasks per second.

It's possible we can find some ways to better handle the worst-case scenario, but it's not something I'm going to be tackling as part of this PR.

asfimport commented 3 years ago

Weston Pace / @westonpace: Another note is that a lot of the academic literature (and techniques like Tokio's thread pool) are willing to introduce a lot of complexity to maximize thread pool performance. However, Arrow does not spawn millions of thread tasks today. Tools like morsel-driven execution and fused operators allow us to run with thousands of thread tasks per second. It's not clear to me how many maintenance headaches we want to introduce to squeeze every last drop out of the thread pool. At the moment I'm aiming to create the complex case first, measure its performance, then implement a more naive work-stealing implementation and re-benchmark, so we can make a more informed decision.

asfimport commented 3 years ago

Weston Pace / @westonpace: Other than the minor improvement to thread pool overhead described above, I saw no significant improvements with the work-stealing thread pool. I'd like to get ARROW-12878 merged, as it will make future experiments on thread pool strategies easier to perform. We could perhaps merge ARROW-12902, but I could understand if there was a desire to wait for some actual macro-improvement before doing so. I've attached a summary of my investigation: Work_Stealing_Thread_Pool_Investigation.pdf

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.