NVIDIA / stdexec

`std::execution`, the proposed C++ framework for asynchronous and parallel programming.
Apache License 2.0
1.56k stars 159 forks source link

improve performance at scale of `exec::static_thread_pool` #986

Open ericniebler opened 1 year ago

ericniebler commented 1 year ago

The design of stdexec's static_thread_pool's is similar to the one in libunifex, which apparently suffers at scale. It's safe to assume that our static_thread_pool has similar issues.

Investigate the issues discussed here and implement a solution that performs better.

ericniebler commented 1 year ago

See this reddit comment from someone who believes that techniques used by the Go scheduler would be applicable here.

ericniebler commented 1 year ago

Why doesn't every worker thread have its own queue that can never be accessed by other threads? If push is called from a worker thread, then the work gets put onto that thread's queue with no locking needed. (If push is called from a non-worker thread we do what we're doing now.) When pop is called we first try to dequeue an item from the thread-local queue, and if it's empty, we fall back to pulling from one of the shared queues.

I suppose this risks starvation of work items placed in the shared queues, which are essentially treated as if they have lower priority. We could probably tune that by having the worker threads occasionally pull from the shared queue even if their local queue has work to do. Or else have one or more threads only pulling from the shared queues.

gevtushenko commented 1 year ago

We could probably tune that by having the worker threads occasionally pull from the shared queue even if their local queue has work to do. ... Or else have one or more threads only pulling from the shared queues.

This could introduce performance regressions on systems with NUMA. If we consider bulk in repeat_n, having threads pulling from non-local queues would introduce NUMA traffic.

mfbalin commented 1 year ago

@ericniebler I don't think it is necessary to invent a scheduler from scratch. I would like to bring to your attention the following work: https://github.com/taskflow/taskflow. It has a highly efficient task-stealing scheduler. It is distributed under the MIT license. If a similar efficient scheduling algorithm was available in stdexec, that would be great. The publication about it is here: https://tsung-wei-huang.github.io/papers/tpds21-taskflow.pdf. When it comes to task stealing on machines with NUMA domains, one can prioritize stealing from queues on the same numa domain and only if unsuccessful, go on to steal from the queues in the other NUMA domain. This would effectively minimize the NUMA traffic that @senior-zero mentioned.

EDIT: Modern CPUs effectively have NUMA even in the same NUMA domain. If you take AMD's latest CPUs, they come in groups of 8 on a single chiplet. This idea of stealing from close neighbors can be applied at those scales in a hierarchical manner as well.

image

ericniebler commented 1 year ago

I don't think it is necessary to invent a scheduler from scratch. I would like to bring to your attention the following work: https://github.com/taskflow/taskflow. It has a highly efficient task-stealing scheduler.

That looks very cool. Mos def we should take good ideas (and good oss code) from wherever we can get it. There's also value in having a simple, efficient-ish thread pool scheduler as an example. Maybe something based on the taskflow scheduler design could serve both needs?

weilewei commented 1 year ago

Just wanna ping @tsung-wei-huang who is the developer of TaskFlow. Earlier this summer, I have talked to Tsung-Wei about exploring sender + TaskFlow. Let me know if there is any collaboration opportunity : )