PaRSEC is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed, GPU-accelerated, many-core heterogeneous architectures. PaRSEC assigns computation threads to cores and GPU accelerators, overlaps communications and computations, and uses a dynamic, fully-distributed scheduler based on architectural features such as NUMA nodes and algorithmic features such as data reuse.
Recursive tasks are all scheduled on the queue of the execution stream that executed the parent. This leads to suboptimal behavior in the case where:
- the recursive task is high priority, and
- queues on other execution streams are full.
When both hold, other execution streams never steal the high-priority task, which essentially serializes its execution and removes much of the benefit of executing it recursively. One example occurs with the POTRF tasks in HiCMA, where GEMM tasks appear to be frequently scheduled before the recursive POTRF tasks, leading to global starvation once the GEMM tasks are complete. This is observable as oscillatory behavior in core utilization, even at reasonable levels of lookahead, where there should always be many GEMM tasks available to execute.
The image below demonstrates the issue. It plots `max(idle_cores - queue_len, 0)`, where both `idle_cores` and `queue_len` are global sums at each measurement point, taken approximately every millisecond. The plot shows that cores are starving for work globally, which should never happen with Cholesky, and that they are being limited by the execution of the POTRF tasks on the diagonal, implying a priority inversion.
This naturally only affects schedulers with distributed queues; global schedulers such as AP are not affected. The above used the LLP scheduler.
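For reference, here is a minimal sketch of that metric in C, assuming the two counters have already been summed globally at each sample. The function and variable names are placeholders, not PaRSEC instrumentation:

```c
/* Sketch of the starvation metric from the plot.  The names are
 * placeholders, not PaRSEC instrumentation: idle_cores and queue_len
 * are each the global sum over all execution streams at one ~1 ms
 * sample point. */
#include <stdio.h>

static int starvation(int idle_cores, int queue_len)
{
    /* Cores are starving when more cores are idle than there are
     * tasks queued anywhere: max(idle_cores - queue_len, 0). */
    int s = idle_cores - queue_len;
    return s > 0 ? s : 0;
}

int main(void)
{
    /* Made-up samples: during the POTRF-induced stalls the queues
     * drain while cores sit idle, so the metric goes positive. */
    printf("%d\n", starvation(12, 3));   /* -> 9: nine cores starving */
    printf("%d\n", starvation(2, 40));   /* -> 0: plenty of work      */
    return 0;
}
```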
Describe the solution you'd like
Several solutions come to mind:
- Distribute scheduling of recursive tasks across all execution streams in the VP. This may be problematic to implement, since only access to execution stream 0 is guaranteed to be thread-safe, and e.g. the LLP scheduler relies on this assumption.
- Steal more aggressively, even when the local queue is not drained. I'm not sure what heuristic should be used for this. We might also need to make sure that schedulers steal the highest-priority task and not just the back of the queue; I don't recall what the current behavior is.
- Mark tasks as "high priority" and schedule them on a dedicated AP-like scheduler that is queried before the local scheduler (a minimal sketch of this lookup order follows this list). This might increase overhead, but the issues in AP really come from managing absolute priorities across very many tasks; as long as the number of "high priority" tasks stays fairly small, this shouldn't be too bad, I think.
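To make the third option concrete, here is a minimal sketch of the intended lookup order: a small shared pool of high-priority tasks that every execution stream checks before its per-stream scheduler. Every name below (hp_pool_t, task_t, select_next_task, local_scheduler_pop, ...) is hypothetical and does not correspond to PaRSEC's scheduler interface; it only illustrates the ordering and the small amount of locking involved.

```c
/* Hypothetical sketch of option 3: a shared "high priority" pool that
 * every execution stream consults before its local queue.  None of
 * these types or functions are PaRSEC API; the point is only that a
 * marked task (e.g. a recursive POTRF) is picked up by the first free
 * stream instead of waiting behind GEMMs on one stream. */
#include <pthread.h>
#include <stdio.h>

typedef struct task_s {
    int id;
    struct task_s *next;
} task_t;

typedef struct {
    pthread_mutex_t lock;  /* cheap to take while the number of */
    task_t *head;          /* high-priority tasks remains small  */
} hp_pool_t;

static void hp_pool_push(hp_pool_t *pool, task_t *t)
{
    pthread_mutex_lock(&pool->lock);
    t->next = pool->head;
    pool->head = t;
    pthread_mutex_unlock(&pool->lock);
}

static task_t *hp_pool_pop(hp_pool_t *pool)
{
    pthread_mutex_lock(&pool->lock);
    task_t *t = pool->head;
    if (t != NULL) pool->head = t->next;
    pthread_mutex_unlock(&pool->lock);
    return t;
}

/* Stand-in for whatever the per-stream scheduler (LLP, ...) would
 * normally return from its own queues. */
static task_t *local_scheduler_pop(task_t **local_queue)
{
    task_t *t = *local_queue;
    if (t != NULL) *local_queue = t->next;
    return t;
}

/* Selection order: high-priority pool first, then the local queue. */
static task_t *select_next_task(hp_pool_t *pool, task_t **local_queue)
{
    task_t *t = hp_pool_pop(pool);
    if (t != NULL) return t;
    return local_scheduler_pop(local_queue);
}

int main(void)
{
    hp_pool_t pool;
    pthread_mutex_init(&pool.lock, NULL);
    pool.head = NULL;

    task_t gemm  = { 1, NULL };
    task_t potrf = { 2, NULL };
    task_t *local_queue = &gemm;   /* a GEMM already queued locally     */
    hp_pool_push(&pool, &potrf);   /* recursive POTRF marked high-prio  */

    /* The POTRF comes out before the locally queued GEMM. */
    printf("next: %d\n", select_next_task(&pool, &local_queue)->id);  /* 2 */
    printf("next: %d\n", select_next_task(&pool, &local_queue)->id);  /* 1 */
    return 0;
}
```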
Describe alternatives you've considered
There is prior research into load-balancing, work-stealing, and task priority; we can look into a solution from the current literature. Marc pointed me towards a paper from Vivek Sarkar's group at Euro-Par 2015, "Load Balancing Prioritized Tasks via Work-Stealing", which I will be reading imminently.