[IDAG] Switch to Instruction-Graph Scheduling

fknorr commented 1 month ago

This is the final PR in the IDAG series. It switches to the new IDAG-based runtime and drops all newly unused legacy components.

new runtime architecture

- runtime now manages multiple devices, and the distr_queue API has been updated to reflect this.
- buffer_manager, reduction_manager and host_object_manager are gone. ID assignment for these types is now handled by the runtime directly, and all components interacting with buffers, reductions and host objects (graph generators, executor and recorders) track the relevant state themselves as instructed by the notify_*_created/_destroyed calls introduced in #246. As a result, tasks no longer need to keep strong references to buffers and host objects (lifetime_extending_state).
- scheduler now generates both the command graph and the instruction graph in the same thread, and maintains ownership of both structures. The CDAG is pruned at generation time (since commands never leave the scheduler thread), and the IDAG is pruned once the scheduler is notified of epoch completion. Command serialization is gone, and with it the command is_flushed marker.
- The runtime now uses live_executor (replacing legacy_executor and worker_job) together with a communicator and a backend instance to execute instructions. communicator and receive_arbiter together replace buffer_transfer_manager. backend implementations replace legacy_backend, host_queue and device_queue.
- Runtime destruction is now delayed until the last buffer / queue / host object is destroyed. With this change, we can stop distinguishing between a non-existing and a shut-down runtime. For backwards compatibility, ~distr_queue will continue to epoch-synchronize.
- runtime asserts that non-thread-safe functions are only called from the application thread. This assertion will trigger on accidental value-captures of buffers / host objects into host tasks.
- Reductions are now available on all SYCL implementations since hipSYCL has added support. This allows us to drop a lot of #ifdefs in tests and frontend code.
- log_context, which was only used by worker_job, is removed.
- vendor/ctpl, which was only used by host_queue, is removed.
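
To illustrate the delayed runtime destruction and the application-thread check described above, here is a minimal sketch. It is not the code from this PR: toy_runtime, toy_buffer and toy_queue are hypothetical stand-ins, and using shared_ptr / weak_ptr for the reference counting is an assumption made for brevity. The check throws instead of asserting so that the behaviour is easy to observe.

```cpp
#include <memory>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for the runtime; NOT the actual Celerity class.
class toy_runtime {
  public:
    // Returns the live instance, creating it on first use.
    static std::shared_ptr<toy_runtime> acquire() {
        auto rt = instance().lock();
        if(!rt) {
            rt = std::shared_ptr<toy_runtime>(new toy_runtime());
            instance() = rt;
        }
        return rt;
    }

    // True while at least one handle (buffer, queue, ...) keeps the runtime alive.
    static bool is_alive() { return !instance().expired(); }

    // Non-thread-safe entry point: only legal on the thread that created the runtime.
    int create_buffer_id() {
        require_application_thread();
        return m_next_buffer_id++;
    }

  private:
    toy_runtime() : m_application_thread(std::this_thread::get_id()) {}

    // Weak reference only: the registry itself must not extend the runtime's lifetime.
    static std::weak_ptr<toy_runtime>& instance() {
        static std::weak_ptr<toy_runtime> weak;
        return weak;
    }

    void require_application_thread() const {
        if(std::this_thread::get_id() != m_application_thread) {
            throw std::logic_error("runtime function called from a non-application thread");
        }
    }

    std::thread::id m_application_thread;
    int m_next_buffer_id = 0;
};

// Hypothetical user-facing handles. Each one holds a strong reference, so the
// runtime is destroyed only when the last handle goes out of scope.
struct toy_buffer {
    std::shared_ptr<toy_runtime> rt = toy_runtime::acquire();
    int id = rt->create_buffer_id();
};

struct toy_queue {
    std::shared_ptr<toy_runtime> rt = toy_runtime::acquire();
};
```

In this sketch, creating a toy_queue and a toy_buffer and then destroying the buffer leaves the runtime alive until the queue is also gone. If a copy of the buffer handle were captured by value into a task running on another thread, any call to create_buffer_id() made through that copy would throw.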

Since one node now addresses multiple GPUs, scheduling becomes more expensive (as a rough estimate, IDAG generation is about 4x as expensive as CDAG generation). This will be visible in benchmark results.

github-actions[bot] commented 1 month ago

Check-perf-impact results: (877795252c9a57f7b343e4747db6ca4f)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: 7 individual benchmarks affected
:heavy_plus_sign: Added microbenchmark(s): 48 individual benchmarks affected
:heavy_minus_sign: Removed microbenchmark(s): 48 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

command-graph : 0.94x
graph-nodes : 1.01x
grid : 1.04x
instruction-graph : 1.02x
scheduler : new :star2:
system : 3.30x :warning:
task-graph : 1.08x

coveralls commented 1 month ago

Pull Request Test Coverage Report for Build 10213951594

Details

368 of 368 (100.0%) changed or added relevant lines in 19 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage increased (+1.8%) to 94.7%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/task.cc | 1 | 92.06% |

Totals
Change from base Build 10143808743:	1.8%
Covered Lines:	6564
Relevant Lines:	6700


github-actions[bot] commented 1 month ago

Check-perf-impact results: (f2e639c8a97550e58528a410c1b8586d)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: 8 individual benchmarks affected
:heavy_plus_sign: Added microbenchmark(s): 48 individual benchmarks affected
:heavy_minus_sign: Removed microbenchmark(s): 48 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

command-graph : 1.12x :warning:
graph-nodes : 1.02x
grid : 1.03x
instruction-graph : 1.15x :warning:
scheduler : new :star2:
system : 3.44x :warning:
task-graph : 1.24x :warning:

Edit: We inadvertently disabled mimalloc. All hail the benchmark suite!

github-actions[bot] commented 1 month ago

Check-perf-impact results: (2908f97f836fd2def14c3429cd4d61ac)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: 5 individual benchmarks affected
:rocket: Significant speedup (<0.80x) in some microbenchmark results: generating large command graphs for N nodes - 1 / chain topology
:heavy_plus_sign: Added microbenchmark(s): 48 individual benchmarks affected
:heavy_minus_sign: Removed microbenchmark(s): 48 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

command-graph : 0.89x :rocket:
graph-nodes : 1.03x
grid : 1.00x
instruction-graph : 0.99x
scheduler : new :star2:
system : 3.24x :warning:
task-graph : 0.98x

celerity / celerity-runtime