[IDAG] Generate and test Instruction Graph

fknorr commented 6 months ago

This is the first in a series of PRs to replace Celerity's ad-hoc local scheduling (buffer_manager, buffer_transfer_manager, executor and worker_job) with a per-node static schedule known as the Instruction Graph (IDAG). The IDAG sits at an abstraction level below task- and command graph, and manages memory allocation, memory copies, kernel launches, and MPI operations as individual graph nodes. This moves most tracking overhead ahead in time to the scheduler and improves concurrency between these "µop" instructions at the time of execution.

It also gives Celerity two new key features:

Multi-device support: Current Celerity manages one device / GPU per node (MPI rank), which hurts inter-device communication performance on systems with more than one GPU per physical system. Multi-device support has been attempted before but became increasingly difficult to implement correctly around features such as reductions.
Multiple backing allocations per buffer: Celerity currently has to re-size a buffer's backing allocation to cover every accessed subrange, even if disjoint. The IDAG permits multiple (non-overlapping) allocations instead. In the future, this can become the basis for advanced graph-based memory allocators.

Finally, the extensible, test-friendly scheduling approach will allow us to upstream new features like collective communication or dynamic scheduling (?) with good performance and maintainability in the future.

idag-meme

Instruction Graph (IDAG)

The Instruction Graph is, unlike the task / command graph, not an intrusive_graph but references its predecessors by id. Despite what the visual representation might suggest, it only contains information strictly necessary for execution, and all debug info (dependency origins, buffer ranges, kernel names) are written to their corresponding records when recording is enabled. The graph nodes are passed directly to the executor without an additional "serialized" representation.

The IDAG (and future instruction_executor) have no notion of buffers, the abstraction it operates on are allocations, which are roughly equivalent to pointers at execution time. Allocations can either originate from from an alloc-instruction (for buffers, reduction scratch-spaces) or from "user space" (buffer host-init pointers, fence buffer_snapshots). Kernel instructions contain metadata to perform accessor hydration based on these allocations and statically known offsets into them.

For peer-to-peer MPI data transfers, push-commands can be compiled to their equivalent send-instructions in a rather straightforward manner, but await-push commands are more involved. Since at scheduling time it is not known which boxes the local node will receive from which peers, the IDAG must conservatively provide allocations that can receive the full pushed region en-bloc, and perform some of the tracking previously done by the buffer_transfer_manager in a logic called receive arbitration: To allow the receiving side to MPI_Recv into an existing allocation without unpacking it from a frame as it exists today, each send-instruction is accompanied by an pilot message that is transferred to the receiver as soon as the instruction is generated, which maps a message_id (MPI tag) to a transfer_id and buffer subrange, providing the receiver with all parameters needed to issue an MPI_Recv.

The IDAG re-uses host-buffer allocations as send and receive allocations for all MPI operations. This avoids additional complexity for allocating and pooling staging buffers, but comes at a performance cost if the MPI operations are strided (the worst-case would be sending or receiving a strided column in a 2D-split stencil, for example). This will be addressed by a future graph-allocation method which will also enable directly receiving into device-staging buffers through RDMA.

Reductions in the IDAG are performed in two stages, where an eager node-local reduction (on execution_command) precedes the lazy and optional global reduction (on reduction_command). Reductions are executed on the host and in host memory, where a scratch space is allocated to copy / receive all partial inputs before applying the reduction operator.

Any resources that are tracked besides DAG allocations and host objects (e.g. reduction functors) can be deleted from the executor's tracking structures through an instruction garbage list that is attached to each horizon and epoch.

Instruction Graph Generation

The instruction_graph_generator (IGGen) is parameterized with an abstract system configuration which defines the number of devices, the number of memories, and system capabilities. This model supports multiple devices sharing the same memory or sharing memory with the host (although that will not initially be used by the runtime) and also systems where it is not possible to memcpy between every pair of memories - this is notably true of our own test system with two Intel Arc GPUs which do not have p2p copy functionality.

The IGGen maintains state about each buffer and host object that is created on the runtime's side and naturally generates instruction nodes in topological order, i.e. a sequential execution of instructions will fulfill all internal dependencies. Multiple allocations are tracked per buffer, but they currently cannot overlap except during a resize operation. Access fronts are tracked for both reads and writes in order to compute true/anti/output-dependencies without walking the instruction graph or visiting past commands.

Whenever there are multiple producers or consumers of data (around copy, send or receive instructions), the IGGen splits these transfers to achieve maximum concurrency - i.e. no transfer instruction should / will ever introduce a synchronization point between producer- or consumer instructions that was not there before.

The buffer_manager contains functionality to spill a buffer to the host in order to perform a re-size where the source- and destination allocation would not both fit in device memory. This is missing from the IGGen; recognizing this spilling-condition will be the scope of a future graph-based memory allocator. We also do not currently require the feature in tests or any applications we develop (and it is very slow), so it will disappear at least in the meantime.

The implementation in this PR aims to be complete, correct and maintainable. Performance of both the generator and the generated graph can probably be improved in many ways, but given the high complexity that is required as-is, I did not sacrifice readability (or much time) for performance improvements. I've nevertheless added a DAG benchmark for instruction generation.

How to Review This

This PR only contains the logic to generate, test, and print the IDAG, and does not touch the runtime's execution model yet. This is done to keep the reviews as painless as possible. See the WIP instruction-graph branch if you need more context.

The most relevant resources for understanding the graph structure and generation logic up-front are the instruction_graph_tests, which show the expected properties of all (or most) situations the graph must cover. Use the test executables together with render-graphs.py to get a rendering - the IDAG associates a ton of debug information with each node.

May your eyes bleed less from reviewing than my hands did implementing this.

coveralls commented 6 months ago

Pull Request Test Coverage Report for Build 8329557282

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

1566 of 1585 (98.8%) changed or added relevant lines in 19 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage increased (+0.7%) to 94.641%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
include/fence.h	2	4	50.0%
src/task.cc	14	16	87.5%
src/print_graph.cc	236	239	98.74%
src/instruction_graph_generator.cc	982	994	98.79%
<!--	Total:	1566	1585	98.8%	-->

Files with Coverage Reduction	New Missed Lines	%
src/task.cc	1	87.58%
<!--	Total:	1		-->

Totals
Change from base Build 7976676935:	0.7%
Covered Lines:	6574
Relevant Lines:	6772

💛 - Coveralls

github-actions[bot] commented 6 months ago

Check-perf-impact results: (7b849de16ff11660b98988ab0b032db7)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology, building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / contracting tree topology
:heavy_plus_sign: Added microbenchmark(s): 18 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

command-graph : 1.05x
graph-nodes : 1.00x
grid : 1.00x
instruction-graph : new :star2:
scheduler : 1.07x
system : 1.02x
task-graph : 1.01x

fknorr commented 6 months ago

I've added an additional test suite instruction_graph_grid_tests for grid utilities that are specific to instruction generation. This only contains tests for boxes_edge_connected and connected_subregion_bounding_boxes (both relevant for determining the granularity of receive / split-receive instructions), because they had insufficient coverage from the IGGen tests so far.

fknorr commented 6 months ago

I've added a check that will issue a warning if Celerity generates excessive numbers of small send commands due to split_into_communicator_compatible_boxes. This will happen e.g. when 2d-splitting a stencil on a buffer with x-dimension (dim 1) larger than 2^31 - which I hope is going to be very rare.

celerity / celerity-runtime