Closed - fknorr closed 5 months ago
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| include/fence.h | 2 | 4 | 50.0% |
| src/task.cc | 14 | 16 | 87.5% |
| src/print_graph.cc | 236 | 239 | 98.74% |
| src/instruction_graph_generator.cc | 982 | 994 | 98.79% |
| Total: | 1566 | 1585 | 98.8% |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/task.cc | 1 | 87.58% |
| Total: | 1 | |
| Totals | |
|---|---|
| Change from base Build 7976676935: | 0.7% |
| Covered Lines: | 6574 |
| Relevant Lines: | 6772 |
Check-perf-impact results: (7b849de16ff11660b98988ab0b032db7)
:warning: Significant slowdown (>1.25x) in some microbenchmark results:

- building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology
- building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / contracting tree topology
:heavy_plus_sign: Added microbenchmark(s): 18 individual benchmarks affected
Relative execution time per category: (mean of relative medians)
I've added an additional test suite, `instruction_graph_grid_tests`, for grid utilities that are specific to instruction generation. It currently only contains tests for `boxes_edge_connected` and `connected_subregion_bounding_boxes` (both relevant for determining the granularity of receive / split-receive instructions), because these had insufficient coverage from the IGGen tests so far.
I've added a check that issues a warning if Celerity generates an excessive number of small send commands due to `split_into_communicator_compatible_boxes`. This will happen e.g. when 2d-splitting a stencil on a buffer whose x-dimension (dim 1) is larger than 2^31 - which I hope will be very rare.
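The 2^31 limit comes from MPI describing transfer sizes with `int` counts. A minimal, hypothetical sketch of the resulting chunking (the real `split_into_communicator_compatible_boxes` operates on multi-dimensional boxes and the full buffer layout, not a bare extent):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: split a 1D extent into chunks that each fit into
// MPI's `int` count parameter. Every chunk then needs its own send command,
// which is why an extent above 2^31 can trigger many small sends.
std::vector<int64_t> split_extent_for_mpi(int64_t extent) {
	constexpr int64_t max_count = INT32_MAX; // MPI counts are `int`
	std::vector<int64_t> chunks;
	for(int64_t offset = 0; offset < extent; offset += max_count) {
		chunks.push_back(std::min(max_count, extent - offset));
	}
	return chunks;
}
```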
This is the first in a series of PRs to replace Celerity's ad-hoc local scheduling (`buffer_manager`, `buffer_transfer_manager`, `executor` and `worker_job`) with a per-node static schedule known as the Instruction Graph (IDAG). The IDAG sits at an abstraction level below the task and command graphs, and manages memory allocations, memory copies, kernel launches, and MPI operations as individual graph nodes. This moves most tracking overhead ahead in time, into the scheduler, and improves concurrency between these "µop" instructions at execution time. It also gives Celerity two new key features:
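To illustrate the "µop" granularity, here is a hypothetical sketch of what such instruction nodes could look like (the names `alloc_instruction`, `copy_instruction` etc. and their fields are illustrative, not the PR's actual types):

```cpp
#include <cstddef>
#include <variant>
#include <vector>

// Hypothetical sketch: each IDAG node is one of a small set of instruction
// kinds and references its predecessors by id rather than by pointer.
using instruction_id = size_t;

struct alloc_instruction { size_t size_bytes; };
struct copy_instruction { instruction_id source_alloc; size_t size_bytes; };
struct launch_instruction { int device_id; };
struct send_instruction { int dest_rank; int tag; };

struct instruction {
	instruction_id id;
	std::vector<instruction_id> dependencies; // predecessors by id
	std::variant<alloc_instruction, copy_instruction, launch_instruction, send_instruction> op;
};
```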
Finally, the extensible, test-friendly scheduling approach will allow us to upstream new features like collective communication or dynamic scheduling (?) with good performance and maintainability in the future.
Instruction Graph (IDAG)
The Instruction Graph is, unlike the task / command graph, not an `intrusive_graph`, but references its predecessors by id. Despite what the visual representation might suggest, it only contains information strictly necessary for execution; all debug info (dependency origins, buffer ranges, kernel names) is written to corresponding records when recording is enabled. The graph nodes are passed directly to the executor without an additional "serialized" representation.

The IDAG (and the future `instruction_executor`) has no notion of buffers; the abstraction it operates on is allocations, which are roughly equivalent to pointers at execution time. Allocations can either originate from an alloc-instruction (for buffers and reduction scratch spaces) or from "user space" (buffer host-init pointers, fence `buffer_snapshot`s). Kernel instructions contain metadata to perform accessor hydration based on these allocations and statically known offsets into them.

For peer-to-peer MPI data transfers, push commands can be compiled to their equivalent send-instructions in a rather straightforward manner, but await-push commands are more involved. Since at scheduling time it is not known which boxes the local node will receive from which peers, the IDAG must conservatively provide allocations that can receive the full pushed region en bloc, and perform some of the tracking previously done by the `buffer_transfer_manager` in a logic called receive arbitration: to allow the receiving side to `MPI_Recv` into an existing allocation without unpacking it from a `frame` as it exists today, each send-instruction is accompanied by a pilot message that is transferred to the receiver as soon as the instruction is generated. The pilot maps a `message_id` (MPI tag) to a `transfer_id` and buffer subrange, providing the receiver with all parameters needed to issue an `MPI_Recv`.

The IDAG re-uses host-buffer allocations as send and receive allocations for all MPI operations. This avoids additional complexity for allocating and pooling staging buffers, but comes at a performance cost if the MPI operations are strided (the worst case would be sending or receiving a strided column in a 2D-split stencil, for example). This will be addressed by a future graph-allocation method, which will also enable directly receiving into device staging buffers through RDMA.
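A hypothetical sketch of the pilot-message payload described above (field names and layout are illustrative, not the PR's actual wire format):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: sent eagerly when a send-instruction is generated,
// the pilot gives the receiver all parameters needed to post a matching
// MPI_Recv directly into an existing allocation, without a `frame`.
struct buffer_subrange {
	size_t offset[3]; // box offset within the buffer
	size_t range[3];  // box extent
};

struct pilot_message {
	int message_id;       // doubles as the MPI tag of the payload transfer
	uint64_t transfer_id; // identifies the push / await-push pair
	uint64_t buffer_id;   // which buffer the data belongs to
	buffer_subrange sr;   // which region of that buffer is being sent
};
```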
Reductions in the IDAG are performed in two stages: an eager node-local reduction (on `execution_command`) precedes the lazy and optional global reduction (on `reduction_command`). Reductions are executed on the host and in host memory, where a scratch space is allocated to copy / receive all partial inputs before applying the reduction operator.

Any resources that are tracked besides IDAG allocations and host objects (e.g. reduction functors) can be deleted from the executor's tracking structures through an instruction garbage list that is attached to each horizon and epoch.
Instruction Graph Generation
The `instruction_graph_generator` (IGGen) is parameterized with an abstract system configuration which defines the number of devices, the number of memories, and system capabilities. This model supports multiple devices sharing the same memory or sharing memory with the host (although that will not initially be used by the runtime), and also systems where it is not possible to memcpy between every pair of memories - this is notably true of our own test system with two Intel Arc GPUs, which do not have p2p copy functionality.

The IGGen maintains state about each buffer and host object that is created on the runtime's side, and naturally generates instruction nodes in topological order, i.e. a sequential execution of instructions will fulfill all internal dependencies. Multiple allocations are tracked per buffer, but they currently cannot overlap except during a resize operation. Access fronts are tracked for both reads and writes in order to compute true/anti/output dependencies without walking the instruction graph or visiting past commands.
Whenever there are multiple producers or consumers of data (around copy, send or receive instructions), the IGGen splits these transfers to achieve maximum concurrency - i.e. no transfer instruction will ever introduce a synchronization point between producer or consumer instructions that was not there before.
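The splitting rule can be illustrated with a 1D toy example (hypothetical simplification: the IGGen operates on multi-dimensional boxes and regions):

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical 1D sketch: instead of one copy covering the whole consumed
// range (which would depend on *all* producers and serialize them), emit one
// copy per overlap with a producer's written range, so each copy only
// depends on its single producer.
struct range1d { size_t begin, end; };

std::optional<range1d> intersect(range1d a, range1d b) {
	const size_t lo = std::max(a.begin, b.begin), hi = std::min(a.end, b.end);
	if(lo >= hi) return std::nullopt;
	return range1d{lo, hi};
}

std::vector<range1d> split_transfer(range1d consumed, const std::vector<range1d>& producer_writes) {
	std::vector<range1d> copies;
	for(const auto& w : producer_writes) {
		if(auto i = intersect(consumed, w)) copies.push_back(*i);
	}
	return copies;
}
```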
The `buffer_manager` contains functionality to spill a buffer to the host in order to perform a resize where the source and destination allocation would not both fit in device memory. This is missing from the IGGen; recognizing this spilling condition will be in the scope of a future graph-based memory allocator. We also do not currently require the feature in tests or any applications we develop (and it is very slow), so it will disappear at least in the meantime.

The implementation in this PR aims to be complete, correct and maintainable. Performance of both the generator and the generated graph can probably be improved in many ways, but given the high complexity that is required as-is, I did not sacrifice readability (or much time) for performance improvements. I've nevertheless added a DAG benchmark for instruction generation.
How to Review This
This PR only contains the logic to generate, test, and print the IDAG, and does not touch the runtime's execution model yet. This is done to keep the reviews as painless as possible. See the WIP instruction-graph branch if you need more context.
The most relevant resources for understanding the graph structure and generation logic up-front are the `instruction_graph_tests`, which show the expected properties of all (or most) situations the graph must cover. Use the test executables together with render-graphs.py to get a rendering - the IDAG associates a ton of debug information with each node.