Check-perf-impact results: (7b849de16ff11660b98988ab0b032db7)
:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/utils.cc | 0 | 2 | 0.0% |
| src/out_of_order_engine.cc | 178 | 190 | 93.68% |
<!-- | Total: | 187 | 201 | 93.03% | --> |
| Totals | |
|---|---|
| Change from base Build 9205581023: | -0.07% |
| Covered Lines: | 7053 |
| Relevant Lines: | 7266 |
This is the third PR in the Instruction Graph series. It introduces the new executor state machine, dubbed the out-of-order engine, as a stand-alone, testable component.
Much like the current executor, the out-of-order engine keeps track of all instructions that have not yet completed, decides which instructions to schedule onto which backend resources at what time, and is notified when instructions complete. This will allow us to keep the instruction executor free of most instruction state tracking.
Unlike the current approach, this new form of scheduling is based on a definition of backends which maintain an array of in-order thread queues for host work and in-order SYCL queues for device submissions. This allows the engine to omit host / executor loop round-trips between consecutive GPU / CPU loads by scheduling successors onto the same in-order queue as their predecessors, which implicitly fulfils the dependencies and thus hides SYCL and CUDA kernel launch latency.
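To illustrate the idea of implicitly fulfilling dependencies through in-order queues, here is a minimal sketch. It is not the code from this PR: `toy_engine`, `submit`, `try_assign` and `complete` are hypothetical names, and the sketch ignores the distinction between host and device queues. An instruction is issued eagerly if all of its incomplete predecessors are already in flight on one and the same queue; otherwise it has to wait for completions to be reported.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical identifiers for illustration only.
using instruction_id = std::size_t;
using queue_id = std::size_t;

class toy_engine {
  public:
    // Register an instruction with the ids of its predecessors.
    // Assumption: predecessors are always submitted before their successors,
    // so an unknown dependency id is treated as already completed.
    void submit(instruction_id iid, std::vector<instruction_id> deps) {
        m_pending[iid] = node{std::move(deps), std::nullopt};
    }

    // Try to issue `iid`. Returns the queue it was placed on, or nullopt if it must wait.
    std::optional<queue_id> try_assign(instruction_id iid, queue_id preferred) {
        node& n = m_pending.at(iid);
        assert(!n.queue.has_value()); // must not be issued twice
        std::optional<queue_id> target;
        for (const instruction_id dep : n.deps) {
            const auto it = m_pending.find(dep);
            if (it == m_pending.end()) continue; // predecessor already completed
            const auto& dep_queue = it->second.queue;
            if (!dep_queue.has_value()) return std::nullopt; // predecessor not issued yet
            if (target.has_value() && *target != *dep_queue) {
                return std::nullopt; // in-flight predecessors span several queues
            }
            target = dep_queue;
        }
        // With an in-flight predecessor we must reuse its queue so that in-order
        // execution enforces the dependency; otherwise any queue will do.
        n.queue = target.value_or(preferred);
        return n.queue;
    }

    // Report that an issued instruction has finished executing.
    void complete(instruction_id iid) {
        const auto it = m_pending.find(iid);
        assert(it != m_pending.end() && it->second.queue.has_value());
        m_pending.erase(it);
    }

  private:
    struct node {
        std::vector<instruction_id> deps;
        std::optional<queue_id> queue; // set once the instruction is in flight
    };
    std::unordered_map<instruction_id, node> m_pending;
};
```

In this sketch, a chain such as kernel A followed by kernel B lands on the same queue without the engine ever waiting for A to complete, whereas an instruction whose in-flight predecessors sit on two different queues is held back until completions reduce them to at most one queue.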
In the future, I would like to improve this further by submitting instructions that have dependencies on multiple queues / devices earlier, waiting on in-flight SYCL events instead of on reported completions.
This PR does not yet touch the actual executor, which allows us to see the unit test coverage of the engine in isolation.