[handshake-simulator] Design proposal and tracking issue

lucas-rami commented 2 months ago

This is a design proposal and tracking issue for the implementation of the Handshake-level dataflow circuit simulator, or Handshake simulator for short. While we already have an experimental version of this simulator on the repository, the design/implementation effort required to get it to a correct and usable state is almost equivalent to a full rewrite. This issue therefore assumes that we start from scratch.

Goal & Requirements

The goal is to give us the ability to simulate dataflow circuits from the Handshake-level IR alone, without going to RTL through the backend and then simulating the resulting circuit with a tool like ModelSim. The former offers several benefits over the latter (in no particular order).

External tool independence. We would no longer be reliant on potentially-licensed external tools (e.g., ModelSim) or on RTL components under IP (e.g., floating-point operators) to simulate our circuits.
Simulation speed. We do not need a full C++-level hardware simulator, only something that can simulate dataflow circuits that are representable in Handshake. This narrower scope lets us make simplifying assumptions that should result in faster simulation times than RTL simulation.
Analysis power. We could very easily peek into our circuits' behavior under simulation, enabling the extraction of the internal circuit state at any point of the execution, from which statistics may be derived and used for further circuit optimizations/transformations. It would be significantly harder to extract and exploit the same information from the waveform produced by an RTL simulation tool due to the simulation's granularity.
Developer experience. We would allow developers to assess the impact of new components on their circuits' throughput/latency before having to write any RTL. Developers could benchmark and quickly iterate on their component's internal implementation using the Handshake simulator and, only when they obtain satisfying results, finally implement their component in a hardware description language.

In order to ensure the simulator's usefulness to the community, we establish a few key design requirements.

Cycle-accuracy. The simulator will work by associating an execution model (written in C++ and intertwined with MLIR's API) to each operation in the input IR. These models should be defined in a way that they can be made cycle-accurate with respect to an existing or envisioned RTL implementation.
Simplicity. It should be as easy as possible to define execution models for dataflow components, ideally always easier than in RTL. The code should look clean and be easily understandable by Dynamatic users, with most of the complexity being abstracted away in custom data-structures and APIs.
Flexibility. Modifying an existing execution model, replacing one entirely, or adding a new one should be trivial since we envision this as a very frequent simulator use case.
Instrumentation. It should be possible to extract, summarize, and analyze every aspect of the internal state of a circuit under simulation.

For completeness, we also list some of the features that we do not care to have in the simulator, at least at the moment.

Combinational delays. We do not care about combinational delays, i.e., signal propagation is instantaneous.
Non-binary states. We do not care to model anything beyond the basic two-state logic (0 and 1). For example, we will not handle undefined or high-impedance states.

Implementation roadmap

In this section we try to break down the implementation of the simulator from scratch into multiple manageable steps. This will most likely be edited a lot throughout the simulator's implementation. Issues, PRs, and commits that relate to each point will be referenced here. To keep individual contributions manageable, each pull request must cover at most the content of one of the subsections below. Smaller pull requests for sub-tasks in each subsection are allowed and encouraged if they make sense from a development perspective.

Execution models

This is likely the step that requires the most careful thought and the most design effort as it is what Dynamatic users are the most likely to interact with.

[x] Define and implement the API for execution models, which are in charge of simulating each component's RTL implementation. This includes defining the API for querying/modifying the circuit's state through each operation's operands/results (533da29).

I really think it is key that we get this right and make the life of the future user as easy as possible. I have some syntax in mind that I think will be very nice to work with; it is inspired by the way MLIR handles rewrite patterns. Below is some sample C++-like pseudocode for what I envision.

/// Abstract parent class for execution models, templated
/// by the operation type which it simulates and the type
/// of a data-structure the component uses to maintain its
/// internal state (and which may be `void`).
template <typename Op, typename State>
class ExecutionModel {
public:
  virtual ~ExecutionModel() = default;

  /// May be called by concrete execution models to
  /// determine whether we are on a rising edge of the clock.
  /// Non-virtual: concrete models are not meant to override it.
  bool isClockRisingEdge() { ... }

  /// The execution function that simulates the operation.
  /// Pure virtual: every concrete model must implement it.
  virtual void exec(
    Op op /* the simulated MLIR operation */,
    State &state /* the operation's current internal state */,
    InputReader &reader /* a way to query for the state of the operation's operands */,
    OutputWriter &writer /* a way to modify the state of the operation's outputs */
  ) = 0;
};

/// Example state and model for a multiplexer.
class MuxState { ... };
class MuxModel : public ExecutionModel<handshake::MuxOp, MuxState> {
public:
  void exec(handshake::MuxOp op, MuxState &state, InputReader &reader, OutputWriter &writer) override { ... }
};
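
To make the interface above a bit more concrete, here is a minimal, MLIR-free sketch of what a purely combinational mux model could look like. Everything in it is an assumption made for illustration: `ChannelState`, `Ports` (which collapses `InputReader` and `OutputWriter` into one struct), `NoState`, and the handshake behavior of the mux itself, which is one plausible choice and not necessarily what Dynamatic's RTL does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical state of one handshake channel (two-state logic only).
struct ChannelState {
  bool valid = false;     // driven by the channel's producer
  bool ready = false;     // driven by the channel's consumer
  std::uint64_t data = 0; // driven by the channel's producer
};

// Hypothetical stand-in for InputReader/OutputWriter: the channels
// attached to one operation.
struct Ports {
  std::vector<ChannelState> operands;
  std::vector<ChannelState> results;
};

// Simplified, MLIR-free analogue of the abstract execution model.
template <typename State>
class ExecutionModel {
public:
  virtual ~ExecutionModel() = default;
  virtual void exec(State &state, Ports &ports, bool clockRisingEdge) = 0;
};

// The mux sketched here is purely combinational and keeps no state.
struct NoState {};

// Operand 0 is the select channel, operands 1..N are the data inputs.
class MuxModel : public ExecutionModel<NoState> {
public:
  void exec(NoState &, Ports &ports, bool /*clockRisingEdge*/) override {
    assert(ports.operands.size() >= 2 && ports.results.size() == 1);
    ChannelState &select = ports.operands[0];
    ChannelState &out = ports.results[0];

    // Reset every signal this model drives before recomputing it.
    out.valid = false;
    for (ChannelState &operand : ports.operands)
      operand.ready = false;

    // Without a valid, in-range select token nothing can be forwarded.
    if (!select.valid || select.data >= ports.operands.size() - 1)
      return;

    ChannelState &in = ports.operands[1 + select.data];
    out.valid = in.valid;
    out.data = in.data;

    // Both tokens are consumed only when the output transfer happens.
    const bool transfer = out.valid && out.ready;
    in.ready = transfer;
    select.ready = transfer;
  }
};
```

Since the model only reads and writes the channels attached to its operation, it can be exercised in isolation, without a simulator around it.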

[x] Create a way to associate each operation in the input IR to an execution model. Eventually we will want something very customizable. Initially, however, we can just statically associate execution models to operations based on their type (533da29).

This time, some pseudocode inspired by how one creates operations in MLIR.

// The Handshake function to simulate
handshake::FuncOp funcOp = ...;

HandshakeSimulator simulator(funcOp);
for (Operation& op : funcOp.getOps()) {
  // In reality we would check for all supported operation types
  // with llvm::TypeSwitch most likely
  if (auto muxOp = dyn_cast<handshake::MuxOp>(op)) {
    // Constructs an instance of MuxModel which will be associated
    // to muxOp during simulation of funcOp
    simulator.registerModel<MuxModel>(muxOp);
  }
}
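
As a rough idea of how the static, type-based association could later become more customizable, the sketch below keeps a table of model factories keyed by operation name. It is MLIR-free and purely illustrative: `ModelBase`, `ModelRegistry`, `describe`, and the use of plain strings such as "handshake.mux" as keys are all assumptions, not part of the proposed API.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical common base class for all execution models.
struct ModelBase {
  virtual ~ModelBase() = default;
  virtual std::string describe() const = 0;
};

struct MuxModel : ModelBase {
  std::string describe() const override { return "mux model"; }
};

struct BufferModel : ModelBase {
  std::string describe() const override { return "buffer model"; }
};

// Maps an operation name to a factory that builds its execution model.
class ModelRegistry {
public:
  // Registers (or replaces) the model type used for an operation name.
  template <typename Model>
  void registerModel(const std::string &opName) {
    assert(!opName.empty() && "operation name must not be empty");
    factories[opName] = [] { return std::unique_ptr<ModelBase>(new Model()); };
  }

  // Builds a fresh model instance, or returns nullptr when no model
  // was registered for the operation name.
  std::unique_ptr<ModelBase> create(const std::string &opName) const {
    auto it = factories.find(opName);
    if (it == factories.end())
      return nullptr;
    return it->second();
  }

private:
  std::unordered_map<std::string,
                     std::function<std::unique_ptr<ModelBase>()>>
      factories;
};
```

Replacing a model then amounts to calling `registerModel` again with a different model type for the same operation name, which matches the flexibility requirement listed earlier.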

[x] Implement a few concrete execution models to gain some confidence that our design is good enough to express a range of possible RTL implementations. For example, and for each of the following categories, we could implement the execution model of one operation whose RTL implementation fits the description (533da29).
- A simple purely combinational operation (i.e., without clock).
- A simple sequential operation (i.e., which does things on the clock rising edge).
- An operation which "instantiates" another one within it (e.g., an adder instantiating a join inside).
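
For the sequential category, a one-slot buffer is probably the smallest interesting candidate. The sketch below is a standalone illustration under assumed names (`Channel`, `BufferState`, `OneSlotBufferModel`) and an assumed behavior (the slot can be refilled in the same cycle it is drained); it is not taken from an existing RTL implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical handshake channel: valid/data flow forward, ready backward.
struct Channel {
  bool valid = false;
  bool ready = false;
  std::uint64_t data = 0;
};

// Internal state of the buffer: a single storage slot.
struct BufferState {
  bool full = false;
  std::uint64_t data = 0;
};

class OneSlotBufferModel {
public:
  // Sequential logic runs only on the rising edge; the combinational
  // outputs are recomputed on every invocation.
  void exec(BufferState &state, Channel &in, Channel &out,
            bool risingEdge) const {
    if (risingEdge) {
      // Transfers are decided from the signals as they stood just
      // before the clock edge.
      const bool accept = in.valid && in.ready;
      const bool drain = out.valid && out.ready;
      // A full slot may only be overwritten if it drains in this cycle.
      assert(!(state.full && accept && !drain));
      if (drain)
        state.full = false;
      if (accept) {
        state.full = true;
        state.data = in.data;
      }
    }
    // The stored token is offered downstream; a new one is accepted
    // when the slot is empty or is being drained.
    out.valid = state.full;
    out.data = state.data;
    in.ready = !state.full || out.ready;
  }
};
```

Keeping the slot in a separate `BufferState` object mirrors the `State` template parameter of the execution model API sketched earlier.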

Event-driven simulation loop

Once we have execution models, we can implement the simulator's core, in charge of invoking them as needed throughout the circuit's execution to simulate all of the combinational and sequential logic.

[x] At the center of the simulator is an event-driven simulation loop, which is tasked to invoke the execution model of each operation in the IR every time there is a change in the circuit that may affect the corresponding MLIR operation's internal state or outputs. There are two slightly different types of changes to support (533da29).
- A state change in one of the operation's operands.
- A rising edge of the clock, which may trigger specific logic inside each execution model.
The event-driven simulator may be seen as two nested loops that iterate (potentially infinitely) as long as there is a change in the circuit.
- Each iteration of the outer loop simulates an entire clock cycle, starting with the rising edge of the clock. The execution model of each IR operation should be called to simulate any sequential logic (even if just to update the model's internal state), and all resulting state changes in the circuit should be recorded.
- Each iteration of the inner loop simulates the propagation of signals within a clock cycle. Initially, the execution model of operations whose internal state or outputs may be affected by a state change that happened at the clock's rising edge should be invoked. Any state change that those invocations may themselves create should recursively trigger the execution of potentially affected operations. This repeats until the circuit reaches a stable state. In the future it may be useful to detect unstable states here. Initially we can just set a large yet reachable iteration limit that will trigger a simulation failure instead of letting the simulator run indefinitely.
[x] Implement the "testbench wrapper" around the simulated Handshake function, which should provide input tokens to the circuit and be able to read the circuit's output tokens at the end of simulation (533da29).
[x] Test the implementation so far with very simple circuits, using as few execution models as possible (we may need to define some in addition to the one we tested with initially to simulate on semi-interesting circuits). At this point the simulation flow should work for operations whose execution models we implemented (533da29).
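
The two nested loops described above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the actual implementation: signals are plain integers, `Component`, `Simulator`, and `simulateCycle` are hypothetical names, and on the rising edge every component reads a snapshot of the pre-edge signals so that the order in which models are invoked does not matter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

using Signals = std::vector<int>;

struct Component {
  // Signals whose change requires re-running this component.
  std::vector<std::size_t> inputs;
  // Reads `in`, writes `out`; `risingEdge` selects sequential logic.
  std::function<void(const Signals &in, Signals &out, bool risingEdge)> exec;
};

class Simulator {
public:
  Signals signals;

  Simulator(std::size_t numSignals, std::vector<Component> comps,
            std::size_t innerLimit = 10000)
      : signals(numSignals, 0), components(std::move(comps)),
        limit(innerLimit) {}

  // One outer-loop iteration: simulates a full clock cycle and returns
  // true if any signal differs from its value before the cycle.
  bool simulateCycle() {
    const Signals atStart = signals;
    std::deque<std::size_t> worklist;
    // Rising edge: every component runs against the pre-edge snapshot.
    for (std::size_t c = 0; c < components.size(); ++c)
      run(c, &atStart, worklist);
    // Inner loop: propagate changes until the circuit is stable.
    std::size_t iterations = 0;
    while (!worklist.empty()) {
      if (++iterations > limit)
        throw std::runtime_error("circuit did not reach a stable state");
      const std::size_t c = worklist.front();
      worklist.pop_front();
      run(c, nullptr, worklist);
    }
    return signals != atStart;
  }

private:
  // Runs one component and schedules every component that listens to
  // a signal the run has changed.
  void run(std::size_t c, const Signals *edgeView,
           std::deque<std::size_t> &worklist) {
    assert(c < components.size());
    const Signals before = signals;
    const bool risingEdge = edgeView != nullptr;
    components[c].exec(risingEdge ? *edgeView : before, signals, risingEdge);
    for (std::size_t s = 0; s < signals.size(); ++s) {
      if (signals[s] == before[s])
        continue;
      for (std::size_t other = 0; other < components.size(); ++other) {
        const std::vector<std::size_t> &in = components[other].inputs;
        if (std::find(in.begin(), in.end(), s) != in.end())
          worklist.push_back(other);
      }
    }
  }

  std::vector<Component> components;
  std::size_t limit;
};
```

A combinational loop that never settles makes the worklist refill forever, which is why the inner loop gives up with an exception once the iteration limit is exceeded. A driver would call `simulateCycle` repeatedly and stop once it returns false, i.e., once a full cycle no longer changes anything.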

We should now be able to simulate using a simple API like the following.

// As initialized in the subsection above
HandshakeSimulator simulator = ...; 

// <register simulation listeners here> (see subsection below)

SimulationArguments simArgs = ...;
SimulationResults simResults = ...;
simulator.simulate(simArgs... /*function arguments*/, &simResults /*function results */);

Instrumentation capabilities

As mentioned, one of the Handshake simulator's key benefits is the ability to extract information related to the circuit's state throughout simulation.

[ ] Define "simulation hooks" that allow users to register listeners for specific events happening in the simulation. Listeners are callback functions taking some relevant subset of the circuit state as input.

Here are a couple attempts at defining hooks for things that a user may care about.

// Trigger the callback on each rising edge of the clock
simulator.onClockRisingEdge([&](const CircuitState& state) { ... });

// ----

// A value in the Handshake function being simulated
Value someValue = ...;
// Trigger the callback whenever the state associated to the SSA value changes
simulator.onStateChange(someValue, [&](ValueState oldState, ValueState newState) { ... });

// ----

// A mux in the Handshake function being simulated
handshake::MuxOp muxOp = ...;
// Trigger the callback whenever the state of any of the operation's results changes
using OperandValues = DenseMap<OpOperand*, ValueState>;
using ResultValues = DenseMap<OpResult, ValueState>;
simulator.onStateChange(muxOp, [&](ResultValues oldOutputs, OperandValues newInputs, ResultValues newOutputs) { ... });
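
One way the listener machinery behind such hooks could be organized is sketched below, without MLIR. All names are assumptions made for illustration: `HookRegistry`, `setValue`, and `notifyClockRisingEdge` are hypothetical, values are identified by strings instead of SSA values, and `ValueState` is reduced to an integer.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using ValueState = int;                               // hypothetical
using CircuitState = std::map<std::string, ValueState>; // hypothetical

class HookRegistry {
public:
  using EdgeListener = std::function<void(const CircuitState &)>;
  using ChangeListener =
      std::function<void(ValueState oldState, ValueState newState)>;

  // Registers a callback invoked on each rising edge of the clock.
  void onClockRisingEdge(EdgeListener listener) {
    assert(static_cast<bool>(listener) && "listener must be callable");
    edgeListeners.push_back(std::move(listener));
  }

  // Registers a callback invoked whenever the named value changes.
  void onStateChange(const std::string &value, ChangeListener listener) {
    assert(static_cast<bool>(listener) && "listener must be callable");
    changeListeners[value].push_back(std::move(listener));
  }

  // Called by the simulator at the start of each clock cycle.
  void notifyClockRisingEdge() const {
    for (const EdgeListener &listener : edgeListeners)
      listener(states);
  }

  // Called by the simulator when a model writes a value; listeners
  // fire only if the state actually changes. Unseen values start at 0.
  void setValue(const std::string &value, ValueState newState) {
    ValueState &slot = states[value];
    if (slot == newState)
      return;
    const ValueState oldState = slot;
    slot = newState;
    auto it = changeListeners.find(value);
    if (it == changeListeners.end())
      return;
    for (const ChangeListener &listener : it->second)
      listener(oldState, newState);
  }

private:
  CircuitState states;
  std::vector<EdgeListener> edgeListeners;
  std::map<std::string, std::vector<ChangeListener>> changeListeners;
};
```

Routing every write through a single function like `setValue` is what would let the simulator both detect changes for the event loop and notify listeners from one place.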

Writing all execution models

Finally, we will need to go through the cumbersome task of implementing execution models for all our dataflow components.

[x] Implement execution models for all dataflow components but the LSQ (separated because of its complexity).
[ ] Implement an execution model for the LSQ.

paolo-ienne commented 1 month ago

I confess that I have not managed to read the whole of this. Yet I would like to add a couple of comments (not necessarily in contradiction with the above): (1) I am not convinced that it is useful or convenient for users to model the behaviour of components at a higher level than RTL, because this will always come in addition to RTL modelling and keeping the two models consistent will be tricky. (2) RTL-level components do not seem exaggeratedly complex compared to other forms of modelling, except for math operations; in fact, they are pretty straightforward. Maybe some fast RTL simulators could leverage a high-level description of math operations without gate-level implementations. (3) Fast RTL-level simulators have goals similar to this project's (zero delay, binary simulation) and Verilator seems pretty strong in this arena; maybe we should think of this project more in terms of "how can we make the best use of Verilator for high-speed simulation of our dataflow circuits".

Jiahui17 commented 1 month ago

I think this sort of simulation does more or less the same thing as SystemC (Andrea mentioned the concern after the student presented the first version of the simulator).

https://github.com/accellera-official/systemc

So why not create an export-sc tool and simulate everything in SystemC?
