callumforrester commented 10 months ago

Problem

Simulating FPGAs is difficult. Tickit differs from the FPGA simulation on which it was based in the following ways.

Tick Scope

Tickit does not allow cyclic graphs because it considers a tick to be a single, instantaneous propagation of the entire device from the lowest down nodes that require wakeup. The following is a valid graph showing which nodes are visited in which ticks (t0, t1 and t2):

tickit-ticks

If a cycle were inserted anywhere then t0 would never end because the tick only ends when the output propagates all the way to D.

The FPGA design allows cycles by having a reduced scope for a tick, only ticking each subsequent node:

fpga-ticks(2)
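The two tick scopes can be contrasted with a minimal sketch. This is not tickit's scheduler code: the wiring, the function names, and the reading of an FPGA tick as a single hop are all assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical wiring: each node maps to the nodes it feeds.
WIRING = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}


def tickit_style_tick(wiring, woken):
    """One tick visits every node downstream of the woken nodes, in dependency order."""
    # graphlib expects node -> predecessors, so invert the wiring.
    preds = {node: set() for node in wiring}
    for src, dsts in wiring.items():
        for dst in dsts:
            preds[dst].add(src)
    # A cycle makes a full-propagation order impossible: CycleError is raised here.
    order = list(TopologicalSorter(preds).static_order())
    reachable = set(woken)
    visited = []
    for node in order:
        if node in reachable:
            visited.append(node)
            reachable.update(wiring[node])
    return visited


def fpga_style_tick(wiring, active):
    """One tick visits only the active nodes and returns the set to visit next tick."""
    visited = sorted(active)
    next_active = set()
    for node in visited:
        next_active.update(wiring[node])
    return visited, next_active
```

With a cycle such as D feeding back into A, `tickit_style_tick` raises `CycleError`, while `fpga_style_tick` simply keeps handing back a non-empty next set, one hop per tick.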

Temporal Relevance

In tickit a tick is assumed to be instantaneous: even if it is propagating a signal through a large and complex graph, no simulation time passes. An arbitrary number of ticks can also take place in zero simulated time. In the example below, only tick 3 takes place after 0 nanoseconds:

many-ticks(1)

Time is much more important to FPGA ticks: a delay is enforced between them, and "event x must happen n ticks (clock cycles) later than event y" is a valid use case.

The example below shows a simple traversal with no wakeups:

fpga-and-tickit-times(1)

The table below shows each propagation stage (tock) of each tick. A tock is defined as a single transfer of output of one node to input of another node, or an initial trigger of a node at the beginning of a tick.

Tick	Tock	Time (ns)	Node (Tickit)	Node (FPGA)
0	0	0	A	A
0	1	0	B	B
0	2	0	C	_
0	3	0	D	_
1	0	8	_	C
1	1	8	_	D
2	0	16	_	D

This implies one more visit to D than in the tickit model, showing a fundamental difference: the FPGA behaviour cannot be reproduced in tickit simply by inserting an artificial delay.

Causality

Both of the above examples show how tickit treats causality differently to the FPGA model. It traverses the whole graph instantly to propagate the consequences of events before any time can pass that introduces new events. Thus something at one end of a graph affects something at the other end in 0 sim time.

Proposed Solutions

Hybrid Schedulers

Write a child scheduler that works like the FPGA simulation. Keep the existing master scheduler for wiring all devices outside of the FPGA simulation and reconciling the outputs. See example below:

fpga-scheduler-bad

Several issues with this approach are illustrated here:

The two schedulers each have their own concept of ticks, which is uncontroversial and how the current NestedScheduler works, but they also each have their own separate concepts of simulated time.
The FPGA takes a number of nanoseconds to complete its tasks, while the master still treats its output as instantaneous and passes it onto the detectors.
From the FPGA's point of view the detectors should each be triggered at a different time: 16ns, 8ns and 0ns. From the master's point of view they are all triggered simultaneously. There is no clear way to tell who is right or if causality is affected. To the outside world (via device adapters) all detectors are triggered simultaneously.

These are not necessarily showstoppers as long as we accept this potential inaccuracy in the simulation.

Different Master Scheduler

Write a new master scheduler to be used for all simulations involving FPGAs, which makes everything time-sensitive. In this case, all ticks are propagated in the FPGA style. All node dependencies must have a delay of at least 1ns. There is still a separate scheduler for controlling FPGA simulations for performance reasons, but it and the master share a concept of simulated time.

fpga-scheduler-full

In this version the detectors are all triggered at different times and in the order the FPGA requires. The disadvantage is that these simulations are more restrictive and less generic. They can only accept simulated devices that have a concept of time and all causality is based around the FPGA, which makes it more difficult to simulate non-discrete-time entities such as the behaviour of the beam. That does, however, optimise tickit for the hardware triggered scanning use case.
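A rough sketch of what "everything time-sensitive" could look like is a single event queue ordered by simulated time, where every wire carries a delay of at least 1ns. The function name, the wiring format and the delays below are hypothetical and are not part of tickit.

```python
import heapq


def run(wiring, start, until_ns):
    """Propagate from `start` and return the (time_ns, node) visits up to `until_ns`.

    wiring: {source: [(destination, delay_ns), ...]} (hypothetical format).
    """
    # Enforce the rule that every dependency takes at least 1ns.
    for src, edges in wiring.items():
        for dst, delay_ns in edges:
            if delay_ns < 1:
                raise ValueError(f"{src}->{dst} must have a delay of at least 1ns")

    queue = [(0, start)]  # min-heap ordered by simulated time
    visits = []
    while queue:
        time_ns, node = heapq.heappop(queue)
        if time_ns > until_ns:
            break
        visits.append((time_ns, node))
        for dst, delay_ns in wiring.get(node, []):
            heapq.heappush(queue, (time_ns + delay_ns, dst))
    return visits


# Hypothetical example: one FPGA output fanning out to three detectors.
fan_out = {"panda": [("det1", 1), ("det2", 8), ("det3", 16)]}
print(run(fan_out, "panda", until_ns=100))
```

Because every hop advances simulated time, a cyclic wiring no longer stalls a tick; it just keeps scheduling events until `until_ns` is reached.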

callumforrester commented 10 months ago

@coretl would be interested in your opinions

callumforrester commented 10 months ago

For solution one: We need to wait until the FPGA updates all outputs and it's hard to know when outputs will arrive. Outputs may not be related to inputs.

callumforrester commented 10 months ago

More info from a call with @coretl

Propagation of Changes Outside the FPGA

Solution 2 makes sense, but we should be careful about propagation of consequences. See the example below where the detectors must be kept aware of the motor positions (for example, to generate simulated data).

fpga-clock(2)

Detector 1 is first triggered directly by X in t1, which is when it receives an update of X's position. It will then have to cache the position until it receives a trigger in t3, when it is actually supposed to take a frame. There is a danger that X has moved in that time, but if it has, it will generate extra ticks, which will propagate to the detector and allow it to update its cached value.

The case of Y to detector 2 is more complicated, with no link from Y updating to the PandA. The PandA just generates triggers every 4ns using a clock. That means that if Y is not updated, the clock could trigger the detector for many ticks. The detector would have an old value of Y and would produce the same data for each tick. Again, this should not matter because if Y is updated the detector should cache a new value.
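The caching behaviour described for both detectors might be sketched as follows. `CachingDetector` and its method names are invented for this illustration and do not correspond to any tickit class.

```python
from dataclasses import dataclass, field


@dataclass
class CachingDetector:
    """Hypothetical detector that remembers the last motor position it was told about."""

    cached_position: float = 0.0
    frames: list = field(default_factory=list)

    def on_position(self, position):
        # Called in any tick where the motor propagates a new position.
        self.cached_position = position

    def on_trigger(self, time_ns):
        # Called when the FPGA trigger arrives: take a frame using the cached value.
        self.frames.append((time_ns, self.cached_position))


detector = CachingDetector()
detector.on_position(1.0)
for time_ns in (0, 4, 8):  # clock-driven triggers with no update from Y in between
    detector.on_trigger(time_ns)
detector.on_position(2.0)  # Y finally updates
detector.on_trigger(12)
print(detector.frames)
```

The first three frames repeat the same position, which is the "old value of Y" case above; the fourth frame picks up the new value as soon as Y updates.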

Data Striation

This system works, but it does showcase where the event-driven nature of tickit clashes with the sampling-driven nature of a real experiment. If the FPGA triggers detector 2 at a higher frequency than Y is updated, detector 2 will contain striated data because it will use many cached versions of Y.

str-data(1)

This is okay if the trigger frequency of each device reflects the real world, but it may not due to CPU constraints. A motor cannot be simulated at the true rate of a pmac, for example. This is the problem that tickit's original zero-time-ticks are meant to solve.

The data striation represents the reason why this design makes the simulation more constrained/lower fidelity. The simulation is still useful even if less general-purpose.

Possible Solution

One possible solution is to add a mechanism for passing "curves" rather than scalar values to the detector, so it can evaluate Y on its own until Y updates it with a new curve. A curve could be a function against time or a lookup table. This may be a useful additional feature to add once we have this working.
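As a sketch of the "curve" idea, assuming the curve is a plain function of simulated time: the motor hands the detector a callable, and the detector evaluates it at each trigger instead of reusing a cached scalar. The names `linear_curve` and `CurveDetector` are hypothetical.

```python
def linear_curve(start_ns, start_position, velocity_per_ns):
    """Return a function of time describing constant-velocity motion (an assumed curve shape)."""

    def position_at(time_ns):
        return start_position + velocity_per_ns * (time_ns - start_ns)

    return position_at


class CurveDetector:
    """Hypothetical detector that evaluates the latest curve at each trigger time."""

    def __init__(self):
        self.curve = lambda time_ns: 0.0  # placeholder until the motor sends a curve
        self.frames = []

    def on_curve(self, curve):
        self.curve = curve

    def on_trigger(self, time_ns):
        self.frames.append((time_ns, self.curve(time_ns)))


detector = CurveDetector()
detector.on_curve(linear_curve(start_ns=0, start_position=10.0, velocity_per_ns=0.5))
for time_ns in (0, 4, 8):
    detector.on_trigger(time_ns)
print(detector.frames)  # [(0, 10.0), (4, 12.0), (8, 14.0)]
```

Each frame now gets a distinct position even though the motor only sent one update, which is what removes the striation. A lookup table could be wrapped in the same callable interface.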

garryod commented 9 months ago

Why doesn't the FPGA Scheduler just ask to be updated every 8ns until it is in a stable state, with knowledge of which blocks inside it must compute at the next step stored internally? This way the FPGA Scheduler doesn't need to "own" time and simulation accuracy is preserved

callumforrester commented 9 months ago

Not sure if there's a way to reliably detect that it is in a stable state, but I defer to @coretl

coretl commented 9 months ago

There is not a way to know when there is a stable state. An input to a PandA ripples through a series of blocks and may or may not produce an output. One thing we discussed is terminating the tick whenever it gets to a PandA which would then schedule a callback for 8ns' time until it was complete. This left us wondering if there was any value in the graph traversal of the standard scheduler, and whether it made more sense to make every transition take time, and do everything in the FPGA way.

garryod commented 9 months ago

Iirc, we had this conversation way back at the start and decided that adding delays to wires was going to cause far more issues than it solved. Surely if none of the blocks within the PandA change their state during a step then it would be considered stable?
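For reference, the "no block changed state during a step" criterion can be sketched as a fixed-point loop. Everything here is an assumption for illustration: blocks are modelled as functions from the previous state to a dict of updates, and the 8ns period is taken from the discussion above.

```python
def step_until_stable(blocks, state, period_ns=8, max_steps=100):
    """Step all blocks every `period_ns` until a step changes nothing.

    Returns (elapsed_ns, final_state), or raises if no stable state is reached.
    """
    time_ns = 0
    for _ in range(max_steps):
        new_state = dict(state)
        for block in blocks:
            # Every block reads the previous step's state, like clocked logic.
            new_state.update(block(state))
        if new_state == state:
            return time_ns, state
        state = new_state
        time_ns += period_ns
    raise RuntimeError("no stable state reached")


# Hypothetical two-stage ripple: b follows a, c follows b.
ripple = [lambda s: {"b": s["a"]}, lambda s: {"c": s["b"]}]
print(step_until_stable(ripple, {"a": 1, "b": 0, "c": 0}))

# A free-running clock block changes on every step, so this criterion never fires.
clock = [lambda s: {"clk": 1 - s["clk"]}]
```

The ripple example settles after two steps (16ns). The clock example never settles, so under this criterion it would need a cap such as `max_steps`, which may be part of why detecting stability is hard in practice.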

DiamondLightSource / tickit

Support FPGA Simulations #202
