[Handshake] Adding completion signals to store operations

rpirayadi commented 1 week ago

Currently, the handshake::LSQStoreOp and handshake::MCStoreOp do not provide a completion signal. More precisely, they produce only two results (i.e. address and data value which are fed to handshake::LSQOp and handshake::MemoryControllerOp recpectively). Generally speaking, for keeping track of memory consistency, it is nice to have a completion signal. Moreover, I would like to sequentialize load and stores without using the LSQ, so I need the store to produce this signal. This way, I can make sure the successor doesn't start before the store finishes.

The challenges I believe exist for adding this signal are listed below.

The completion signal should be produced in handshake::LSQOp and handshake::MemoryControllerOp for every store connected to them.
There should be a sink inserted for the output of store operation anywhere this signal is not used (which is probably everywhere except my use case).
The new signal should also be taken into consideration when translating handshake to HW.

I would appreciate any insights related to the procedure of adding such signal to store operations from @lucas-rami or anyone else who has a comment.

small side note: As mentioned here handshake::LSQStoreOp and handshake::MCStoreOp may be merged into one handshaek::StoreOp and the signal could be add to this operation.

AyaElAkhras commented 1 week ago

I totally get why this completion signal is necessary in your use case, but I think it is important even beyond your use case. Specifically, I think such a completion signal should be, somehow, involved in the logic of calculating the mem_end signal explained in the Memory Controls section here https://github.com/EPFL-LAP/dynamatic/blob/main/docs/Specs/CircuitInterface.md#start-to-end-path

Currently, this mem_end is calculated using the network of CMerges once the sequential control flow has reached the end. Although perfectly correct, the network of CMerges has proven to dramatically limit the performance of the circuit.

Apologies if I'm adding noise to Rouzbeh, but I think this discussion is not too irrelevant and it would be useful to get insights about how realistic it is to have this completion signal and use it to calculate the mem_end as well. I wonder if this idea has been ever discussed?

paolo-ienne commented 6 days ago

Currently, this mem_end is calculated using the network of CMerges once the sequential control flow has reached the end. Although perfectly correct, [...]

Is it correct? If mem_start and mem_end are there to connect correctly different circuits or a circuit and the host, then, before the successor circuit/host can do a load, I need to be sure that all stores of the predecessor have completed. Reaching the end of the control flow graph may be a reasonable approximation of that, but not quite: indeed, we need that (to be sure that there will be no more stores), but we also need to know that the last store has completed.

I suppose today mem_end, if it is implemented as @AyaElAkhras suggests, is a simple approximation good for most practical purposes (or even correct with BRAMs, maybe) but not truly correct.

AyaElAkhras commented 6 days ago

Is it correct? If mem_start and mem_end are there to connect correctly different circuits or a circuit and the host, then, before the successor circuit/host can do a load, I need to be sure that all stores of the predecessor have completed. Reaching the end of the control flow graph may be a reasonable approximation of that, but not quite: indeed, we need that (to be sure that there will be no more stores), but we also need to know that the last store has completed.

I suppose today mem_end, if it is implemented as @AyaElAkhras suggests, is a simple approximation good for most practical purposes (or even correct with BRAMs, maybe) but not truly correct.

Yes, I see your point: not truly accurate but a good approximation for our memory system and benchmarks.

My point was to say that the completion signals are more fundamental even beyond Rouzbeh's use case, and I think they can be used to get rid of the network of CMerges completely by adding a minimal network circulating such signals to calculate the mem_end. I can make my idea more concrete and show some examples at some point, once we have the completion signals.

lucas-rami commented 5 days ago

Thanks for bringing this up, I think this is a good idea and I only have one small implementation concern. Perhaps some people care deeply about having store ports which do not have this completion signal. This does not include me and I don't know why somebody would care very much given that one can simple ignore it, but I'm CC-ing people who potentially may care just in case (please shout if you do!) @Jiahui17 @Carmine50 @murphe67 @qianxu1998.

If nobody minds the extra signals, you handshake::StoreOp's operands/results would now look like this (in Tablegen).

let arguments = (ins IntChannelType:$address, ChannelType:$data, ControlType:$memDone);
let results = (outs IntChannelType:$addressResult, ChannelType:$dataResult, ControlType:$memDoneResult);

If people mind you will have to make these extra argument/result optional.

let arguments = (ins IntChannelType:$address, ChannelType:$data, Optional<ControlType>:$memDone);
let results = (outs IntChannelType:$addressResult, ChannelType:$dataResult, Optional<ControlType>:$memDoneResult);

You would also need an operation verifier to make sure that either both the argument and result signals exist or none exist.

For the challenges you outline.

The completion signal should be produced in handshake::LSQOp and handshake::MemoryControllerOp for every store connected to them.

Yes. Now keep in mind that our memory interfaces' RTL implementations are generally terrible so if you are going to add this your choices are either to stack your change half-hazardly on top of the mess or first try to somewhat fix the mess to build on stronger foundations. There are two things that are especially bad.

Ports to/from the circuit are not "proper handshake". The literal connectivity is handshake, but "handshake's invariants" are not honored in unclear ways (which is why we have to hide buffers inside the memory access ports for example).
Ports to/from external memories are just not handshake at all. It's basically a simplified BRAM interface that assumes fixed latencies.

Fixing this is most likely not trivial, and my engineering skills are quite limited in this area. I would be very appreciative if this was fixed, but I understand this may not be a priority.

There should be a sink inserted for the output of store operation anywhere this signal is not used (which is probably everywhere except my use case).

Handled automatically by the fork/sink materialization pass, nothing to do here.

The new signal should also be taken into consideration when translating handshake to HW.

Handled automatically as well. You would just need to modify the RTL component's implementation and possibly the RTL config as well if the extra operand/result are optional (in that case there is also a one-line change to make in HandshakeToHW).

Regarding the mem_end question, I let you all decide what makes sense. I am obviously ok with whatever is functionally correct so I have nothing to add. Let me just acknowledge that the implementation of the mem_end logic inside our memory interfaces was half-baked by yours truly during my coffee-fueled last official weeks on the project, so I would not give it any kind of confidence above the bare minimum. Surprisingly, it seems to work sort of ok (as in nothing explodes) but it is for sure not actually correct; as soon as the control network signals that the last BB is reached and there are no pending requests the signal is asserted. This may be accurate in 90% of our benchmarks, but only by sheer luck. If someone decides to fix out interfaces' idiosyncrasies that will need a rework as well, and at that point it will be possible to generate this signal reliably.

murphe67 commented 4 days ago

I have a project I was going to pitch to Lana soon (possibly overlapping with what @rpirayadi is already doing) that this would benefit, so at first glance it seems like the work involved to make it possible would be worth it for several reasons.

Jiahui17 commented 3 days ago

Hi all! I like the plan and I just want to add some clarifications here:

...indeed, we need that (to be sure that there will be no more stores), but we also need to know that the last store has completed.

@paolo-ienne This is correct, now it is solved in a slightly hacky way: In the default CfToHandshake-generated circuits, every BB with >= 1 store port to a memory controller will fork a control token (from the network of CMerges) to register the number of upcoming pending stores in a counter in the memory controller (counted down whenever a store is completed). The program returns if all 3 criteria hold:

Return value accepted (ignored if the function is void),
All pending stores are completed (i.e., counter is zero),
A returning BB has been reached.

Using the completion signal seems more natural as it makes the return mechanism transparent to the circuit (now the circuit side has no clue how many stores it has to wait before it can return); this is also necessary for a fast-token circuit not to terminate too early (@AyaElAkhras, @pcineverdies ), so this change is a win-win. But for now, I can't think of how to handle these completion signals...

rpirayadi commented 22 hours ago

I plan to start adding the completion signal to the new handshake::StoreOp added recently in here.

This is what @lucas-rami said earlier.

Yes. Now keep in mind that our memory interfaces' RTL implementations are generally terrible so if you are going to add this your choices are either to stack your change half-hazardly on top of the mess or first try to somewhat fix the mess to build on stronger foundations. There are two things that are especially bad.

For now, I do not plan to approach the problem in the most appropriate way, so, I won't attempt to fix the problems in RTL. What I plan to do is to add a small circuit to the current RTL implementation of the memory controller (and if possible the LSQ) that simply sends the completion signal with a single-cycle delay after receiving the inputs which is correct assuming we are using BRAMs. This way, I can modify the implementation at higher levels and have the completion signal in the MLIR.

By doing this, there is the advantage that one can check whether there are any issues in adding the completion signal or not. It is of course possible to substitute this implementation with a more accurate one later.

EPFL-LAP / dynamatic

[Handshake] Adding completion signals to store operations #191