Open rpirayadi opened 1 week ago
I totally get why this completion signal is necessary in your use case, but I think it is important even beyond your use case. Specifically, I think such a completion signal should be, somehow, involved in the logic of calculating the mem_end
signal explained in the Memory Controls section here https://github.com/EPFL-LAP/dynamatic/blob/main/docs/Specs/CircuitInterface.md#start-to-end-path
Currently, this mem_end
is calculated using the network of CMerges once the sequential control flow has reached the end. Although perfectly correct, the network of CMerges has proven to dramatically limit the performance of the circuit.
Apologies if I'm adding noise to Rouzbeh, but I think this discussion is not too irrelevant and it would be useful to get insights about how realistic it is to have this completion signal and use it to calculate the mem_end
as well. I wonder if this idea has been ever discussed?
Currently, this mem_end is calculated using the network of CMerges once the sequential control flow has reached the end. Although perfectly correct, [...]
Is it correct? If mem_start
and mem_end
are there to connect correctly different circuits or a circuit and the host, then, before the successor circuit/host can do a load, I need to be sure that all stores of the predecessor have completed. Reaching the end of the control flow graph may be a reasonable approximation of that, but not quite: indeed, we need that (to be sure that there will be no more stores), but we also need to know that the last store has completed.
I suppose today mem_end
, if it is implemented as @AyaElAkhras suggests, is a simple approximation good for most practical purposes (or even correct with BRAMs, maybe) but not truly correct.
Is it correct? If
mem_start
andmem_end
are there to connect correctly different circuits or a circuit and the host, then, before the successor circuit/host can do a load, I need to be sure that all stores of the predecessor have completed. Reaching the end of the control flow graph may be a reasonable approximation of that, but not quite: indeed, we need that (to be sure that there will be no more stores), but we also need to know that the last store has completed.I suppose today
mem_end
, if it is implemented as @AyaElAkhras suggests, is a simple approximation good for most practical purposes (or even correct with BRAMs, maybe) but not truly correct.
Yes, I see your point: not truly accurate but a good approximation for our memory system and benchmarks.
My point was to say that the completion signals are more fundamental even beyond Rouzbeh's use case, and I think they can be used to get rid of the network of CMerges completely by adding a minimal network circulating such signals to calculate the mem_end
. I can make my idea more concrete and show some examples at some point, once we have the completion signals.
Thanks for bringing this up, I think this is a good idea and I only have one small implementation concern. Perhaps some people care deeply about having store ports which do not have this completion signal. This does not include me and I don't know why somebody would care very much given that one can simple ignore it, but I'm CC-ing people who potentially may care just in case (please shout if you do!) @Jiahui17 @Carmine50 @murphe67 @qianxu1998.
If nobody minds the extra signals, you handshake::StoreOp
's operands/results would now look like this (in Tablegen).
let arguments = (ins IntChannelType:$address, ChannelType:$data, ControlType:$memDone);
let results = (outs IntChannelType:$addressResult, ChannelType:$dataResult, ControlType:$memDoneResult);
If people mind you will have to make these extra argument/result optional.
let arguments = (ins IntChannelType:$address, ChannelType:$data, Optional<ControlType>:$memDone);
let results = (outs IntChannelType:$addressResult, ChannelType:$dataResult, Optional<ControlType>:$memDoneResult);
You would also need an operation verifier to make sure that either both the argument and result signals exist or none exist.
For the challenges you outline.
The completion signal should be produced in
handshake::LSQOp
andhandshake::MemoryControllerOp
for every store connected to them.
Yes. Now keep in mind that our memory interfaces' RTL implementations are generally terrible so if you are going to add this your choices are either to stack your change half-hazardly on top of the mess or first try to somewhat fix the mess to build on stronger foundations. There are two things that are especially bad.
Fixing this is most likely not trivial, and my engineering skills are quite limited in this area. I would be very appreciative if this was fixed, but I understand this may not be a priority.
There should be a sink inserted for the output of store operation anywhere this signal is not used (which is probably everywhere except my use case).
Handled automatically by the fork/sink materialization pass, nothing to do here.
The new signal should also be taken into consideration when translating handshake to HW.
Handled automatically as well. You would just need to modify the RTL component's implementation and possibly the RTL config as well if the extra operand/result are optional (in that case there is also a one-line change to make in HandshakeToHW
).
Regarding the mem_end
question, I let you all decide what makes sense. I am obviously ok with whatever is functionally correct so I have nothing to add. Let me just acknowledge that the implementation of the mem_end
logic inside our memory interfaces was half-baked by yours truly during my coffee-fueled last official weeks on the project, so I would not give it any kind of confidence above the bare minimum. Surprisingly, it seems to work sort of ok (as in nothing explodes) but it is for sure not actually correct; as soon as the control network signals that the last BB is reached and there are no pending requests the signal is asserted. This may be accurate in 90% of our benchmarks, but only by sheer luck. If someone decides to fix out interfaces' idiosyncrasies that will need a rework as well, and at that point it will be possible to generate this signal reliably.
I have a project I was going to pitch to Lana soon (possibly overlapping with what @rpirayadi is already doing) that this would benefit, so at first glance it seems like the work involved to make it possible would be worth it for several reasons.
Hi all! I like the plan and I just want to add some clarifications here:
...indeed, we need that (to be sure that there will be no more stores), but we also need to know that the last store has completed.
@paolo-ienne This is correct, now it is solved in a slightly hacky way: In the default CfToHandshake-generated circuits, every BB with >= 1 store port to a memory controller will fork a control token (from the network of CMerges) to register the number of upcoming pending stores in a counter in the memory controller (counted down whenever a store is completed). The program returns if all 3 criteria hold:
void
),Using the completion signal seems more natural as it makes the return mechanism transparent to the circuit (now the circuit side has no clue how many stores it has to wait before it can return); this is also necessary for a fast-token circuit not to terminate too early (@AyaElAkhras, @pcineverdies ), so this change is a win-win. But for now, I can't think of how to handle these completion signals...
I plan to start adding the completion signal to the new handshake::StoreOp
added recently in here.
This is what @lucas-rami said earlier.
Yes. Now keep in mind that our memory interfaces' RTL implementations are generally terrible so if you are going to add this your choices are either to stack your change half-hazardly on top of the mess or first try to somewhat fix the mess to build on stronger foundations. There are two things that are especially bad.
For now, I do not plan to approach the problem in the most appropriate way, so, I won't attempt to fix the problems in RTL. What I plan to do is to add a small circuit to the current RTL implementation of the memory controller (and if possible the LSQ) that simply sends the completion signal with a single-cycle delay after receiving the inputs which is correct assuming we are using BRAMs. This way, I can modify the implementation at higher levels and have the completion signal in the MLIR.
By doing this, there is the advantage that one can check whether there are any issues in adding the completion signal or not. It is of course possible to substitute this implementation with a more accurate one later.
Currently, the
handshake::LSQStoreOp
andhandshake::MCStoreOp
do not provide a completion signal. More precisely, they produce only two results (i.e. address and data value which are fed tohandshake::LSQOp
andhandshake::MemoryControllerOp
recpectively). Generally speaking, for keeping track of memory consistency, it is nice to have a completion signal. Moreover, I would like to sequentialize load and stores without using the LSQ, so I need the store to produce this signal. This way, I can make sure the successor doesn't start before the store finishes.The challenges I believe exist for adding this signal are listed below.
handshake::LSQOp
andhandshake::MemoryControllerOp
for every store connected to them.I would appreciate any insights related to the procedure of adding such signal to store operations from @lucas-rami or anyone else who has a comment.
handshake::LSQStoreOp
andhandshake::MCStoreOp
may be merged into onehandshaek::StoreOp
and the signal could be add to this operation.