lucas-rami closed this pull request 2 days ago.
Thank you for your contribution! One immediate question: after this change, how do we account for the extra latency of a load port when it is connected to an LSQ?
Do you mean during buffer placement? I have removed the "pass-through" LSQ load RTL implementation in favor of the MC load, so they now share the same timing model, which has a 1-cycle latency.
> Do you mean during buffer placement?
Yes. I think this is how it works: When a load port is connected to an MC (+BRAM), it takes one cycle for the memory to return the result. But when a load port is connected to an LSQ (+BRAM), it takes a minimum of 4 (or 5) cycles, because the LSQ itself has some internal pipeline stages for the requests.
Maybe we can just add 3 cycles of latency if the operation is a load connected to an LSQ.
I think this is really a limitation of the timing model (it also doesn't account for a few other things, like the effect of sharing memory ports on the II; we need to open an issue on this): the 1-cycle latency (MC) or 4+ cycle latency (LSQ) is a characteristic of the memory interface rather than of the memory port.
Thanks for the insights, I had little idea of all of that.
> Maybe we can just add 3 cycles of latency if the operation is a load connected to an LSQ.
I can hack this around while we figure out something cleaner if you think this is a good idea.
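For concreteness, here is a minimal sketch of what that interim hack could look like. It assumes a hypothetical `getLoadLatency`/`getBaseLatency` pair in the buffer placement code (neither exists under these names) and uses the `handshake::findMemInterface` helper introduced by this PR with an assumed signature and return type:

```cpp
// Hypothetical sketch only: getLoadLatency and getBaseLatency are illustrative
// names, not existing Dynamatic APIs. handshake::findMemInterface is the
// helper introduced by this PR; its exact signature is assumed here.
static double getLoadLatency(handshake::LoadOp loadOp) {
  // 1 cycle from the timing model now shared by all load ports.
  double latency = getBaseLatency(loadOp);
  // The load's address output presumably feeds the memory network
  // (accessor name assumed).
  Operation *memIface = handshake::findMemInterface(loadOp.getAddressResult());
  if (isa_and_nonnull<handshake::LSQOp>(memIface))
    latency += 3; // account for the LSQ's internal request pipeline stages
  return latency;
}
```

The exact place where this check lives matters less than the point above: the extra cycles are a property of the interface the port reaches, not of the port itself.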
To make a wider point, I think our system for handling timing models is the biggest current limitation of Dynamatic. It is just really unusable and needs a big rework. Now that we have a proper DOT parser in the project we could envision encoding semi-abstract timing models in DOT format (with full port-to-port timing characteristics) and parsing that instead of our current weird JSON. If there is some interest from someone in doing this, I don't mind giving this some of my time as well (for the DOT representation/parsing/buffer placement adaptation).
The operations representing load and store ports to memory interfaces are currently different based on the nature of the memory interface they connect to. We have two types of memory interfaces (MC and LSQ), so that is a total of 4 ports (2 load ports and 2 store ports). This design choice was made relatively early in the project's development and was motivated by the fact that these ports' respective RTL implementations were different based on whether they connected to an MC or an LSQ. At the Handshake level, however, both types of load ports have the same interface and behave in exactly the same way; the same applies to both types of store ports. The differentiation causes some code duplication and complicates IR transformations of the memory network.
This unifies the `handshake::MCLoadOp` and `handshake::LSQLoadOp` operations into a single `handshake::LoadOp` operation, as well as the `handshake::MCStoreOp` and `handshake::LSQStoreOp` operations into a single `handshake::StoreOp` operation. Notable changes throughout the codebase include the following.

- Removal of the `handshake::LoadOpInterface` and `handshake::StoreOpInterface` interfaces, which were used to factor out common behavior between the two access ports for each access type. This is obviously no longer needed.
- Changes in the `handshake::MemoryPort` class hierarchy and in the `HandshakeReplaceMemoryInterfaces` pass.
- Addition of a `handshake::findMemInterface` function that runs a restricted BFS from a `Value` to identify the potential memory interface that it "connects" to. This is used in places where we previously used the specific type of a memory port to identify the type of memory interface it connected to (a rough illustrative sketch is shown below).

This change is related to the discussion in #185. It is also related to #135/#136, but does not solve the issue (interestingly, the benchmark no longer deadlocks but produces an incorrect result); this requires more investigation.
CC-ing people who might be interested in the change @rpirayadi @paolo-ienne @pcineverdies @Jiahui17 @lana555.