EPFL-LAP / dynamatic

DHLS (Dynamic High-Level Synthesis) compiler based on MLIR

[Handshake] Unify MC/LSQ memory ports #193

Closed lucas-rami closed 2 days ago

lucas-rami commented 5 days ago

The operations representing load and store ports to memory interfaces currently differ based on the nature of the memory interface they connect to. We have two types of memory interfaces (MC and LSQ), so that is a total of 4 ports (2 load ports and 2 store ports). This design choice was made relatively early in the project's development and was motivated by the fact that these ports' respective RTL implementations differed depending on whether they connected to an MC or an LSQ. At the Handshake level, however, both types of load ports have the same interface and behave in exactly the same way; the same applies to both types of store ports. The differentiation causes some code duplication and complicates IR transformations of the memory network.

This unifies the handshake::MCLoadOp and handshake::LSQLoadOp operations into a single handshake::LoadOp operation, as well as the handshake::MCStoreOp and handshake::LSQStoreOp operations into a single handshake::StoreOp operation. Notable changes throughout the codebase include the following.
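To illustrate the shape of the unification (a hypothetical sketch, not Dynamatic's actual C++/MLIR classes): transformations that previously had to match two load op kinds can now match a single op, with the interface kind determined by what the port connects to.

```python
# Hypothetical sketch of the unification's effect on transformation code.
# Class and field names are illustrative, not Dynamatic's actual API.
from dataclasses import dataclass

@dataclass
class LoadOp:
    """A single unified load port; the interface kind is a property of
    the memory interface it connects to, not of the op itself."""
    address: str
    interface: str  # "mc" or "lsq"

def collect_loads(ops):
    """A pass now matches one op type instead of both MCLoadOp and LSQLoadOp."""
    return [op for op in ops if isinstance(op, LoadOp)]

loads = collect_loads([LoadOp("a0", "mc"), LoadOp("a1", "lsq")])
```

The point is that downstream passes no longer need duplicated match-and-rewrite logic for the two port flavors.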

This change is related to the discussion in #185. It is also related to #135/#136, but does not solve that issue (interestingly, the benchmark no longer deadlocks but now produces an incorrect result). This requires more investigation.

CC-ing people who might be interested in the change @rpirayadi @paolo-ienne @pcineverdies @Jiahui17 @lana555.

Jiahui17 commented 5 days ago

Thank you for your contribution! I have an immediate question: how do we account for the extra latency of the load port when it is connected to an LSQ after this change?

lucas-rami commented 4 days ago

Do you mean during buffer placement? I have removed the "pass-through" LSQ load RTL implementation in favor of the MC load, so they now share the same timing model, which has a 1-cycle latency.

Jiahui17 commented 4 days ago

Do you mean during buffer placement?

Yes. I think this is how it works: When a load port is connected to an MC (+BRAM), it takes one cycle for the memory to return the result. But when a load port is connected to an LSQ (+BRAM), it takes a minimum of 4 (or 5) cycles, because the LSQ itself has some internal pipeline stages for the requests.

Maybe we can just +3 cycles latency if the operation is a load connected to an LSQ.


I think this is really a shortcoming of the timing model (it also doesn't account for a few other things, like the effect of sharing memory ports on II; we should open an issue on this): the 1-cycle latency (MC) or 4+ cycle latency (LSQ) is a characteristic of the memory interface rather than of the memory port.
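The point above can be sketched as a small latency lookup (names and the exact LSQ latency are illustrative assumptions from this thread, not Dynamatic's actual timing-model API): buffer placement would query the latency of the *interface* a load connects to, which also captures the "+3 cycles" adjustment.

```python
# Hypothetical sketch: load latency modeled as a property of the memory
# interface rather than of the load port operation itself.
MC_LOAD_LATENCY = 1   # MC + BRAM: one cycle for the memory to return the result
LSQ_LOAD_LATENCY = 4  # LSQ + BRAM: the LSQ adds internal pipeline stages

def load_latency(interface: str) -> int:
    """Return the cycle latency of a load given its memory interface kind."""
    latencies = {"mc": MC_LOAD_LATENCY, "lsq": LSQ_LOAD_LATENCY}
    return latencies[interface]

# The suggested quick fix amounts to adding 3 cycles for LSQ-connected loads:
extra_lsq_cycles = load_latency("lsq") - load_latency("mc")
```

With the unified LoadOp, this keeps the op definition interface-agnostic while buffer placement still sees the correct latency.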

lucas-rami commented 4 days ago

Thanks for the insights; I wasn't aware of most of that.

Maybe we can just +3 cycles latency if the operation is a load connected to an LSQ.

If you think this is a good idea, I can hack this in while we figure out something cleaner.


To make a wider point, I think our system for handling timing models is currently Dynamatic's biggest limitation. It is really hard to use and needs a big rework. Now that we have a proper DOT parser in the project, we could envision encoding semi-abstract timing models in DOT format (with full port-to-port timing characteristics) and parsing that instead of our current weird JSON. If someone is interested in doing this, I don't mind giving it some of my time as well (for the DOT representation/parsing/buffer placement adaptation).
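To give a feel for the idea (the DOT schema below is purely an assumption for illustration, not an agreed-upon Dynamatic format): ports become graph nodes, and edges carry port-to-port timing attributes that a placement pass can query.

```python
import re

# Hypothetical DOT encoding of a unit's timing model: nodes are ports,
# edges carry port-to-port latency (cycles) and delay (ns) attributes.
DOT_MODEL = """
digraph load_timing {
  "addr_in" -> "data_out" [latency=1, delay=0.8];
}
"""

# Minimal hand-rolled extraction for the sketch; a real implementation
# would reuse the project's existing DOT parser.
EDGE_RE = re.compile(
    r'"(?P<src>[^"]+)"\s*->\s*"(?P<dst>[^"]+)"\s*'
    r'\[latency=(?P<lat>\d+),\s*delay=(?P<delay>[\d.]+)\]'
)

def parse_timing_edges(dot: str):
    """Extract (src, dst, latency_cycles, delay_ns) tuples from a DOT model."""
    return [
        (m["src"], m["dst"], int(m["lat"]), float(m["delay"]))
        for m in EDGE_RE.finditer(dot)
    ]

edges = parse_timing_edges(DOT_MODEL)
```

The appeal over the current JSON is that the graph structure makes port-to-port characteristics explicit instead of implied by naming conventions.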