EPFL-LAP / dynamatic

DHLS (Dynamic High-Level Synthesis) compiler based on MLIR

[Handshake] Unify MC/LSQ memory ports #193

Closed lucas-rami closed 2 days ago

lucas-rami commented 5 days ago

The operations representing load and store ports to memory interfaces currently differ based on the nature of the memory interface they connect to. We have two types of memory interfaces (MC and LSQ), so that is a total of 4 ports (2 load ports and 2 store ports). This design choice was made relatively early in the project's development and was motivated by the fact that these ports' respective RTL implementations differed depending on whether they connected to an MC or an LSQ. At the Handshake level, however, both types of load ports have the same interface and behave in exactly the same way; the same applies to both types of store ports. The differentiation causes some code duplication and complicates IR transformations of the memory network.

This unifies the handshake::MCLoadOp and handshake::LSQLoadOp operations into a single handshake::LoadOp operation, as well as the handshake::MCStoreOp and handshake::LSQStoreOp operations into a single handshake::StoreOp operation. Notable changes throughout the codebase include the following.
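To illustrate the shape of the unification (a hypothetical sketch, not Dynamatic's actual C++/MLIR classes): transformations that previously had to match two load op kinds can now match a single op, with the interface kind determined by what the port connects to.

```python
# Hypothetical sketch of the unification's effect on transformation code.
# Class and field names are illustrative, not Dynamatic's actual API.
from dataclasses import dataclass

@dataclass
class LoadOp:
    """A single unified load port; the interface kind is a property of
    the memory interface it connects to, not of the op itself."""
    address: str
    interface: str  # "mc" or "lsq"

def collect_loads(ops):
    """A pass now matches one op type instead of both MCLoadOp and LSQLoadOp."""
    return [op for op in ops if isinstance(op, LoadOp)]

loads = collect_loads([LoadOp("a0", "mc"), LoadOp("a1", "lsq")])
```

The point is that downstream passes no longer need duplicated match-and-rewrite logic for the two port flavors.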

This change is related to the discussion in #185. It is also related to #135/#136, but does not solve that issue (interestingly, the benchmark no longer deadlocks but now produces an incorrect result). This requires more investigation.

CC-ing people who might be interested in the change @rpirayadi @paolo-ienne @pcineverdies @Jiahui17 @lana555.

Jiahui17 commented 5 days ago

Thank you for your contribution! I have an immediate question: how do we account for the extra latency of the load port when it is connected to an LSQ after this change?

lucas-rami commented 4 days ago

Do you mean during buffer placement? I have removed the "pass-through" LSQ load RTL implementation in favor of the MC load, so they now share the same timing model, which has a 1-cycle latency.

Jiahui17 commented 4 days ago

Do you mean during buffer placement?

Yes. I think this is how it works: When a load port is connected to an MC (+BRAM), it takes one cycle for the memory to return the result. But when a load port is connected to an LSQ (+BRAM), it takes a minimum of 4 (or 5) cycles, because the LSQ itself has some internal pipeline stages for the requests.

Maybe we can just +3 cycles latency if the operation is a load connected to an LSQ.


I think this is really a shortcoming of the timing model (it also doesn't account for a few other things, like the effect of sharing memory ports on II; we should open an issue on this): the 1-cycle latency (MC) or 4+ cycle latency (LSQ) is a characteristic of the memory interface rather than of the memory port.
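The point above can be sketched as a small latency lookup (names and the exact LSQ latency are illustrative assumptions from this thread, not Dynamatic's actual timing-model API): buffer placement would query the latency of the *interface* a load connects to, which also captures the "+3 cycles" adjustment.

```python
# Hypothetical sketch: load latency modeled as a property of the memory
# interface rather than of the load port operation itself.
MC_LOAD_LATENCY = 1   # MC + BRAM: one cycle for the memory to return the result
LSQ_LOAD_LATENCY = 4  # LSQ + BRAM: the LSQ adds internal pipeline stages

def load_latency(interface: str) -> int:
    """Return the cycle latency of a load given its memory interface kind."""
    latencies = {"mc": MC_LOAD_LATENCY, "lsq": LSQ_LOAD_LATENCY}
    return latencies[interface]

# The suggested quick fix amounts to adding 3 cycles for LSQ-connected loads:
extra_lsq_cycles = load_latency("lsq") - load_latency("mc")
```

With the unified LoadOp, this keeps the op definition interface-agnostic while buffer placement still sees the correct latency.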

lucas-rami commented 4 days ago

Thanks for the insights; I wasn't aware of most of that.

Maybe we can just +3 cycles latency if the operation is a load connected to an LSQ.

If you think this is a good idea, I can hack this in while we figure out something cleaner.


To make a wider point, I think our system for handling timing models is currently Dynamatic's biggest limitation. It is really hard to use and needs a big rework. Now that we have a proper DOT parser in the project, we could envision encoding semi-abstract timing models in DOT format (with full port-to-port timing characteristics) and parsing that instead of our current weird JSON. If someone is interested in doing this, I don't mind giving it some of my time as well (for the DOT representation/parsing/buffer placement adaptation).
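To give a feel for the idea (the DOT schema below is purely an assumption for illustration, not an agreed-upon Dynamatic format): ports become graph nodes, and edges carry port-to-port timing attributes that a placement pass can query.

```python
import re

# Hypothetical DOT encoding of a unit's timing model: nodes are ports,
# edges carry port-to-port latency (cycles) and delay (ns) attributes.
DOT_MODEL = """
digraph load_timing {
  "addr_in" -> "data_out" [latency=1, delay=0.8];
}
"""

# Minimal hand-rolled extraction for the sketch; a real implementation
# would reuse the project's existing DOT parser.
EDGE_RE = re.compile(
    r'"(?P<src>[^"]+)"\s*->\s*"(?P<dst>[^"]+)"\s*'
    r'\[latency=(?P<lat>\d+),\s*delay=(?P<delay>[\d.]+)\]'
)

def parse_timing_edges(dot: str):
    """Extract (src, dst, latency_cycles, delay_ns) tuples from a DOT model."""
    return [
        (m["src"], m["dst"], int(m["lat"]), float(m["delay"]))
        for m in EDGE_RE.finditer(dot)
    ]

edges = parse_timing_edges(DOT_MODEL)
```

The appeal over the current JSON is that the graph structure makes port-to-port characteristics explicit instead of implied by naming conventions.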