Automated Load-Store-Queue Sizing

patschu11 commented 3 months ago

This PR adds a pass, which sizes the Load and Store Queues, to the minimal depth, which still allows optimal throughput. This is done by analyzing the timing of LSQ loads/stores for each CFDFC separately and using the maximum queue sizes of any CFDFC. The pass is run between the handshake-canonicalization and the handshake-hw conversion.

Important side notes

There were wrong latencies for some components in the components.json file, I fixed them to match the latencies I observed in the simulation.
Currently this algorithm only works with fpga20-legacy buffer placement, since in fpl22 and simple-buffer placement, not all the information is passed through the attributes. When not all necessary information is available, and therefore LSQ sizing cannot be done, the default depth will be assigned to the LSQs.

Overview

The following things were implemented:

The pass receives information on CFDFCs and their respective throughputs from the buffer-placement pass via Handshake attributes (see #105 ).
With this information the CFDFCs are reconstructed and the sizing algorithm determines the queue depths according to the following paper: Load-Store Queue Sizing for Efficient Dataflow Circuits
The sizing information is passed on as a Handshake attribute in the corresponding LSQ operation. The attribute is a tuple, with first the Load-Queue size and second the Store-Queue size. e.g. lsqDepth = #handshake<lsqDepth[3, 5]>
Furthermore the logic to read this information is added in the legacy backend and the new backend to pass it on to the LSQ generator. For this, the LSQGenerationInfo class has been modified to read the lsqDepth attribute. If the attribute does not exist, the default size (16) will be used for both the load and store queue.

Arguments

The pass needs the compontents.json file as a timingDB, which is passed similarly to buffer placement. Furthermore, it is possible to specify different collision scenarios to the pass, by the collisions=none/half/full argument. These assume one of the following cases.

none: None of the LSQ memory operations have collisions
half: Every other LSQ memory operation has collisions
full All of the LSQ memory operations have collisions

Details

Algorithm data structure

The sizing algorithm works on the different CFDFCs of the kernel, determined in buffer placement. The CFDFCs are represented, as an adjacency list (AdjListGraph made of AdjListNodes), since most of the algorithm steps are variations of path-finding between different nodes.

Detailed steps of the algorithm

The sizing algorithm inside the LSQ Sizing pass has the following steps:

Reconstuct the CFDFCs
Identifying Start Node and Phi Nodes
Adding additional edges to account for some edge cases
Find the earliest start time of each node and save it
Finding the worst-case II for each CFDFC
Set correct IIs for different collision scenarios for each CFDFC
Determine allocation times for each CFDFC
Determine deallocation times for each CFDFC
Calculate queue depths based on all the timing information and IIs from all CFDFCs

The implementation details are described in the semester thesis report (Automated Load-Store Queue Sizing in Dynamatic based on MLIR)

Results

The results match the one of the paper, namely the algorithm can find the optimal LSQ size for the steady-state, given that the correct sequence of IIs is provided, either by buffer placement or manually. The detailed results can be found in the semester thesis report: (Automated Load-Store Queue Sizing in Dynamatic based on MLIR)

Jiahui17 commented 2 months ago

@patschu11 Hi Patrick thank you so much for the contribution!

Before digging into the detailed stuff (i.e., coding style, naming, etc), I have a very high-level comment on the PR:

Why is the LSQ sizing a separate pass instead of inside the buffer placement pass (e.g., after instantiating all buffers in placeUsingMILP)? The buffer pass seems to have most of the functionalities that you have re-implemented in the PR (e.g., units and channels inside a CFDFC, etc.). https://github.com/EPFL-LAP/dynamatic/blob/709f10362d2d5887042f2a8a48bc0cd8d064a8e5/lib/Transforms/BufferPlacement/HandshakePlaceBuffers.cpp#L194-L198

patschu11 commented 2 months ago

@patschu11 Hi Patrick thank you so much for the contribution!

Before digging into the detailed stuff (i.e., coding style, naming, etc), I have a very high-level comment on the PR:

Why is the LSQ sizing a separate pass instead of inside the buffer placement pass (e.g., after instantiating all buffers in placeUsingMILP)? The buffer pass seems to have most of the functionalities that you have re-implemented in the PR (e.g., units and channels inside a CFDFC, etc.). https://github.com/EPFL-LAP/dynamatic/blob/709f10362d2d5887042f2a8a48bc0cd8d064a8e5/lib/Transforms/BufferPlacement/HandshakePlaceBuffers.cpp#L194-L198

Hi Jiahui,

Although on a high level some functionalities that the buffer placement implements are similar, the data structures (e.g. CFDFC class) and corresponding algorithms are not sufficient for LSQ sizing. Only knowing the units and channels of a CFDFC would not have been enough, since there are some special features needed (e.g. inserting pseudo nodes into the graph that do not correspond to any Operation). Therefore it was easier and cleaner to implement the LSQ sizing creating a dedicated data structure for the graph instead of the CFDFC stuff from buffer placement. Implementing these features in the existing buffer placement structures would have necessitated a lot of changes, which could have impacted other features and the people working on them. We discussed this in the beginning with Lucas and decided to make it a dedicated pass since barely any functionality or information could be re-used from buffer placement. Then it is also possible for the user to decide if they want to use the LSQ sizing pass or tune the size manually instead. Also, the list of BBs and II for every CFDFC will be needed by other passes in the future as well, as far as I know, therefore it made sense to us, to encode this information as attributes, instead of building a wrapper around buffer placement.

lucas-rami commented 1 month ago

Thanks for the contribution and for tagging me on this. I will try to take a look at this early this week, probably focusing more on the parts of the "regular flow" that are affected, as opposed to the new pass itself.

lucas-rami commented 1 month ago

This is good to merge for me (modulo the two cosmetic changes to make following my answers above) assuming someone else looked at the algorithm being implemented. @Jiahui17 did you or someone else had a look? If so feel free to approve and merge, and thanks for the contribution @patschu11!

Jiahui17 commented 1 month ago

Hi, Patrick! Thanks again for the contribution

I added a few more comments/questions

EPFL-LAP / dynamatic