Ravenslofty / blog

The Critical Path - a rambly FPGA blog

05/06/2023: better LAB packing in nextpnr-mistral #14

Ravenslofty commented 1 year ago

more nextpnr-mistral things!


I've talked about the LABs (Logic Array Blocks) a little bit before; they contain 10 ALMs (Adaptive Logic Modules) each, and are essentially the heart of the Intel Cyclone V.

each ALM has 8 inputs, and these inputs can either come from global routing through the TD (Tile Dispatch) multiplexers, or from other ALM outputs through the LD (Local Dispatch) multiplexers.

now, since there are 10 ALMs, and each ALM has two logic outputs (...ish), there are 20 LD muxes. to feed these 10 ALMs with their 8 inputs each, there must be 80 TD muxes, right?

unfortunately, there are 46 TD muxes in a LAB instead of 80. in other words, a LAB can only have at most 46 unique external signals as inputs to the ALMs, and nextpnr-mistral reserves four TDs for control signals to the DFFs (D flip-flops), giving 42 TDs for ALM input signals. this makes TD accounting important to maximise utilisation of the LAB.
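the arithmetic above, as a quick sketch (constants taken from the post):

```python
# illustrative constants from the post (Cyclone V LAB as seen by nextpnr-mistral)
ALMS_PER_LAB = 10
INPUTS_PER_ALM = 8
TD_MUXES = 46                 # not the 80 you might expect
TDS_RESERVED_FOR_CONTROL = 4  # reserved by nextpnr-mistral for DFF control signals

naive_demand = ALMS_PER_LAB * INPUTS_PER_ALM     # 80 inputs to feed...
td_budget = TD_MUXES - TDS_RESERVED_FOR_CONTROL  # ...but only 42 TDs to feed them
```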

the old way

in the very beginning, there was no TD accounting: the placer filled LABs with as much stuff as it could, entirely unaware of TD limits. as a result, nothing routed, because this inevitably meant routing multiple signals through the same TD, which is not possible in hardware.

this was not a particularly great situation. so, having counted that a LAB has 42 TDs available for use by ALMs, gatecat wrote some code which counted the number of inputs to the ALMs, subtracted cases where an individual ALM shared inputs across its LUTs, and rejected placer solutions with more than 42 inputs by this count. I'm going to refer to this method as "counting".

counting effectively limits the LAB to having 42 unique inputs, reserving one TD mux for every ALM input. this overestimates TD usage; a signal going through a TD can route to multiple inputs, rendering the TDs reserved for those inputs unused. however, overestimating like this makes life easier for the router, because it results in fewer signals needing to be routed to a LAB, and less time spent figuring out how to map signals to TD muxes ("resolving TD congestion").
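a minimal sketch of how I understand the counting check — this is a toy model, not gatecat's actual code; `alm_inputs` is a hypothetical list of per-ALM input signal names:

```python
def count_td_demand(alm_inputs):
    """Estimate TD usage the "counting" way: one TD per ALM input, minus
    inputs shared across LUTs *within* a single ALM.
    alm_inputs: list (one entry per ALM) of lists of input signal names."""
    demand = 0
    for inputs in alm_inputs:
        # inputs shared within one ALM only need one TD, so dedupe per-ALM
        demand += len(set(inputs))
    return demand

def placement_ok_counting(alm_inputs, budget=42):
    """Reject a LAB packing whose counted demand exceeds the TD budget."""
    return count_td_demand(alm_inputs) <= budget
```

note the overestimate: a signal feeding two *different* ALMs is counted twice here, even though one TD could serve both.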

a different way

I'm not going to call my approach strictly "better" because increasing TD utilisation also increases time spent resolving TD congestion.

anyway, since TDs can each provide a signal, why not put all the signals into a hash set and reject solutions if the size of the hash set is over 42 elements? let's call this approach "hashing".

this was my initial idea, and after adding some code to count the number of LABs used, hashing needed 10% fewer LABs than counting. however, a placement using hashing would not route; it would be really close to routing but never succeed, and I spent days bashing my head at this problem.
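a minimal sketch of the hashing check, using the same kind of hypothetical per-ALM signal lists:

```python
def count_unique_signals(alm_inputs):
    """Estimate TD usage the "hashing" way: put every input signal into one
    set, so a signal feeding many ALM inputs is only counted once per LAB.
    alm_inputs: list (one entry per ALM) of lists of input signal names."""
    signals = set()
    for inputs in alm_inputs:
        signals.update(inputs)
    return len(signals)

def placement_ok_hashing(alm_inputs, budget=42):
    """Reject a LAB packing with more unique signals than the TD budget."""
    return count_unique_signals(alm_inputs) <= budget
```

for a packing like `[["a", "b"], ["a", "c"]]`, counting sees four inputs but hashing sees only three unique signals, which is where the extra density comes from.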

the depths of the LAB

a little-known feature of nextpnr is that it exposes a fairly large part of the internal architecture API to scripts through Python. this means one can, for example, bring up the Python REPL to play with things. combine this with Mistral's full knowledge of the routing graph and bitstream, and we can explore the LAB.

Mistral refers to the inputs of the ALM as GOUT (Global [Routing] Output) nodes, and nextpnr-mistral exposes these as wires with the name GOUT.XX.YY.ZZ, where XX and YY are the X and Y coordinates of the LAB, and ZZ is an index referring to the specific ALM input.

to compute ZZ, the index of the ALM in question is multiplied by eight (because there are eight ALM inputs) and then the ALM input index is added to it. for reasons, the 8 inputs of an ALM are, from index zero to seven: E0, F0, A, B, C, D, E1, F1. so, GOUT.1.1.2 is the A input of ALM 0 for the LAB at (1, 1).
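that indexing scheme as a tiny sketch — `gout_wire_name` is a hypothetical helper for illustration, not part of the nextpnr API:

```python
# the 8 ALM inputs in index order, per the post
INPUT_ORDER = ["E0", "F0", "A", "B", "C", "D", "E1", "F1"]

def gout_wire_name(x, y, alm, input_name):
    """Build the GOUT wire name for a given ALM input:
    ZZ = alm_index * 8 + input_index."""
    zz = alm * 8 + INPUT_ORDER.index(input_name)
    return f"GOUT.{x}.{y}.{zz}"
```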

knowing this, one can construct the name of an ALM input wire and the nextpnr API will look up the corresponding wire whenever you call into it. given a wire, you can use ctx.getPipsUphill(wire) to iterate over the PIPs (Programmable Interconnect Points) that use the wire as a destination, which gives you the TDs and LDs that connect to an ALM input.

if one goes over all the ALM inputs within a LAB, one discovers that each ALM input only connects to a subset of the TD muxes, and that subset (its "TD mux pattern") varies between inputs. mapping signals to TD muxes is therefore effectively a constraint satisfaction problem, and I have no experience with these (but I'd like to hear from those who do!).

so instead I decided on a compromise: hashing signals alone underestimates TD usage, but how about assigning each TD mux pattern a number and then hashing (signal, TD mux pattern) pairs instead?
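a sketch of that compromise, assuming a hypothetical `pattern_of` function mapping an ALM input index to its TD-mux-pattern id (the real patterns come from the routing graph):

```python
def count_td_demand_patterned(alm_inputs, pattern_of):
    """Estimate TD usage by hashing (signal, TD mux pattern) pairs: a signal
    feeding several inputs that share a pattern costs one TD, while feeding
    inputs with different patterns costs one TD per distinct pattern.
    pattern_of: callable mapping an ALM input index (0..7) to a pattern id."""
    pairs = set()
    for inputs in alm_inputs:
        for idx, sig in enumerate(inputs):
            pairs.add((sig, pattern_of(idx)))
    return len(pairs)
```

the coarser the patterns, the closer this gets to pure hashing; with a single pattern for everything it collapses to counting unique signals.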

while this is worse than pure hashing, it is still an improvement in density over counting.

things for the future

I mentioned those LD muxes a while back, and then didn't do anything with them. if one can do LD accountancy too, then one would avoid reserving TDs for internal signals, which would further increase LAB utilisation. however, I have my suspicions that the LDs also only connect to some ALM inputs, and I would need to do some more investigation.

another thing is DFFs; DFFs packed with unrelated LUTs source their inputs from E0/E1 or F0/F1, and we don't know in advance which one. that means that we must reserve both, effectively costing two TDs per DFF. not cheap.

so does this help corescore?

probably not right now. perhaps in the future?

PS: are these getting formulaic? I'm trying to keep these interesting. also I have a patreon.