
The Critical Path - a rambly FPGA blog

20/11/2022: so what have you been working on, lofty? #8


Ravenslofty commented 1 year ago

this is a crosspost from my cohost post two-ish weeks ago. people might have missed me posting the link, and I also figured having an extra copy would be useful.

it's true that I've been quiet a lot recently (at least partially down to it being winter), but I decided to set myself a project: write a highly parallel FPGA router.

but to explain my project ideas, I must first explain how FPGA routing works in general.


how FPGA routing works in general

so, a router gets handed a list of (1 source wire, N sink wires) tuples ("nets"), and is told to find a valid path for all of them, such that each physical wire on the FPGA has no more than one driver.

now, the first thing that happens to a net when it's routed is that it gets broken down into N (1 source wire, 1 sink wire) pairs ("arcs"), so that one can use A* from the source to the sink.
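as a sketch of that decomposition (the net representation here is invented for illustration):

```python
# hypothetical sketch: a net is (source_wire, [sink_wires]); routing starts
# by breaking it into one (source, sink) arc per sink, each routable with A*
def net_to_arcs(net):
    source, sinks = net
    return [(source, sink) for sink in sinks]
```

so a net `("w0", ["w1", "w2"])` becomes the arcs `("w0", "w1")` and `("w0", "w2")`.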

now, with exceptions (nextpnr router1), A* is allowed to route across already-used wires, but a cost function makes it more expensive to do so for wires which aren't on the critical path. by doing iterations of that with the costs increasing, eventually the wires get spread out enough to satisfy the "no more than one driver" criterion.
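to make the congestion-cost idea concrete, here's a toy sketch; the graph shape, base costs, and penalty are all made up, and the heuristic is zero (so this degenerates to Dijkstra, which is still valid A*):

```python
import heapq

# toy sketch of congestion-aware A* over a wire graph; graph maps each wire
# to a list of (next_wire, base_cost) edges
def a_star(graph, source, sink, usage, penalty):
    # usage[w] counts arcs already using wire w; entering a used wire costs
    # extra, and raising `penalty` each iteration spreads the routes out
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, w = heapq.heappop(heap)
        if w == sink:
            path = [sink]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return path[::-1]
        if d > dist.get(w, float("inf")):
            continue  # stale heap entry
        for nxt, base_cost in graph[w]:
            nd = d + base_cost + penalty * usage.get(nxt, 0)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = w
                heapq.heappush(heap, (nd, nxt))
    return None
```

with `penalty = 0` this just finds the cheapest path; with a nonzero penalty a congested wire gets avoided even if it's geometrically shorter.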

observation one: this is entirely serial

how to parallelise this? one way is to draw a bounding box around the X/Y coordinates of the source and sink wires in a net, and categorise them by where in the FPGA the bounding boxes are, and then route as many of these categories in parallel as possible.
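a rough sketch of that binning, assuming nets carry (x, y) coordinates and the FPGA is split into quadrants at the grid midpoint (all invented for illustration):

```python
# hypothetical sketch: bin nets by which quadrant their bounding box fits in;
# nets whose box crosses a quadrant boundary can't be routed in parallel here
def bounding_box(net):
    # net: ((source_x, source_y), [(sink_x, sink_y), ...])
    (sx, sy), sinks = net
    xs = [sx] + [x for x, _ in sinks]
    ys = [sy] + [y for _, y in sinks]
    return min(xs), min(ys), max(xs), max(ys)

def bin_by_quadrant(nets, width, height):
    bins = {"NW": [], "NE": [], "SW": [], "SE": [], "crossing": []}
    for net in nets:
        x0, y0, x1, y1 = bounding_box(net)
        west, east = x1 < width // 2, x0 >= width // 2
        south, north = y1 < height // 2, y0 >= height // 2  # y grows north
        if north and west: bins["NW"].append(net)
        elif north and east: bins["NE"].append(net)
        elif south and west: bins["SW"].append(net)
        elif south and east: bins["SE"].append(net)
        else: bins["crossing"].append(net)
    return bins
```

each of the four quadrant bins can then go to its own thread, with the "crossing" bin handled serially.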

other suggestions are to use fancy graph libraries which attempt to route all arcs at the same time while trying not to step on the other threads' toes.

observation two: A* is worst-case exponential in path length

approaches like the ones laid out above strike me as optimising the wrong thing. categorising by bounding boxes routes the smallest nets in parallel, but these are also often the nets that are fastest to route; the long nets are still routed serially, and this is very slow.

observation three: FPGAs are grid-like

a little bit of this is exploited in the categorisation approach; nextpnr's router2 uses quarters and halves (split vertically and horizontally) of the FPGA to route more efficiently.

hierarchical A* implementations in the literature cache paths between nodes to optimise finding the best path between them in the future. directly implementing this doesn't work here because the A* cost function changes between iterations, but perhaps there are still ways to exploit hierarchy for parallelism.

a better way to route, I think

if one starts top-down with the grid-like structure of the FPGA, one can divide it into quarters. these segments have wires interconnecting them, and we can view those wires as the boundaries between segments. we can then partition nets along those boundaries, producing smaller nets which lie entirely inside a single segment.

as an example, if a net goes from the northeast segment to the southeast segment, it gets partitioned along the east boundary wires, producing a net for the northeast source wire to the east wire, and a net for the east wire to the southeast sink wire.
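here's a minimal sketch of splitting one arc at a boundary, where wires are (x, y) coordinates, the boundary wires and cost function are invented, and picking the boundary wire is done by a cheap cost estimate standing in for the real A* search:

```python
# hypothetical sketch: partition an arc at a segment boundary by choosing the
# boundary wire that minimises the estimated source-to-sink cost
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def partition_arc(arc, boundary_wires, cost):
    src, snk = arc
    best = min(boundary_wires, key=lambda b: cost(src, b) + cost(b, snk))
    # the original arc becomes two shorter arcs, each confined to one segment
    return [(src, best), (best, snk)]
```

for a northeast-to-southeast arc, the chosen boundary wire sits on the east boundary between the two segments, and each resulting arc can be routed independently.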

I think it's best to do partitioning using A* and the same cost function as above; since there aren't many boundary wires, this is fast (enough).

since the maximum path length of a net is now bounded by the size of the segment it's in, the worst-case A* search for that net gets much cheaper.

creating arbitrarily large amounts of parallelism

the above process can be done recursively; if one iteration splits the entire FPGA into four segments, those segments can themselves be split into four subsegments each, which means we could potentially route 16 nets at the same time.
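the recursion itself is simple; a sketch, using half-open (x0, y0, x1, y1) rectangles as segments:

```python
# sketch of recursive quartering: each level multiplies the number of
# independently-routable segments by four
def split_region(region, depth):
    if depth == 0:
        return [region]
    x0, y0, x1, y1 = region
    xm, ym = (x0 + x1) // 2, (y0 + y1) // 2
    quarters = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                (x0, ym, xm, y1), (xm, ym, x1, y1)]
    return [s for q in quarters for s in split_region(q, depth - 1)]
```

depth 1 gives 4 segments, depth 2 gives 16, depth 3 gives 64, and so on.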

I think there's still room for improvement here, though. intuitively, I think the segments don't have to be exact quarters, but instead could be split such that each segment has the same number of nets in it.
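one way to sketch that balancing (everything here is an assumption, not the actual scheme): instead of cutting at the geometric middle, cut at the median of one representative coordinate per net, so each side gets roughly half the nets.

```python
# hypothetical sketch: pick a cut coordinate that balances net count,
# given one representative x coordinate per net (e.g. its bounding-box centre)
def balanced_cut(coords):
    s = sorted(coords)
    return s[len(s) // 2]
```

with nets clustered at x = 1..3 and two outliers at x = 10..11, the cut lands at 3 rather than at the grid midpoint, keeping the workloads even.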

there's still the question of what to do towards the end of a routing iteration, though: as threads complete their work, they're left with nothing to do. one could imagine the final thread splitting its segment for the other threads to work on, but I suspect that pausing routing to re-partition would, past some point, cost more time than the extra parallelism wins back.