this is a rough post on what I've tried doing to improve the performance of nextpnr-mistral on a specific benchmark: CoreScore. it might end up becoming a series, depending on how inspired I get to try to work on this.
what's corescore?
in 2018, there was a competition held by the risc-v foundation for open-source soft-cpu designs. the winner for the "creativity prize" of the competition was SERV by olof kindgren. SERV is a very tiny risc-v core because instead of doing operations with, say, 32 bits in one clock cycle, it processes a single bit per clock cycle ("bit-serial" logic), which makes the core logic minimal. if you want to learn more about it, olof's done a talk about it.
anyway, after the competition, olof kept tinkering with SERV, and eventually came up with a toolchain benchmark: how many SERV cores can one fit on an FPGA? that seemingly-simple question tests quite a lot of the toolchain: better synthesis makes SERV smaller, so more cores fit; better placement and routing makes more efficient use of the FPGA, so more cores fit too.
this benchmark is CoreScore, and there's even a small website about the largest number of SERV cores that can be fit on various boards.
for my purposes, the target board is the Terasic DE10-Nano, which carries an Intel Cyclone V FPGA, onto which Quartus can fit 271 SERV cores.
Mistral...can only fit 84 of them. time to get to work.
the symptoms
these are the nextpnr device utilisation statistics for the 84-core SERV SoC built with Mistral.
one might first notice that only 26% of the available LUTs are utilised, which suggests there is a lot of headroom; so why is this the limit for the tooling?
well, the device utilisation section only tells us a little bit of the story: there are more limits to FPGA utilisation than simply "how many potential places for LUTs are actually filled".
control set hell
FPGAs have LUTs to provide combinational logic, and DFFs to provide individual bits of memory to store the results of that logic.
a not-uncommon operation on DFFs is storing zero instead of the intended value when a particular condition holds; this is called synchronous-clear (synchronous because the clear only happens on a clock edge). this is relatively cheap to implement in hardware (it's an AND gate before the DFF input), so some vendors offer it to save LUTs.
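here's a minimal Verilog sketch of the idea (module and signal names are mine, purely illustrative); on hardware without a dedicated synchronous-clear, the if/else below costs an AND gate per bit in LUTs:

module sync_clear_reg (
    input wire clk,
    input wire clear,      // synchronous clear: only acts on a clock edge
    input wire [31:0] d,
    output reg [31:0] q
);
    // on hardware with a dedicated synchronous-clear wire, this maps
    // to plain DFFs plus that wire; without it, each bit needs an AND
    // gate in front of the DFF input (each q bit becomes d & ~clear).
    always @(posedge clk) begin
        if (clear)
            q <= 32'd0;
        else
            q <= d;
    end
endmodule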
on some FPGAs, synchronous-clear is cheap(ish). for example, a Lattice ECP5 PFU containing eight LUT4s has two set/clear wires, each of which may be synchronous or asynchronous.
the Intel Cyclone V goes a different route: each LAB has two dedicated asynchronous-clear wires, and one dedicated synchronous-clear wire which a flop may choose to ignore.
features such as synchronous-clear are called DFF controls, and the particular set of signals that drives the DFF controls is the "DFF control set". it's important that the control sets of flops packed in a LAB do not conflict: while a DFF in a LAB which has no synchronous-clear can be packed with a DFF which depends on a synchronous-clear, one cannot put two DFFs with different synchronous-clear signals in the same LAB. effectively, this means the maximum number of synchronous-clear signals available in the Cyclone V is the same as the number of LABs on the chip.
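to make the packing rule concrete, here's a hypothetical pair of flops (names mine) that can never share a Cyclone V LAB, because each one needs the LAB's single synchronous-clear wire driven by a different signal:

module control_set_conflict (
    input wire clk,
    input wire clear_a,    // drives one synchronous-clear network...
    input wire clear_b,    // ...and this drives a different one
    input wire d_a,
    input wire d_b,
    output reg q_a,
    output reg q_b
);
    // q_a's control set is {clk, clear_a}; q_b's is {clk, clear_b}.
    // a LAB has one synchronous-clear wire, so these two flops must
    // land in different LABs (or one clear must be emulated in LUTs).
    always @(posedge clk)
        q_a <= clear_a ? 1'b0 : d_a;
    always @(posedge clk)
        q_b <= clear_b ? 1'b0 : d_b;
endmodule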
all this results in significantly fewer synchronous-clear resources on the Cyclone V compared to the ECP5. the chip on the DE10-Nano, the 5CSEBA6U23I7, has 83,820 potential LUT slots; a comparable ECP5, the LFE5UM-85F, has 83,640 potential LUT slots, but the Cyclone V has only 4,191 synchronous-clears (one per LAB of 10 ALMs, each ALM containing two LUT slots) compared to (up to) 20,910 synchronous-clears on the ECP5 (two per PFU of eight LUTs).
why did I go on a multi-paragraph ramble about a relatively-obscure optimisation in FPGA hardware? well, Yosys really likes to infer synchronous-clear signals when they're available, and SERV is optimised to make maximum use of them to reduce area. all this combines to make an optimisation that's cheap on other architectures rather expensive on the Cyclone V.
now, the corescore rules require the toolchain to be used as-is from upstream, so obviously I can't just hack my local copy of Yosys and claim victory.
a reasonable approach
I mentioned before that Yosys likes inferring synchronous-clears. this is because going from complex flops to simple flops is (relatively) easy, but going from simple flops to complex flops is harder. the Yosys pass for turning complex flops into simple flops is called dfflegalize, and it takes a description of the supported flops and the available flop-initialisation values.
in synth_intel_alm, one can legalise to:
positive-edge-clocked D flip-flops with negative-polarity asynchronous reset and positive-polarity clock enables that initialise to zero, or
positive-edge-clocked D flip-flops with positive-polarity synchronous reset and positive-polarity clock enables (that have priority over the reset).
that's a lot, and you don't have to worry about it too much, except that the description lets dfflegalize know that, for example, it can't use a flop that has both synchronous and asynchronous reset (which is a frankly terrifying proposition).
now, to handle cases like the Intel Cyclone V (or more specifically the Lattice iCE40, which was designed by SiliconBlue, a startup of ex-Intel employees), dfflegalize has an option called -minsrst N, which makes it use a synchronous-clear signal only when that signal drives more than N flops, emulating it with soft logic otherwise.
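to make that concrete, here is roughly what the legalisation step looks like as Yosys script; the -cell patterns are my transcription of the two flop types described above (not a verbatim quote of synth_intel_alm's source), and the second invocation sketches the -minsrst experiment:

# legalise to the two flop types described above: $_DFFE_PN0P_ is a
# posedge-clocked DFF with active-low async reset and active-high
# enable; $_SDFFCE_PP0P_ has an active-high sync reset with the
# enable taking priority. the trailing 0 means "may initialise to 0".
dfflegalize -cell $_DFFE_PN0P_ 0 -cell $_SDFFCE_PP0P_ 0

# the experiment: same legalisation, but only keep a synchronous
# clear whose signal drives more than 5 flops; rarer ones get
# emulated with soft logic instead.
dfflegalize -cell $_DFFE_PN0P_ 0 -cell $_SDFFCE_PP0P_ 0 -minsrst 5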
since 100 is a nice round number, let's see what it takes to build a SoC with 100 SERV cores.
since synchronous-clears are about five times scarcer on the Cyclone V than on the ECP5 (4,191 versus 20,910), let's start with an N of 5.
Info: Device utilisation:
Info: MISTRAL_COMB: 27837/83820 33%
Info: MISTRAL_FF: 27819/167640 16%
Info: MISTRAL_IO: 4/ 472 0%
Info: MISTRAL_CLKENA: 1/ 2 50%
Info: cyclonev_oscillator: 0/ 1 0%
Info: cyclonev_hps_interface_mpu_general_purpose: 0/ 1 0%
Info: MISTRAL_M10K: 101/ 553 18%
Info: Placed 0 cells based on constraints.
Info: Creating initial analytic placement for 52917 cells, random placement wirelen = 5743416.
Info: at initial placer iter 0, wirelen = 76
Info: at initial placer iter 1, wirelen = 78
Info: at initial placer iter 2, wirelen = 78
Info: at initial placer iter 3, wirelen = 78
Info: Running main analytical placer, max placement attempts per cell = 10000.
ERROR: Unable to find legal placement for cell 'corescorecore.core_43.serving.rf_ram_if.rcnt_MISTRAL_FF_Q' after 10001 attempts, check constraints and utilisation. Use `--placer-heap-cell-placement-timeout` to change the number of attempts.
0 warnings, 1 error
one cup of tea later, it fails.
okay, let's double it and try again.
Info: Device utilisation:
Info: MISTRAL_COMB: 28669/83820 34%
Info: MISTRAL_FF: 27819/167640 16%
Info: MISTRAL_IO: 4/ 472 0%
Info: MISTRAL_CLKENA: 1/ 2 50%
Info: cyclonev_oscillator: 0/ 1 0%
Info: cyclonev_hps_interface_mpu_general_purpose: 0/ 1 0%
Info: MISTRAL_M10K: 101/ 553 18%
Info: Placed 0 cells based on constraints.
Info: Creating initial analytic placement for 53749 cells, random placement wirelen = 5794611.
Info: at initial placer iter 0, wirelen = 76
Info: at initial placer iter 1, wirelen = 78
Info: at initial placer iter 2, wirelen = 78
Info: at initial placer iter 3, wirelen = 78
Info: Running main analytical placer, max placement attempts per cell = 10000.
Info: at iteration #1, type MISTRAL_COMB: wirelen solved = 21667, spread = 3861671, legal = 3862040; time = 0.83s
ERROR: Unable to find legal placement for cell 'corescorecore.core_48.serving.cpu.ctrl.i_jump_MISTRAL_FF_Q' after 10001 attempts, check constraints and utilisation. Use `--placer-heap-cell-placement-timeout` to change the number of attempts.
0 warnings, 1 error
nope. notice how the number of MISTRAL_COMB cells increases (28669, up from 27837) as more soft logic is used to emulate synchronous-clear.
no dice
so, I kept doubling the -minsrst value, but every reasonable value of -minsrst (and 1280, which verges on unreasonable) produces the same failure as above. I even tried -minsrst 12800, which still fails.
I think the problem comes down to there being a few giant synchronous-clear networks, and nextpnr handles this situation terribly. it's certainly something to consider handling "properly" (perhaps by promoting these high-fanout signals to global routing? to investigate).
but I've spent quite a while staring at Yosys building a netlist only for nextpnr to fail placing it. time to call it a day.