YosysHQ / nextpnr

nextpnr portable FPGA place and route tool
ISC License

one hot encoding reaches a much lower fmax #987

rowanG077 closed 2 years ago

rowanG077 commented 2 years ago

I have been trying to optimize some parts of a design to reach a higher fmax. One of the things I tried was encoding states, inputs, etc. as one-hot so that the decoding logic would be shorter. But when I do this I in fact see the opposite effect: a much lower fmax.

See this gist for a simple 16 element counter that is implemented either using a one-hot or binary encoding. The full yosys+nextpnr output is also available in the gist.

Using the binary counter I reach an fmax of ~400-500 MHz, but using the one-hot encoding it only reaches ~130-180 MHz. I would have expected the inverse. What is the reason for this?

I route the inputs and outputs to random pins and use these commands to synthesize:

yosys -p "synth_ecp5 -abc2 -top Wrapper -json ./out/synth.json; " ./out/hdl/*.v
nextpnr-ecp5 --json ./out/synth.json --lpf ./lpf/rev70.lpf --textcfg ./out/pnr.cfg --25k --speed 6 --package CABGA256 --randomize-seed

So I guess this is just completely the opposite of what I would have expected. Is this something nextpnr doesn't support well?

Ravenslofty commented 2 years ago

So, with the binary counter, you have a 4-bit input and a 4-bit output. Since the ECP5 is natively LUT4, each output bit needs exactly one LUT4 to implement it. The result flops can then be packed together with those LUTs, and the input flops can be put in the same PLB. Propagation delay is minimal since the signal never leaves the PLB.
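To make the LUT4 point concrete, here is a minimal Python sketch. It assumes the binary counter is a plain increment modulo 16 (the gist is not reproduced here, so that is an assumption), and `lut4_init` is a hypothetical helper name. Any function of four input bits is fully described by a 16-entry truth table, which is what a single LUT4 stores, so each output bit fits in one LUT regardless of the exact function.

```python
def lut4_init(bit):
    """Return the 16-bit truth table for output bit `bit` of (value + 1) mod 16.

    Bit v of the result is the output when the 4-bit input equals v.
    The increment function is an assumption, not taken from the gist.
    """
    init = 0
    for value in range(16):
        next_value = (value + 1) & 0xF
        if (next_value >> bit) & 1:
            init |= 1 << value
    return init


# One truth table (and therefore one LUT4) per output bit.
tables = [lut4_init(bit) for bit in range(4)]
for bit, init in enumerate(tables):
    print(f"bit {bit}: LUT4 table 0x{init:04X}")
```

Each of the four tables depends only on the same four inputs, which is why the whole function can stay in one level of logic.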

Meanwhile, the one-hot representation requires at least four PLBs, because there are only 8 FFs in a PLB. This means some amount of global routing is needed, which is already difficult. But the real problem is that this isn't truly one-hot: it's a priority decoder. That means that bit 1 must check the value of bit 0, bit 2 must check the value of bits 0 and 1, bit 3 must check the value of bits 0, 1 and 2, and so on. This goes all the way to bit 15, which must first ensure that the 15 previous bits are zero. That check is implemented through multiple layers of logic and routing, which is slow.
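As a rough illustration of how the fan-in grows, the Python sketch below models the priority behaviour described above. It is a model under stated assumptions, not the gist's actual logic: `priority_select` and `min_lut4_levels` are hypothetical names, and the level count is only a lower bound for a tree of 4-input LUTs. It ignores routing delay between PLBs and any carry-chain or wide-mux resources the tools might use.

```python
def priority_select(bits):
    """Output i is 1 only if bit i is set and bits 0..i-1 are all 0."""
    out = []
    seen_lower = False
    for b in bits:
        out.append(1 if (b and not seen_lower) else 0)
        seen_lower = seen_lower or bool(b)
    return out


def min_lut4_levels(fan_in):
    """Lower bound on LUT4 levels needed to combine `fan_in` signals."""
    levels = 1
    reach = 4  # one LUT4 can see at most 4 signals
    while reach < fan_in:
        reach *= 4
        levels += 1
    return levels


# Output bit i depends on input bits 0..i, so its fan-in is i + 1.
fan_in = [i + 1 for i in range(16)]
levels = [min_lut4_levels(n) for n in fan_in]
for i, (n, lv) in enumerate(zip(fan_in, levels)):
    print(f"bit {i:2d}: fan-in {n:2d}, at least {lv} LUT4 level(s)")
```

Under this model every bit from 4 upward needs at least two LUT levels, compared with one for each bit of the binary counter, and the signals feeding those LUTs have to be routed in from several PLBs.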

In other words, this result is to be expected.