Open Ravenslofty opened 3 years ago
Interesting stuff. There is, as you know, a rich literature on this topic, and this paper https://ieeexplore.ieee.org/document/1281800 seems to agree with your result (disclaimer: I have only read the abstract).
EDIT: I suspect LUT4 and LUT6 are preferred because 4:1 muxes map well to both, but are inefficient on LUT5.
That article is actually for a bit later in the series! I'll cover that when I get to interconnect.
FPGAs Are Magic II
More boilerplate (but we're getting somewhere!)
The goal of this post is to get through the FPGA-flow boilerplate, as opposed to the ASIC-flow boilerplate of part I.
Our goal is to make the FPGA "good", but how do we quantify "good"?
We could measure the area of an FPGA design by multiplying the number of LUTs needed for it by the die area per LUT, giving us the die area needed to implement that design on the chip. Equally, we could measure the speed of an FPGA by calculating the critical-path delay of a design implemented for it.
I decided to target both: by multiplying the area of the design by its critical-path delay, you get a figure called the delay-area product (DAP), which you can minimise.
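As a minimal sketch of that arithmetic: the function names and the numbers below are placeholders made up for illustration, not measurements from this post.

```python
def design_area(lut_count, area_per_lut):
    """Die area of a design: number of LUTs times die area per LUT."""
    return lut_count * area_per_lut


def delay_area_product(lut_count, area_per_lut, critical_path_delay):
    """Delay-area product (DAP): design area times critical-path delay."""
    return design_area(lut_count, area_per_lut) * critical_path_delay


# Placeholder numbers in arbitrary units, purely for illustration.
candidates = {
    "small_slow": delay_area_product(lut_count=1000, area_per_lut=1.0,
                                     critical_path_delay=12.0),
    "big_fast": delay_area_product(lut_count=700, area_per_lut=2.0,
                                   critical_path_delay=8.0),
}

# Lower DAP is better, so pick the minimum.
best = min(candidates, key=candidates.get)
print(candidates, best)
```

Because the two factors are multiplied, an architecture with bigger LUTs can still win if it needs fewer of them and shortens the critical path enough.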
From Yosys' point of view, area statistics can be calculated from the stat (statistics) command; critical-path delay is a bit trickier. I opted to use the sta (static timing analysis) command from Yosys' eddie/sta branch, which re-uses timing information provided to ABC9 to calculate the critical path. Even though that branch is a little outdated (it was last rebased in May), it's modern enough for our purposes.
Simulation models
Let's start off by writing simulation models of the LUT and DFF cells. I went up to LUT8s in my timing information, so here's a LUT8. We can simulate smaller LUTs by just tying inputs to constants and limiting the maximum LUT size ABC uses.
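To make the "tie inputs to constants" idea concrete, here is a small behavioural illustration in Python. It is only a sketch: the names lut_eval and lut8_as_smaller are hypothetical, and treating the first input as the least significant index bit is an assumption.

```python
def lut_eval(init, inputs):
    """Evaluate a LUT: the input bits form an index into the truth table.

    `init` is the truth table as an integer; inputs[0] is assumed to be
    the least significant bit of the index.
    """
    index = 0
    for position, bit in enumerate(inputs):
        index |= (bit & 1) << position
    return (init >> index) & 1


def lut8_as_smaller(init, inputs, tie_value=0):
    """Use a LUT8 as a LUT-k by tying the unused upper inputs to a constant."""
    if len(inputs) > 8:
        raise ValueError("a LUT8 has at most 8 inputs")
    padded = list(inputs) + [tie_value] * (8 - len(inputs))
    return lut_eval(init, padded)


# A 2-input AND has truth table 0b1000. With the upper six inputs tied to 0,
# only the low 4 bits of the 256-bit LUT8 table are ever addressed.
AND2_INIT = 0b1000
for a in (0, 1):
    for b in (0, 1):
        assert lut8_as_smaller(AND2_INIT, [a, b]) == (a & b)
```

This is only a behavioural illustration; the Verilog simulation model of the cells is what the flow actually consumes.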
I'll call this file rain_sim.v. We'll need to reference it in the synthesis script.
Mapping
ABC9 will produce a netlist which uses Yosys-internal $lut cells. We need to map those to our LUT model, using the Yosys techmap pass that takes in a Verilog file.
I'll call this file rain_map.v.
Synthesis
Now we need another Yosys synthesis script, but this time for FPGAs instead of ASICs.
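One way to build such a script per LUT size is to generate the text from a small helper. The sketch below is an assumed flow, not the script used for this post: the helper name, the design path and the exact pass order are hypothetical. The Yosys commands themselves (read_verilog, synth, abc9 -lut, techmap -map, stat) are standard; sta is only available on the eddie/sta branch mentioned earlier.

```python
def make_synth_script(design, lut_size):
    """Return the text of a Yosys script targeting LUTs of `lut_size` inputs.

    Assumed flow for illustration only; `design` is a hypothetical path to
    a Verilog benchmark.
    """
    if not 2 <= lut_size <= 8:
        raise ValueError("this sketch only covers LUT2 to LUT8")
    lines = [
        f"read_verilog {design}",
        "read_verilog -lib rain_sim.v",   # cell models as black boxes
        "synth -flatten",                 # generic synthesis
        f"abc9 -lut {lut_size}",          # cap the LUT size ABC9 may use
        "techmap -map rain_map.v",        # map $lut cells to the LUT model
        "stat",                           # cell counts, for area
        "sta",                            # critical path (eddie/sta branch)
    ]
    return "\n".join(lines)


# One script per LUT size from LUT2 to LUT8.
scripts = {k: make_synth_script("design.v", k) for k in range(2, 9)}
print(scripts[4])
```

Each generated script could be written to a file and passed to yosys with -s; the per-size timing information would still need to be swapped in separately.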
You'll probably want to write a script to vary the size of the LUT and change the timings as necessary. Then you can extract the relative LUT sizes from the output of stat and the critical path from sta. I picked a few of the benchmarks from the EPFL combinational benchmark suite and here are my results.
"Relative Speed" measures how fast the end result can go. We don't want to artificially hobble the chip by using a slow architecture.
"Relative Area" measures the total die area needed to implement that design. We don't want to spend an excessive amount of area for the architecture.
"Relative Delay-Area" measures how fast the end result is for its area. We want a design which provides the best performance for its area.
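A minimal sketch of how the three relative figures could be derived from raw area and delay numbers. It assumes each figure is normalised against a chosen baseline architecture, which is an assumption about the method rather than something stated here; the architecture names and numbers are placeholders, not results.

```python
def relative_metrics(results, baseline):
    """Normalise (area, delay) pairs against a baseline entry.

    `results` maps an architecture name to (area, critical_path_delay).
    Speed is inverted so that a value above 1 means faster than baseline.
    """
    base_area, base_delay = results[baseline]
    relative = {}
    for name, (area, delay) in results.items():
        relative[name] = {
            "speed": base_delay / delay,
            "area": area / base_area,
            "delay_area": (area * delay) / (base_area * base_delay),
        }
    return relative


# Placeholder figures in arbitrary units, purely for illustration.
raw = {"arch_a": (100.0, 10.0), "arch_b": (150.0, 8.0)}
rel = relative_metrics(raw, baseline="arch_a")
print(rel["arch_b"])
```

In this made-up case arch_b is faster than the baseline but spends more area, and its delay-area figure comes out worse, which is the kind of trade-off the three columns are meant to expose.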
From the data, implementing designs using LUT2s is slower and less efficient than LUT3s, because you need a lot of them for the same design. Conversely, using LUT8s is slower and less efficient because it's difficult to use all of a LUT8, and all the logic results in slow switching speeds. Let's discount them both.
The LUT5 is the fastest architecture here, with the LUT7 a close second. This is surprising to me, because commercial FPGAs use LUT6s and LUT4s, and I was expecting these to be more competitive.
The LUT2 is the smallest design, with the LUT3 close behind. This makes sense; they're the smallest LUTs, and going from LUT2 to LUT3 is a significant efficiency boost.
The lack of performance for the LUT2 is punished by the delay-area product, and so the LUT3 is the best there.
If I was going to implement an FPGA from that data, I would pick the LUT5. It performs the best, and I think the performance of smaller LUTs will diminish when routing costs come into play.
But there are some other factors to be explored; you can combine smaller LUTs with fast muxes to increase performance, and you can use multiple outputs on large LUTs to reduce area.
In the next post, I will explore these.