Open Ravenslofty opened 4 years ago
So I made a small update to this, changing from a dltxp to a dltxn, because it turns out that it's more efficient to have a negative-true enable in a cell than a positive-true enable (which is converted to negative-true through a built-in inverter).
A software idiom is to not care about small efficiencies - the cell change only saves 3 units of area (um^2?) - but especially with larger LUTs you're going to be instantiating a lot of latches, and this change quickly adds up.
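To put a rough number on "quickly adds up", here is a back-of-the-envelope sketch. It assumes one latch per LUT configuration bit (so 2^k latches for a k-input LUT) and takes the 3-unit saving per cell at face value; the function name is made up for illustration.

```python
def latch_area_saved(lut_inputs: int, saving_per_latch: float = 3.0) -> float:
    """Area saved in one LUT, assuming one storage latch per configuration bit."""
    if lut_inputs < 0:
        raise ValueError("lut_inputs must be non-negative")
    # A k-input LUT stores one bit for each of its 2**k input combinations.
    return (2 ** lut_inputs) * saving_per_latch


if __name__ == "__main__":
    for k in range(2, 7):
        print(f"LUT{k}: {2 ** k} latches, {latch_area_saved(k):g} units saved")
```

Under those assumptions a LUT4 saves 48 units and a LUT6 saves 192, and that is per LUT, before multiplying by the number of LUTs in the fabric.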
This is not actually my first attempt at an FPGA; my very first attempt used a shift register for the LUT, and that was unnecessarily big.
Could this be improved, size-wise? I think doing so would require designing custom cells, but I'm not even going to attempt that.
FPGAs Are Magic I
FPGAs are magic.
On my Twitter account, I have been posting diagrams of the Logic Parcel in a Hercules Micro HME-M7 and some routing that gets signals into the Logic Parcel, and the reactions were not the best.
So let's build one!
My only real prior knowledge of FPGA architecture is cursory knowledge of the Lattice iCE40 and ECP5 and Xilinx 7 Series, and some more detailed knowledge of the Intel Cyclone V and Hercules HME-M7. I am relatively familiar with the Yosys toolchain, however, and I'll be using that as a synthesis tool.
Implementing a LUT and flop in nMigen
To achieve anything, we need a LUT and a D flip-flop, and since I am a staunch advocate of ABC9, we need timings, too.
Let's imagine the most boring possible LUT with no carry logic, and the most boring possible D flip-flop with no init or resets.
Since we're experimenting, the LUT needs to be relatively flexibly designed. ASICs generally use latches instead of flops for storage, as they're smaller, so we'll use latches for our LUT storage.
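As a behavioural reference for what such a LUT computes, here is a plain-Python sketch (not nMigen, and not the hardware description itself). It assumes the usual arrangement of one stored bit per input combination, selected by a tree of 2:1 muxes with the first input as the least significant select; the function name and argument order are illustrative only.

```python
from typing import Sequence


def lut_eval(config_bits: Sequence[int], inputs: Sequence[int]) -> int:
    """Evaluate a k-input LUT as a tree of 2:1 muxes over 2**k stored bits."""
    k = len(inputs)
    if len(config_bits) != 2 ** k:
        raise ValueError(f"a {k}-input LUT needs {2 ** k} configuration bits")
    level = list(config_bits)
    # Each input picks one bit out of every adjacent pair, halving the
    # number of candidates; inputs[0] acts as the least significant select.
    for sel in inputs:
        level = [level[i + 1] if sel else level[i] for i in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    and2 = [0, 0, 0, 1]  # output is 1 only for input combination 0b11
    xor2 = [0, 1, 1, 0]
    print(lut_eval(and2, [1, 1]), lut_eval(xor2, [1, 0]))
```

In this model the stored bits play the role of the latches, and each loop iteration corresponds to one level of muxes, which is why the delay from an input to the output can differ per input.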
Using Yosys for ASIC synthesis
Now we need Yosys to perform ASIC synthesis, and this process is... not very well documented, so here's my stab at it. I'm going to use the high-speed SkyWater cell library (sky130_fd_sc_hs), because this is a thought experiment and concerns like area and power usage don't matter to me right now.
Which prints something like this:
Using OpenSTA for timing measurement
And then we need an OpenSTA script to print timing information about it.
If you missed the comment in the nMigen source, OpenSTA measures timing between synchronous endpoints, but LUTs are combinational, so we encase the LUT in flops to measure delays. This makes checking timing a bit messy.
This will give you entries that look like this.
Here, OpenSTA is printing:
We don't care about required time at all (which is why the clock length is zero) as this is an asynchronous logic element. Arrival time is important, however, as it contains the actual timings for the LUT.
The OpenSTA script reports timings from each input to the LUT output, which is the data we'll need for ABC9, but each figure also includes an annoying synchronous delay: flops naturally have a delay from when the clock edge rises to when the output changes. This clock-to-output delay (the flop's "arrival time") needs to be excluded from the timings.
The flop arrival time information is this line in the output:
Which tells us that there is a 2.43 nanosecond delay between the clock edge (the /CLK entry above it) and the flop output (Q) changing.
The total arrival time is in this line in the output:
Which tells us there is a 3.448 nanosecond delay between the clock edge and the LUT output changing.
To find the actual delay, we just subtract the flop arrival time from the total arrival time, to get 3.448ns - 2.43ns = 1.018ns input to output delay.
Rinse and repeat for the size of LUT you're interested in. Don't assume that the timings of the same LUT with an extra input look similar; the resulting network could have different delay characteristics.
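If you're repeating that subtraction for every input, a tiny helper saves some mental arithmetic. This is only a sketch: the function name is made up, and it assumes you have already read the two arrival times (in nanoseconds) out of the OpenSTA report by hand.

```python
def lut_delay_ns(total_arrival_ns: float, flop_arrival_ns: float) -> float:
    """Input-to-output LUT delay: total arrival minus the launching flop's delay."""
    delay = total_arrival_ns - flop_arrival_ns
    if delay < 0:
        raise ValueError("total arrival time is smaller than the flop arrival time")
    # Round to picoseconds to hide floating-point noise from the subtraction.
    return round(delay, 3)


if __name__ == "__main__":
    print(lut_delay_ns(3.448, 2.43))  # 1.018
```

The rounding to three decimal places matches the precision of the figures quoted above; adjust it if your report prints more digits.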
Alternatively, here are some timings I made earlier.
In part II, I'll be going into how these timings can be used in a Yosys FPGA flow to test and measure improvements.