LUTRAM for CologneChip?!

chili-chips commented 1 month ago

Prompted by this result, which puts GateMate at a disadvantage compared to other FPGAs in its class, potentially downgrading it to a 10K LUT device (depending on the design), we wonder whether anything can be done about this major shortcoming, both now, with the silicon as it is, and in the future.

Given our upcoming utilization and stress tests, as well as comparisons with Gowin, LatticeSemi and Xilinx, we suspect that GateMate may come up short in most of them: while claiming 20-to-40K capacity on paper, it may fill up faster than a standard 20K device from other vendors.

Hence this call to CologneChip experts for help and advice on the approaches users can take for this architectural uniqueness.

pu-cc commented 1 month ago

> downgrading it to a 10K LUT device

Where does that number come from?

pu-cc commented 1 month ago

> Hence this call to CologneChip experts for help and advice on the approaches users can take for this architectural uniqueness.

There are a couple of things you could do, depending on your needs.

- Use synchronous RAM; GateMate has a sufficient amount of it.
- Use dual-port memory and simulate asynchronous read access.
- Use an asynchronous FIFO; GateMate has built-in FIFOs.
- If the design is not area-critical, let synthesis emulate it using FFs.
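
To make the difference between LUTRAM-style and synchronous RAM reads concrete, here is a minimal behavioural sketch in Python (not HDL, and not GateMate-specific). The class and function names are hypothetical. The only assumption is that an asynchronous-read memory returns data combinationally, while a synchronous RAM registers its read data, so the value appears one clock later.

```python
class AsyncReadRam:
    """LUTRAM-style memory: the read is combinational."""

    def __init__(self, contents):
        self.mem = list(contents)

    def read(self, addr):
        return self.mem[addr]


class SyncReadRam:
    """Block-RAM-style memory: read data is captured in an output register."""

    def __init__(self, contents):
        self.mem = list(contents)
        self.rdata = 0  # output register, assumed to reset to 0

    def clock(self, raddr):
        # On the clock edge the addressed word is latched into rdata.
        self.rdata = self.mem[raddr]


def async_trace(ram, addrs):
    """Data seen in each cycle while the given address is applied."""
    return [ram.read(a) for a in addrs]


def sync_trace(ram, addrs):
    """Same observation for the synchronous RAM: data lags by one cycle."""
    seen = []
    for a in addrs:
        seen.append(ram.rdata)  # value visible while address `a` is applied
        ram.clock(a)            # clock edge at the end of the cycle
    return seen


contents = [10, 20, 30, 40]
addrs = [0, 1, 2, 3]
async_seen = async_trace(AsyncReadRam(contents), addrs)
sync_seen = sync_trace(SyncReadRam(contents), addrs)
print(async_seen)
print(sync_seen)
```

The one-cycle shift in the second trace is what the surrounding logic has to absorb, for example by pipelining the consumer, when a design written for asynchronous reads is moved onto synchronous RAM.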

chili-chips-ba commented 1 month ago

Any plans for adding LUTRAM support to future silicon devices? Block RAM is typically insufficient in use cases with lots of small storage elements sprinkled all over the chip.

10K is an arbitrary SWAG, based on the already established datapoint for OPL3 FPGA design:

- GateMate: 2842 DFFs
- others: 1635 DFFs
- GateMate DFF efficiency: 1635 / 2842 ≈ 57.5%, i.e. GateMate needs roughly 1.7x the flops of the others for the same design.

The trouble is that building RAM from DFFs consumes not only flops, but also the LUTs and wiring. Altogether, it may create routing choke points that restrict access to resources that are nominally "free".
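
A rough way to see why the cost goes beyond flops is to count the pieces of a flop-based memory. The sketch below is a first-order model only, with hypothetical names (`FfRamCost`, `ff_ram_cost`). It assumes one flop per stored bit, a read path built as a tree of 2:1 multiplexers per data bit, and one decoded write enable per word. Real synthesis results will differ, and nothing here is specific to GateMate.

```python
from dataclasses import dataclass


@dataclass
class FfRamCost:
    flops: int          # storage bits
    read_mux2: int      # 2:1 muxes in the read path
    write_enables: int  # decoded per-word write enables


def ff_ram_cost(depth, width):
    """First-order cost of a depth x width memory built from flip-flops."""
    if depth < 1 or width < 1:
        raise ValueError("depth and width must be positive")
    return FfRamCost(
        flops=depth * width,
        # A depth-to-1 mux made of 2:1 muxes needs depth - 1 of them per bit.
        read_mux2=(depth - 1) * width,
        write_enables=depth,
    )


small = ff_ram_cost(depth=32, width=8)
print(small)
```

Every read mux input is a separate wire from a separate flop, which is where the routing pressure in this model comes from.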

tarik-ibrahimovic commented 3 weeks ago

We've conducted tests on Gowin and CologneChip FPGAs for respective "LUTRAM" capacities. The results and the reproduction steps are published on this repo.

The tests indicate that emulating LUTRAM behaviour is costly on GateMate FPGAs: the achievable capacity is about 10x lower than that of a comparable 20K LUT FPGA. There is also an under-utilization problem: the GateMate design uses only about a tenth of the available CPEs before routing fails.

chili-chips-ba commented 3 weeks ago

... as predicted, the cost of GateMate LUTRAM emulation in discrete FFs is very high and twofold: 1) explosion of spent flip-flops 2) routing congestion that prevents access to resources that are nominally not spent.

Is there anything that can be improved about the latter?!

We are thinking about:

- more advanced PNR routing algorithms
- or unlocking secret connectivity paths that you've built into silicon, and are yet to fully activate.

chili-chips-ba commented 1 week ago

@DadoCCAG, could you bring this catastrophic result to management's attention for the next spin of the chip?

True LUTRAM would be a very, very welcome addition to GateMate silicon, one that would bring it closer to the mainstream devices

chili-chips-ba commented 3 days ago

@tarik-ibrahimovic any insights you can share on the benefit (or not) of this synthesis switch?

pu-cc commented 3 days ago

@chili-chips-ba the experimental -luttree feature has no relation to lutrams.

chili-chips-ba commented 3 days ago

... how about helping us fully understand what this experimental switch is trying to achieve!

Let's also note that its name is a bit misleading, given that GateMate does not appear to have traditional LUTs, but rather MUX trees. Or, asked differently, if GateMate had the LUTs, why does it not have the LUTRAMs?

pu-cc commented 3 days ago

I assumed that this issue is only LUTRAM-related. Nevertheless:

> GateMate does not appear to have traditional LUTs, but rather MUX trees

Where does this information come from? It is wrong: it is a tree of LUT2s, with 4 configuration bits each, and is described in detail in the Primitives Library from page 53.

> Let's also note that its name is a bit misleading,

Please see above: since it is a tree of LUT2s, the name is already very appropriate.

> or unlocking secret connectivity paths that you've built into silicon, and are yet to fully activate.

There is no way to read back the config bits of the LUTs, which we store in latches. Furthermore, there is no decoding logic that the user can access. Therefore, a LUTRAM implementation is not possible. But again: there are alternatives, as I have shown in https://github.com/chili-chips-ba/openCologne/issues/28#issuecomment-2263666636.

About the LUT-tree itself. We are currently using yosys/abc to map combinatorial logic into typical LUT4. P&R analyzes it and maps it into the LUT-trees. This is certainly not the best way to go. Yosys already supports the mapping with the -luttree flag. Instead of mapping into 8-input functions, which is too computationally intensive, we have opted for the L2T4 (LUT-tree with 4 inputs) and L2T5 (LUT-tree with 5 inputs) approach. The exact structure is documented in the Primitives Library. The approach is smart, not computationally intensive and maps logic directly to our architecture.

Why did we decide to go with a LUT-tree? It requires fewer config bits than a standard LUT4 or LUT6. It also requires less space in the silicon.
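
The config-bit saving can be put into numbers with a small sketch. It assumes an idealised full binary tree of LUT2s with 4 configuration bits each (the per-LUT2 figure is from this thread; the full-tree shape and the function names are assumptions for illustration) and compares it with a true n-input LUT, which needs one bit per truth-table row.

```python
def true_lut_config_bits(n_inputs):
    """A true n-input LUT stores one bit per truth-table row."""
    return 2 ** n_inputs


def lut2_tree_config_bits(n_inputs):
    """Full binary tree of LUT2s: n - 1 nodes, 4 config bits per node."""
    if n_inputs < 2 or n_inputs & (n_inputs - 1):
        raise ValueError("this sketch only handles power-of-two input counts")
    return 4 * (n_inputs - 1)


for n in (4, 8):
    print(n, lut2_tree_config_bits(n), true_lut_config_bits(n))
```

For four inputs this reproduces the 12-versus-16 comparison; the gap widens quickly because the tree grows linearly with the input count while a true LUT grows exponentially.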

Not all features are yet supported in the P&R. That's why we marked it experimental. As soon as we finish its implementation, it certainly makes sense to activate the feature in yosys by default. Nextpnr for GateMate will only support L2T4 and L2T5.

chili-chips-ba commented 3 days ago

Thank you, this is all very informative. The confusion is partly from the statements in the press about GateMate falling into 40K LUT4, and even 20K LUT8 category! The latter looks unique, as there are currently no FPGA devices built with LUT8 components.

L2T4 indeed needs 12 instead of 16 configuration latches. However, is a true LUT4 otherwise better than the L2T4?

- Are there 4-input logic combinations that cannot be realized with the L2T4 structure?
- What is the cost in timing closure from the additional level of logic?

Similarly, are there 8-input logic combinations that cannot be realized with L2T4 and L2T5 structure shown above?

When it comes to the LUTRAM alternatives, they are simply too expensive, or too scarce / too coarse-grained for most cases. Tarik's experiment has shown us the following:

1) GateMate capacity is 10x less than a comparable 20K LUT4 FPGA => It's essentially like a 2K LUT4 device in this particular aspect 😞
2) We on top of that run into under-utilization, i.e. start to fail routing with only a tenth of available CPEs used 😞

pu-cc commented 2 days ago

> The confusion is partly from the statements in the press about GateMate falling into 40K LUT4, and even 20K LUT8 category!

Sorry, this is probably off-topic by now, but there is no Cologne Chip press material in which we claim to have LUT8. The official wording is 4/8-input LUT-tree.

> L2T4 indeed needs 12 instead of 16 configuration latches. However, is a true LUT4 otherwise better than the L2T4?

I don't know how to rate "better". In purely mathematical terms, there are obviously fewer functions that can be implemented with it. In general, it's hard to say whether this is a limitation, as not every design utilizes the full scope of the LUTs. Please take a look at the official code, which should clarify many questions.
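
The "fewer functions" statement can be checked by brute force. The sketch below is not the official code and rests on assumptions: it models L2T4 as `root(leaf0(I0, I1), leaf1(I2, I3))` with three independent 4-bit LUT2 configurations and a fixed pairing of inputs. The names `lut2`, `l2t4_truth_table` and `reachable_functions` are hypothetical, and the exact structure should be checked against the Primitives Library.

```python
from itertools import product


def lut2(init, x, y):
    """Evaluate a 2-input LUT: select bit (2*y + x) of the 4-bit init."""
    return (init >> ((y << 1) | x)) & 1


def l2t4_truth_table(leaf0, leaf1, root):
    """16-bit truth table of root(leaf0(i0, i1), leaf1(i2, i3))."""
    table = 0
    for idx in range(16):
        i0, i1 = idx & 1, (idx >> 1) & 1
        i2, i3 = (idx >> 2) & 1, (idx >> 3) & 1
        out = lut2(root, lut2(leaf0, i0, i1), lut2(leaf1, i2, i3))
        table |= out << idx
    return table


def reachable_functions():
    """All distinct 4-input functions over every choice of the 12 config bits."""
    return {
        l2t4_truth_table(a, b, c)
        for a, b, c in product(range(16), repeat=3)
    }


reachable = reachable_functions()
total = 2 ** 16  # number of 4-input Boolean functions
print(f"{len(reachable)} of {total} functions reachable "
      f"({100 * len(reachable) / total:.2f}%)")
```

The count applies only to this fixed pairing of inputs. Letting the router permute which signals share a leaf enlarges the reachable set, and the share of functions that real designs actually need is a separate question this sketch does not answer.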

> GateMate capacity is 10x less than a comparable 20K LUT4 FPGA

Once again, under what conditions? If I fill the entire chip with LUTRAM, maybe. But then I would first think about whether it's really that clever to implement it like this, and possibly use the built-in RAM, of which there is a lot more.

> We on top of that run into under-utilization, i.e. start to fail routing with only a tenth of available CPEs used

Can you open an issue for this? We've been doing stress tests with almost full utilization, e.g. with https://github.com/stnolting/fpga_torture.

chili-chips-ba commented 2 days ago

How do you feel about publishing an analysis of the 4-input logic functions that the L2T4 cannot implement, along with the corresponding percentage wrt traditional LUT4?!
