enjoy-digital / litex_agilex5_test

Initial Test/Support of LiteX on Intel Agilex5 FPGAs.

Resources Usage #2

Open trabucayre opened 3 months ago

trabucayre commented 3 months ago

VexRiscv SMP

Intel Agilex5E board

time ./intel_agilex5e_065b_premium_devkit.py --cpu-type=vexriscv_smp --build --with-rvc --with-fpu --with-wishbone-memory
real    7m54.943s
user    18m14.855s
sys     0m9.737s
+-------------------------------------------------------------+------------------------+-------+
; Resource                                                    ; Usage                  ; %     ;
+-------------------------------------------------------------+------------------------+-------+
; Usage Report Generated after: Place                         ;                        ;       ;
; Logic utilization (ALMs needed / total ALMs on device)      ; 6,341 / 222,400        ; 3 %   ;
; ALMs needed [=A-B+C]                                        ; 6,341                  ;       ;
;     [A] ALMs used in final placement [=a+b+c+d]             ; 7,538 / 222,400        ; 3 %   ;
;         [a] ALMs used for LUT logic and register circuitry  ; 2,345                  ;       ;
;         [b] ALMs used for LUT logic                         ; 3,752                  ;       ;
;         [c] ALMs used for register circuitry                ; 1,441                  ;       ;
;         [d] ALMs used for memory (up to half of total ALMs) ; 0                      ;       ;
;     [B] Estimate of ALMs recoverable by dense packing       ; 1,230 / 222,400        ; < 1 % ;
;     [C] Estimate of ALMs unavailable [=a+b+c+d]             ; 33 / 222,400           ; < 1 % ;
;         [a] Due to location constrained logic               ; 0                      ;       ;
;         [b] Due to LAB-wide signal conflicts                ; 0                      ;       ;
;         [c] Due to LAB input limits                         ; 33                     ;       ;
;         [d] Due to virtual I/Os                             ; 0                      ;       ;
; Difficulty packing design                                   ; Low                    ;       ;
; Total LABs:  partially or completely used                   ; 897 / 22,240           ; 4 %   ;
;     -- Logic LABs                                           ; 897                    ;       ;
;     -- Memory LABs (up to half of total LABs)               ; 0                      ;       ;
; Combinational ALUT usage for logic                          ; 9,831                  ;       ;
;     -- 8 input functions                                    ; 296                    ;       ;
;     -- 7 input functions                                    ; 69                     ;       ;
;     -- 6 input functions                                    ; 892                    ;       ;
;     -- 5 input functions                                    ; 2,024                  ;       ;
;     -- 4 input functions                                    ; 1,556                  ;       ;
;     -- <=3 input functions                                  ; 4,994                  ;       ;
; Combinational ALUT usage for route-throughs                 ; 1,132                  ;       ;
; Dedicated logic registers                                   ; 8,895                  ;       ;
;     -- By type:                                             ;                        ;       ;
;         -- LAB logic registers:                             ;                        ;       ;
;             -- Primary logic registers                      ; 7,571 / 444,800        ; 2 %   ;
;             -- Secondary logic registers                    ; 734 / 444,800          ; < 1 % ;
;         -- Hyper-Registers:                                 ; 590                    ;       ;
; Register control circuitry for power estimation             ; 0                      ;       ;
; ALMs adjustment for power estimation                        ; 730                    ;       ;
; I/O pins                                                    ; 5 / 624                ; < 1 % ;
;     -- Clock pins                                           ; 0 / 44                 ; 0 %   ;
;     -- Dedicated input pins                                 ; 3 / 54                 ; 6 %   ;
; M20K blocks                                                 ; 31 / 1,611             ; 2 %   ;
; Total MLAB memory bits                                      ; 0                      ;       ;
; Total block memory bits                                     ; 297,952 / 32,993,280   ; < 1 % ;
; Total block memory implementation bits                      ; 634,880 / 32,993,280   ; 2 %   ;
; DSP Blocks Needed [=A+B+C-D]                                ; 7 / 846                ; < 1 % ;
;     [A] Total Fixed Point DSP Blocks                        ; 13                     ;       ;
;     [B] Total Floating Point DSP Blocks                     ; 0                      ;       ;
;     [C] Total DSP_PRIME Blocks                              ; 0                      ;       ;
;     [D] Estimate of DSP Blocks recoverable by dense merging ; 6                      ;       ;
; IOPLLs                                                      ; 0 / 15                 ; 0 %   ;
; Global signals                                              ; 1                      ;       ;
; LVDS_RX blocks                                              ; 0 / 192                ; 0 %   ;
; HMC blocks                                                  ; 0 / 8                  ; 0 %   ;
; Maximum fan-out                                             ; 9553                   ;       ;
; Highest non-global fan-out                                  ; 503                    ;       ;
; Total fan-out                                               ; 72506                  ;       ;
; Average fan-out                                             ; 3.75                   ;       ;
+-------------------------------------------------------------+------------------------+-------+
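As a sanity check on the report above, the derived rows can be recomputed from their components. The following is a minimal Python sketch, not part of the original build flow; the variable names are illustrative and the values are copied from the VexRiscv SMP report.

```python
# Recompute the derived rows of the Quartus fitter report for VexRiscv SMP.
# All numbers are copied from the report above; the names are illustrative.

TOTAL_ALMS = 222_400

alms_in_final_placement = 7_538   # [A]
alms_recoverable = 1_230          # [B] estimate, dense packing
alms_unavailable = 33             # [C] estimate

# "ALMs needed [=A-B+C]"
alms_needed = alms_in_final_placement - alms_recoverable + alms_unavailable

# [A] is itself the sum of its four sub-rows [a]..[d].
alms_a_breakdown = 2_345 + 3_752 + 1_441 + 0

# "DSP Blocks Needed [=A+B+C-D]"
dsp_needed = 13 + 0 + 0 - 6

print(f"ALMs needed: {alms_needed} ({100 * alms_needed / TOTAL_ALMS:.1f} % of device)")
print(f"[A] from sub-rows: {alms_a_breakdown}")
print(f"DSP blocks needed: {dsp_needed}")
```

Both formulas reproduce the reported 6,341 ALMs and 7 DSP blocks; the report rounds the utilization to a whole percent.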

trabucayre commented 3 months ago

naxRiscv 32bits

Intel Agilex5E board

./intel_agilex5e_065b_premium_devkit.py --cpu-type=naxriscv --with-fpu --with-rvc --build
real    40m10.748s
user    113m37.792s
sys     1m12.900s
+-------------------------------------------------------------+------------------------+-------+
; Resource                                                    ; Usage                  ; %     ;
+-------------------------------------------------------------+------------------------+-------+
; Usage Report Generated after: Place                         ;                        ;       ;
; Logic utilization (ALMs needed / total ALMs on device)      ; 65,574 / 222,400       ; 29 %  ;
; ALMs needed [=A-B+C]                                        ; 65,574                 ;       ;
;     [A] ALMs used in final placement [=a+b+c+d]             ; 80,052 / 222,400       ; 36 %  ;
;         [a] ALMs used for LUT logic and register circuitry  ; 7,572                  ;       ;
;         [b] ALMs used for LUT logic                         ; 49,323                 ;       ;
;         [c] ALMs used for register circuitry                ; 23,027                 ;       ;
;         [d] ALMs used for memory (up to half of total ALMs) ; 130                    ;       ;
;     [B] Estimate of ALMs recoverable by dense packing       ; 14,731 / 222,400       ; 7 %   ;
;     [C] Estimate of ALMs unavailable [=a+b+c+d]             ; 253 / 222,400          ; < 1 % ;
;         [a] Due to location constrained logic               ; 0                      ;       ;
;         [b] Due to LAB-wide signal conflicts                ; 0                      ;       ;
;         [c] Due to LAB input limits                         ; 253                    ;       ;
;         [d] Due to virtual I/Os                             ; 0                      ;       ;
; Difficulty packing design                                   ; Low                    ;       ;
; Total LABs:  partially or completely used                   ; 13,842 / 22,240        ; 62 %  ;
;     -- Logic LABs                                           ; 13,829                 ;       ;
;     -- Memory LABs (up to half of total LABs)               ; 13                     ;       ;
; Combinational ALUT usage for logic                          ; 70,349                 ;       ;
;     -- 8 input functions                                    ; 13,013                 ;       ;
;     -- 7 input functions                                    ; 175                    ;       ;
;     -- 6 input functions                                    ; 18,159                 ;       ;
;     -- 5 input functions                                    ; 18,685                 ;       ;
;     -- 4 input functions                                    ; 9,008                  ;       ;
;     -- <=3 input functions                                  ; 11,309                 ;       ;
; Combinational ALUT usage for route-throughs                 ; 7,884                  ;       ;
; Memory ALUT usage                                           ; 73                     ;       ;
;     -- 64-address deep                                      ; 0                      ;       ;
;     -- 32-address deep                                      ; 0                      ;       ;
; Dedicated logic registers                                   ; 64,853                 ;       ;
;     -- By type:                                             ;                        ;       ;
;         -- LAB logic registers:                             ;                        ;       ;
;             -- Primary logic registers                      ; 61,197 / 444,800       ; 14 %  ;
;             -- Secondary logic registers                    ; 2,441 / 444,800        ; < 1 % ;
;         -- Hyper-Registers:                                 ; 1,215                  ;       ;
; Register control circuitry for power estimation             ; 260                    ;       ;
; ALMs adjustment for power estimation                        ; 16,160                 ;       ;
; I/O pins                                                    ; 5 / 624                ; < 1 % ;
;     -- Clock pins                                           ; 0 / 44                 ; 0 %   ;
;     -- Dedicated input pins                                 ; 3 / 54                 ; 6 %   ;
; M20K blocks                                                 ; 405 / 1,611            ; 25 %  ;
; Total MLAB memory bits                                      ; 2,160                  ;       ;
; Total block memory bits                                     ; 1,718,140 / 32,993,280 ; 5 %   ;
; Total block memory implementation bits                      ; 8,294,400 / 32,993,280 ; 25 %  ;
; DSP Blocks Needed [=A+B+C-D]                                ; 7 / 846                ; < 1 % ;
;     [A] Total Fixed Point DSP Blocks                        ; 13                     ;       ;
;     [B] Total Floating Point DSP Blocks                     ; 0                      ;       ;
;     [C] Total DSP_PRIME Blocks                              ; 0                      ;       ;
;     [D] Estimate of DSP Blocks recoverable by dense merging ; 6                      ;       ;
; IOPLLs                                                      ; 0 / 15                 ; 0 %   ;
; Global signals                                              ; 1                      ;       ;
; LVDS_RX blocks                                              ; 0 / 192                ; 0 %   ;
; HMC blocks                                                  ; 0 / 8                  ; 0 %   ;
; Maximum fan-out                                             ; 68871                  ;       ;
; Highest non-global fan-out                                  ; 2174                   ;       ;
; Total fan-out                                               ; 622733                 ;       ;
; Average fan-out                                             ; 4.48                   ;       ;
+-------------------------------------------------------------+------------------------+-------+
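The two block-memory rows measure different things: "block memory bits" counts the bits the design actually stores, while "implementation bits" counts the full capacity of every M20K block that was allocated. A short Python sketch, assuming 20,480 bits (20 Kbit) per M20K and using values copied from the two Agilex5 reports above, shows the relationship. The dictionary layout and names are illustrative only.

```python
M20K_BITS = 20 * 1024  # capacity of one M20K block in bits

# (M20K blocks, block memory bits, block memory implementation bits)
reports = {
    "VexRiscv SMP": (31, 297_952, 634_880),
    "NaxRiscv 32-bit": (405, 1_718_140, 8_294_400),
}

fill_ratio = {}
for name, (blocks, used_bits, impl_bits) in reports.items():
    # Implementation bits are the allocated blocks times their capacity.
    assert blocks * M20K_BITS == impl_bits
    fill_ratio[name] = used_bits / impl_bits
    print(f"{name}: {blocks} M20K, {100 * fill_ratio[name]:.1f} % of allocated bits used")
```

In both builds less than half of the allocated M20K capacity holds data, which typically happens when memories are narrower or shallower than a full block.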

trabucayre commented 3 months ago

naxRiscv 64bits

time ./intel_agilex5e_065b_premium_devkit.py --cpu-type=naxriscv --xlen=64 --with-fpu --with-rvc --build
real    55m39.381s
user    157m54.672s
sys     2m9.537s
+-------------------------------------------------------------+------------------------+-------+
; Resource                                                    ; Usage                  ; %     ;
+-------------------------------------------------------------+------------------------+-------+
; Usage Report Generated after: Place                         ;                        ;       ;
; Logic utilization (ALMs needed / total ALMs on device)      ; 86,509 / 222,400       ; 39 %  ;
; ALMs needed [=A-B+C]                                        ; 86,509                 ;       ;
;     [A] ALMs used in final placement [=a+b+c+d]             ; 104,414 / 222,400      ; 47 %  ;
;         [a] ALMs used for LUT logic and register circuitry  ; 8,717                  ;       ;
;         [b] ALMs used for LUT logic                         ; 67,225                 ;       ;
;         [c] ALMs used for register circuitry                ; 28,342                 ;       ;
;         [d] ALMs used for memory (up to half of total ALMs) ; 130                    ;       ;
;     [B] Estimate of ALMs recoverable by dense packing       ; 18,239 / 222,400       ; 8 %   ;
;     [C] Estimate of ALMs unavailable [=a+b+c+d]             ; 334 / 222,400          ; < 1 % ;
;         [a] Due to location constrained logic               ; 0                      ;       ;
;         [b] Due to LAB-wide signal conflicts                ; 0                      ;       ;
;         [c] Due to LAB input limits                         ; 334                    ;       ;
;         [d] Due to virtual I/Os                             ; 0                      ;       ;
; Difficulty packing design                                   ; Low                    ;       ;
; Total LABs:  partially or completely used                   ; 18,538 / 22,240        ; 83 %  ;
;     -- Logic LABs                                           ; 18,525                 ;       ;
;     -- Memory LABs (up to half of total LABs)               ; 13                     ;       ;
; Combinational ALUT usage for logic                          ; 91,513                 ;       ;
;     -- 8 input functions                                    ; 18,566                 ;       ;
;     -- 7 input functions                                    ; 246                    ;       ;
;     -- 6 input functions                                    ; 25,000                 ;       ;
;     -- 5 input functions                                    ; 24,771                 ;       ;
;     -- 4 input functions                                    ; 9,873                  ;       ;
;     -- <=3 input functions                                  ; 13,057                 ;       ;
; Combinational ALUT usage for route-throughs                 ; 9,270                  ;       ;
; Memory ALUT usage                                           ; 73                     ;       ;
;     -- 64-address deep                                      ; 0                      ;       ;
;     -- 32-address deep                                      ; 0                      ;       ;
; Dedicated logic registers                                   ; 77,750                 ;       ;
;     -- By type:                                             ;                        ;       ;
;         -- LAB logic registers:                             ;                        ;       ;
;             -- Primary logic registers                      ; 74,117 / 444,800       ; 17 %  ;
;             -- Secondary logic registers                    ; 2,365 / 444,800        ; < 1 % ;
;         -- Hyper-Registers:                                 ; 1,268                  ;       ;
; Register control circuitry for power estimation             ; 260                    ;       ;
; ALMs adjustment for power estimation                        ; 20,844                 ;       ;
; I/O pins                                                    ; 5 / 624                ; < 1 % ;
;     -- Clock pins                                           ; 0 / 44                 ; 0 %   ;
;     -- Dedicated input pins                                 ; 3 / 54                 ; 6 %   ;
; M20K blocks                                                 ; 411 / 1,611            ; 26 %  ;
; Total MLAB memory bits                                      ; 2,160                  ;       ;
; Total block memory bits                                     ; 1,728,444 / 32,993,280 ; 5 %   ;
; Total block memory implementation bits                      ; 8,417,280 / 32,993,280 ; 26 %  ;
; DSP Blocks Needed [=A+B+C-D]                                ; 13 / 846               ; 2 %   ;
;     [A] Total Fixed Point DSP Blocks                        ; 25                     ;       ;
;     [B] Total Floating Point DSP Blocks                     ; 0                      ;       ;
;     [C] Total DSP_PRIME Blocks                              ; 0                      ;       ;
;     [D] Estimate of DSP Blocks recoverable by dense merging ; 12                     ;       ;
; IOPLLs                                                      ; 0 / 15                 ; 0 %   ;
; Global signals                                              ; 1                      ;       ;
; LVDS_RX blocks                                              ; 0 / 192                ; 0 %   ;
; HMC blocks                                                  ; 0 / 8                  ; 0 %   ;
; Maximum fan-out                                             ; 81892                  ;       ;
; Highest non-global fan-out                                  ; 3078                   ;       ;
; Total fan-out                                               ; 787253                 ;       ;
; Average fan-out                                             ; 4.54                   ;       ;
+-------------------------------------------------------------+------------------------+-------+

enjoy-digital commented 3 months ago

NaxRiscv 32-bit on Artix7:

./sqrl_acorn.py --integrated-main-ram-size=0x100 --cpu-type=naxriscv --with-fpu --with-rvc --build

1. Slice Logic
--------------

+----------------------------+-------+-------+------------+-----------+-------+
|          Site Type         |  Used | Fixed | Prohibited | Available | Util% |
+----------------------------+-------+-------+------------+-----------+-------+
| Slice LUTs                 | 23597 |     0 |        800 |    133800 | 17.64 |
|   LUT as Logic             | 18677 |     0 |        800 |    133800 | 13.96 |
|   LUT as Memory            |  4920 |     0 |          0 |     46200 | 10.65 |
|     LUT as Distributed RAM |  4874 |     0 |            |           |       |
|     LUT as Shift Register  |    46 |     0 |            |           |       |
| Slice Registers            | 17373 |     0 |          0 |    269200 |  6.45 |
|   Register as Flip Flop    | 17373 |     0 |          0 |    269200 |  6.45 |
|   Register as Latch        |     0 |     0 |          0 |    269200 |  0.00 |
| F7 Muxes                   |   278 |     0 |        400 |     66900 |  0.42 |
| F8 Muxes                   |     2 |     0 |        200 |     33450 | <0.01 |
+----------------------------+-------+-------+------------+-----------+-------+

1.1 Summary of Registers by Type
--------------------------------

+-------+--------------+-------------+--------------+
| Total | Clock Enable | Synchronous | Asynchronous |
+-------+--------------+-------------+--------------+
| 0     |            _ |           - |            - |
| 0     |            _ |           - |          Set |
| 0     |            _ |           - |        Reset |
| 0     |            _ |         Set |            - |
| 0     |            _ |       Reset |            - |
| 0     |          Yes |           - |            - |
| 87    |          Yes |           - |          Set |
| 1198  |          Yes |           - |        Reset |
| 90    |          Yes |         Set |            - |
| 15998 |          Yes |       Reset |            - |
+-------+--------------+-------------+--------------+

2. Slice Logic Distribution
---------------------------

+--------------------------------------------+-------+-------+------------+-----------+-------+
|                  Site Type                 |  Used | Fixed | Prohibited | Available | Util% |
+--------------------------------------------+-------+-------+------------+-----------+-------+
| Slice                                      |  8074 |     0 |        200 |     33450 | 24.14 |
|   SLICEL                                   |  4886 |     0 |            |           |       |
|   SLICEM                                   |  3188 |     0 |            |           |       |
| LUT as Logic                               | 18677 |     0 |        800 |    133800 | 13.96 |
|   using O5 output only                     |     3 |       |            |           |       |
|   using O6 output only                     | 15109 |       |            |           |       |
|   using O5 and O6                          |  3565 |       |            |           |       |
| LUT as Memory                              |  4920 |     0 |          0 |     46200 | 10.65 |
|   LUT as Distributed RAM                   |  4874 |     0 |            |           |       |
|     using O5 output only                   |     8 |       |            |           |       |
|     using O6 output only                   |  3206 |       |            |           |       |
|     using O5 and O6                        |  1660 |       |            |           |       |
|   LUT as Shift Register                    |    46 |     0 |            |           |       |
|     using O5 output only                   |     9 |       |            |           |       |
|     using O6 output only                   |    12 |       |            |           |       |
|     using O5 and O6                        |    25 |       |            |           |       |
| Slice Registers                            | 17373 |     0 |          0 |    269200 |  6.45 |
|   Register driven from within the Slice    |  7661 |       |            |           |       |
|   Register driven from outside the Slice   |  9712 |       |            |           |       |
|     LUT in front of the register is unused |  5072 |       |            |           |       |
|     LUT in front of the register is used   |  4640 |       |            |           |       |
| Unique Control Sets                        |   466 |       |        200 |     33450 |  1.39 |
+--------------------------------------------+-------+-------+------------+-----------+-------+
* * Note: Available Control Sets calculated as Slice * 1, Review the Control Sets Report for more information regarding control sets.

3. Memory
---------

+-------------------+------+-------+------------+-----------+-------+
|     Site Type     | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile    | 63.5 |     0 |          0 |       365 | 17.40 |
|   RAMB36/FIFO*    |   57 |     0 |          0 |       365 | 15.62 |
|     RAMB36E1 only |   57 |       |            |           |       |
|   RAMB18          |   13 |     0 |          0 |       730 |  1.78 |
|     RAMB18E1 only |   13 |       |            |           |       |
+-------------------+------+-------+------------+-----------+-------+
* Note: Each Block RAM Tile only has one FIFO logic available and therefore can accommodate only one FIFO36E1 or one FIFO18E1. However, if a FIFO18E1 occupies a Block RAM Tile, that tile can still accommodate a RAMB18E1

4. DSP
------

+----------------+------+-------+------------+-----------+-------+
|    Site Type   | Used | Fixed | Prohibited | Available | Util% |
+----------------+------+-------+------------+-----------+-------+
| DSPs           |   13 |     0 |          0 |       740 |  1.76 |
|   DSP48E1 only |   13 |       |            |           |       |
+----------------+------+-------+------------+-----------+-------+
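The fractional Block RAM Tile count in the Memory table above follows from how Vivado counts 7-series block RAM: one tile holds either one RAMB36 or two RAMB18 primitives. A minimal Python check, with values copied from that table and illustrative variable names:

```python
ramb36_used = 57
ramb18_used = 13
tiles_available = 365

# Each RAMB36 occupies a whole tile; each RAMB18 occupies half a tile.
tiles_used = ramb36_used + ramb18_used / 2
util_percent = round(100 * tiles_used / tiles_available, 2)

print(tiles_used, util_percent)
```

This matches the "Block RAM Tile" row of the report (63.5 tiles, 17.40 %).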

enjoy-digital commented 3 months ago

VexRiscv SMP on Artix7:

./sqrl_acorn.py --integrated-main-ram-size=0x100 --cpu-type=vexriscv_smp --build --with-rvc --with-fpu --with-wishbone-memory

1. Slice Logic
--------------

+----------------------------+------+-------+------------+-----------+-------+
|          Site Type         | Used | Fixed | Prohibited | Available | Util% |
+----------------------------+------+-------+------------+-----------+-------+
| Slice LUTs                 | 7755 |     0 |        800 |    133800 |  5.80 |
|   LUT as Logic             | 7644 |     0 |        800 |    133800 |  5.71 |
|   LUT as Memory            |  111 |     0 |          0 |     46200 |  0.24 |
|     LUT as Distributed RAM |   76 |     0 |            |           |       |
|     LUT as Shift Register  |   35 |     0 |            |           |       |
| Slice Registers            | 7840 |     0 |          0 |    269200 |  2.91 |
|   Register as Flip Flop    | 7840 |     0 |          0 |    269200 |  2.91 |
|   Register as Latch        |    0 |     0 |          0 |    269200 |  0.00 |
| F7 Muxes                   |   48 |     0 |        400 |     66900 |  0.07 |
| F8 Muxes                   |    5 |     0 |        200 |     33450 |  0.01 |
+----------------------------+------+-------+------------+-----------+-------+

1.1 Summary of Registers by Type
--------------------------------

+-------+--------------+-------------+--------------+
| Total | Clock Enable | Synchronous | Asynchronous |
+-------+--------------+-------------+--------------+
| 0     |            _ |           - |            - |
| 0     |            _ |           - |          Set |
| 0     |            _ |           - |        Reset |
| 0     |            _ |         Set |            - |
| 0     |            _ |       Reset |            - |
| 0     |          Yes |           - |            - |
| 36    |          Yes |           - |          Set |
| 511   |          Yes |           - |        Reset |
| 120   |          Yes |         Set |            - |
| 7173  |          Yes |       Reset |            - |
+-------+--------------+-------------+--------------+

2. Slice Logic Distribution
---------------------------

+--------------------------------------------+------+-------+------------+-----------+-------+
|                  Site Type                 | Used | Fixed | Prohibited | Available | Util% |
+--------------------------------------------+------+-------+------------+-----------+-------+
| Slice                                      | 3023 |     0 |        200 |     33450 |  9.04 |
|   SLICEL                                   | 1986 |     0 |            |           |       |
|   SLICEM                                   | 1037 |     0 |            |           |       |
| LUT as Logic                               | 7644 |     0 |        800 |    133800 |  5.71 |
|   using O5 output only                     |    8 |       |            |           |       |
|   using O6 output only                     | 5779 |       |            |           |       |
|   using O5 and O6                          | 1857 |       |            |           |       |
| LUT as Memory                              |  111 |     0 |          0 |     46200 |  0.24 |
|   LUT as Distributed RAM                   |   76 |     0 |            |           |       |
|     using O5 output only                   |    0 |       |            |           |       |
|     using O6 output only                   |   20 |       |            |           |       |
|     using O5 and O6                        |   56 |       |            |           |       |
|   LUT as Shift Register                    |   35 |     0 |            |           |       |
|     using O5 output only                   |    3 |       |            |           |       |
|     using O6 output only                   |    0 |       |            |           |       |
|     using O5 and O6                        |   32 |       |            |           |       |
| Slice Registers                            | 7840 |     0 |          0 |    269200 |  2.91 |
|   Register driven from within the Slice    | 3411 |       |            |           |       |
|   Register driven from outside the Slice   | 4429 |       |            |           |       |
|     LUT in front of the register is unused | 2520 |       |            |           |       |
|     LUT in front of the register is used   | 1909 |       |            |           |       |
| Unique Control Sets                        |  172 |       |        200 |     33450 |  0.51 |
+--------------------------------------------+------+-------+------------+-----------+-------+
* * Note: Available Control Sets calculated as Slice * 1, Review the Control Sets Report for more information regarding control sets.

3. Memory
---------

+-------------------+------+-------+------------+-----------+-------+
|     Site Type     | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile    | 16.5 |     0 |          0 |       365 |  4.52 |
|   RAMB36/FIFO*    |   15 |     0 |          0 |       365 |  4.11 |
|     RAMB36E1 only |   15 |       |            |           |       |
|   RAMB18          |    3 |     0 |          0 |       730 |  0.41 |
|     RAMB18E1 only |    3 |       |            |           |       |
+-------------------+------+-------+------------+-----------+-------+
* Note: Each Block RAM Tile only has one FIFO logic available and therefore can accommodate only one FIFO36E1 or one FIFO18E1. However, if a FIFO18E1 occupies a Block RAM Tile, that tile can still accommodate a RAMB18E1

4. DSP
------

+----------------+------+-------+------------+-----------+-------+
|    Site Type   | Used | Fixed | Prohibited | Available | Util% |
+----------------+------+-------+------------+-----------+-------+
| DSPs           |   13 |     0 |          0 |       740 |  1.76 |
|   DSP48E1 only |   13 |       |            |           |       |
+----------------+------+-------+------------+-----------+-------+

trabucayre commented 3 months ago

VexiiRiscv 32bits (rv32imafc) on Intel Agilex5E board

time ./intel_agilex5e_065b_premium_devkit.py  --bus-standard=axi-lite \
    --cpu-type=vexiiriscv --cpu-variant=linux --vexii-args="--with-rvc --with-rvf" \
    --with-coherent-dma --build
+-------------------------------------------------------------+------------------------+-------+
; Resource                                                    ; Usage                  ; %     ;
+-------------------------------------------------------------+------------------------+-------+
; Usage Report Generated after: Place                         ;                        ;       ;
; Logic utilization (ALMs needed / total ALMs on device)      ; 9,609 / 222,400        ; 4 %   ;
; ALMs needed [=A-B+C]                                        ; 9,609                  ;       ;
;     [A] ALMs used in final placement [=a+b+c+d]             ; 12,073 / 222,400       ; 5 %   ;
;         [a] ALMs used for LUT logic and register circuitry  ; 2,494                  ;       ;
;         [b] ALMs used for LUT logic                         ; 6,137                  ;       ;
;         [c] ALMs used for register circuitry                ; 3,442                  ;       ;
;         [d] ALMs used for memory (up to half of total ALMs) ; 0                      ;       ;
;     [B] Estimate of ALMs recoverable by dense packing       ; 2,501 / 222,400        ; 1 %   ;
;     [C] Estimate of ALMs unavailable [=a+b+c+d]             ; 37 / 222,400           ; < 1 % ;
;         [a] Due to location constrained logic               ; 0                      ;       ;
;         [b] Due to LAB-wide signal conflicts                ; 0                      ;       ;
;         [c] Due to LAB input limits                         ; 37                     ;       ;
;         [d] Due to virtual I/Os                             ; 0                      ;       ;
; Difficulty packing design                                   ; Low                    ;       ;
; Total LABs:  partially or completely used                   ; 1,553 / 22,240         ; 7 %   ;
;     -- Logic LABs                                           ; 1,553                  ;       ;
;     -- Memory LABs (up to half of total LABs)               ; 0                      ;       ;
; Combinational ALUT usage for logic                          ; 13,086                 ;       ;
;     -- 8 input functions                                    ; 607                    ;       ;
;     -- 7 input functions                                    ; 107                    ;       ;
;     -- 6 input functions                                    ; 1,710                  ;       ;
;     -- 5 input functions                                    ; 3,141                  ;       ;
;     -- 4 input functions                                    ; 2,550                  ;       ;
;     -- <=3 input functions                                  ; 4,971                  ;       ;
; Combinational ALUT usage for route-throughs                 ; 1,488                  ;       ;
; Dedicated logic registers                                   ; 12,671                 ;       ;
;     -- By type:                                             ;                        ;       ;
;         -- LAB logic registers:                             ;                        ;       ;
;             -- Primary logic registers                      ; 11,872 / 444,800       ; 3 %   ;
;             -- Secondary logic registers                    ; 740 / 444,800          ; < 1 % ;
;         -- Hyper-Registers:                                 ; 59                     ;       ;
; Register control circuitry for power estimation             ; 0                      ;       ;
; ALMs adjustment for power estimation                        ; 1,947                  ;       ;
; I/O pins                                                    ; 5 / 624                ; < 1 % ;
;     -- Clock pins                                           ; 0 / 44                 ; 0 %   ;
;     -- Dedicated input pins                                 ; 3 / 54                 ; 6 %   ;
; M20K blocks                                                 ; 67 / 1,611             ; 4 %   ;
; Total MLAB memory bits                                      ; 0                      ;       ;
; Total block memory bits                                     ; 567,952 / 32,993,280   ; 2 %   ;
; Total block memory implementation bits                      ; 1,372,160 / 32,993,280 ; 4 %   ;
; DSP Blocks Needed [=A+B+C-D]                                ; 2 / 846                ; < 1 % ;
;     [A] Total Fixed Point DSP Blocks                        ; 4                      ;       ;
;     [B] Total Floating Point DSP Blocks                     ; 0                      ;       ;
;     [C] Total DSP_PRIME Blocks                              ; 0                      ;       ;
;     [D] Estimate of DSP Blocks recoverable by dense merging ; 2                      ;       ;
; IOPLLs                                                      ; 0 / 15                 ; 0 %   ;
; Global signals                                              ; 1                      ;       ;
; LVDS_RX blocks                                              ; 0 / 192                ; 0 %   ;
; HMC blocks                                                  ; 0 / 8                  ; 0 %   ;
; Maximum fan-out                                             ; 14107                  ;       ;
; Highest non-global fan-out                                  ; 3342                   ;       ;
; Total fan-out                                               ; 116227                 ;       ;
; Average fan-out                                             ; 4.27                   ;       ;
+-------------------------------------------------------------+------------------------+-------+

Dolu1990 commented 3 months ago

Hi ^^

Logic utilization (ALMs needed / total ALMs on device) 65,574 / 222,400

So yes, now I'm sure there is some memory not being properly inferred into MLAB / block RAM.

Especially when comparing the number of registers: Dedicated logic registers 64,853 vs Slice Registers | 17373

I will take a look at this.
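To put numbers on this comparison, the register counts from the reports above can be expressed as ratios. This is a rough Python sketch with illustrative names. Note that the Artix7 builds were run with --integrated-main-ram-size=0x100 and the Agilex5 builds were not, so the designs are not strictly identical.

```python
# Register counts copied from the reports above:
# Quartus "Dedicated logic registers" vs Vivado "Slice Registers".
registers = {
    "NaxRiscv 32-bit": {"agilex5": 64_853, "artix7": 17_373},
    "VexRiscv SMP": {"agilex5": 8_895, "artix7": 7_840},
}

ratios = {cpu: r["agilex5"] / r["artix7"] for cpu, r in registers.items()}
for cpu, ratio in ratios.items():
    print(f"{cpu}: Agilex5 uses {ratio:.2f}x the registers of Artix7")
```

The VexRiscv SMP build stays within roughly 15 % across the two toolchains, while NaxRiscv needs well over three times as many registers on Agilex5, which would be consistent with some memories being built out of registers.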

trabucayre commented 3 months ago

VexiiRiscv 32bits (rv32imafc) on Artix7

./litex_acorn_baseboard_mini.py --sys-clk-freq 100e6 --bus-standard=axi-lite --cpu-type=vexiiriscv --cpu-variant=linux --vexii-args="--with-rvc --with-rvf" --with-coherent-dma --build --integrated-main-ram-size=0x100

1. Slice Logic
--------------

+----------------------------+------+-------+------------+-----------+-------+
|          Site Type         | Used | Fixed | Prohibited | Available | Util% |
+----------------------------+------+-------+------------+-----------+-------+
| Slice LUTs                 | 8844 |     0 |        800 |    133800 |  6.61 |
|   LUT as Logic             | 8546 |     0 |        800 |    133800 |  6.39 |
|   LUT as Memory            |  298 |     0 |          0 |     46200 |  0.65 |
|     LUT as Distributed RAM |  226 |     0 |            |           |       |
|     LUT as Shift Register  |   72 |     0 |            |           |       |
| Slice Registers            | 7705 |     0 |          0 |    269200 |  2.86 |
|   Register as Flip Flop    | 7705 |     0 |          0 |    269200 |  2.86 |
|   Register as Latch        |    0 |     0 |          0 |    269200 |  0.00 |
| F7 Muxes                   |    0 |     0 |        400 |     66900 |  0.00 |
| F8 Muxes                   |    0 |     0 |        200 |     33450 |  0.00 |
+----------------------------+------+-------+------------+-----------+-------+

1.1 Summary of Registers by Type
--------------------------------

+-------+--------------+-------------+--------------+
| Total | Clock Enable | Synchronous | Asynchronous |
+-------+--------------+-------------+--------------+
| 0     |            _ |           - |            - |
| 0     |            _ |           - |          Set |
| 0     |            _ |           - |        Reset |
| 0     |            _ |         Set |            - |
| 0     |            _ |       Reset |            - |
| 0     |          Yes |           - |            - |
| 47    |          Yes |           - |          Set |
| 626   |          Yes |           - |        Reset |
| 81    |          Yes |         Set |            - |
| 6951  |          Yes |       Reset |            - |
+-------+--------------+-------------+--------------+

2. Slice Logic Distribution
---------------------------

+--------------------------------------------+------+-------+------------+-----------+-------+
|                  Site Type                 | Used | Fixed | Prohibited | Available | Util% |
+--------------------------------------------+------+-------+------------+-----------+-------+
| Slice                                      | 3001 |     0 |        200 |     33450 |  8.97 |
|   SLICEL                                   | 1980 |     0 |            |           |       |
|   SLICEM                                   | 1021 |     0 |            |           |       |
| LUT as Logic                               | 8546 |     0 |        800 |    133800 |  6.39 |
|   using O5 output only                     |    2 |       |            |           |       |
|   using O6 output only                     | 7105 |       |            |           |       |
|   using O5 and O6                          | 1439 |       |            |           |       |
| LUT as Memory                              |  298 |     0 |          0 |     46200 |  0.65 |
|   LUT as Distributed RAM                   |  226 |     0 |            |           |       |
|     using O5 output only                   |    0 |       |            |           |       |
|     using O6 output only                   |    2 |       |            |           |       |
|     using O5 and O6                        |  224 |       |            |           |       |
|   LUT as Shift Register                    |   72 |     0 |            |           |       |
|     using O5 output only                   |   29 |       |            |           |       |
|     using O6 output only                   |   19 |       |            |           |       |
|     using O5 and O6                        |   24 |       |            |           |       |
| Slice Registers                            | 7705 |     0 |          0 |    269200 |  2.86 |
|   Register driven from within the Slice    | 3877 |       |            |           |       |
|   Register driven from outside the Slice   | 3828 |       |            |           |       |
|     LUT in front of the register is unused | 1921 |       |            |           |       |
|     LUT in front of the register is used   | 1907 |       |            |           |       |
| Unique Control Sets                        |  172 |       |        200 |     33450 |  0.51 |
+--------------------------------------------+------+-------+------------+-----------+-------+
* * Note: Available Control Sets calculated as Slice * 1, Review the Control Sets Report for more information regarding control sets.

3. Memory
---------

+-------------------+------+-------+------------+-----------+-------+
|     Site Type     | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile    | 35.5 |     0 |          0 |       365 |  9.73 |
|   RAMB36/FIFO*    |   23 |     0 |          0 |       365 |  6.30 |
|     RAMB36E1 only |   23 |       |            |           |       |
|   RAMB18          |   25 |     0 |          0 |       730 |  3.42 |
|     RAMB18E1 only |   25 |       |            |           |       |
+-------------------+------+-------+------------+-----------+-------+
* Note: Each Block RAM Tile only has one FIFO logic available and therefore can accommodate only one FIFO36E1 or one FIFO18E1. However, if a FIFO18E1 occupies a Block RAM Tile, that tile can still accommodate a RAMB18E1

4. DSP
------

+----------------+------+-------+------------+-----------+-------+
|    Site Type   | Used | Fixed | Prohibited | Available | Util% |
+----------------+------+-------+------------+-----------+-------+
| DSPs           |    4 |     0 |          0 |       740 |  0.54 |
|   DSP48E1 only |    4 |       |            |           |       |
+----------------+------+-------+------------+-----------+-------+

Dolu1990 commented 3 months ago

Hmm, same story for VexiiRiscv. It shows less than for NaxRiscv, but for sure some memories aren't being inferred as memory. I will take a look at that as well.

Dolu1990 commented 3 months ago

Found a few issues:

This reduces the gap quite a lot. Still, with a Debian-capable dual core, very relaxed timings, and retiming disabled, I get 4k more registers used (16% more) than on Artix 7. That is weird, as register usage is a fairly clean metric for comparing things. I'm working on that.

enjoy-digital commented 3 months ago

Thanks @Dolu1990 for the first analysis!

Dolu1990 commented 3 months ago

I pushed the current WIP (https://github.com/enjoy-digital/litex/pull/2011)

enjoy-digital commented 3 months ago

Good, thanks!

Dolu1990 commented 3 months ago

./intel_agilex5e_065b_premium_devkit.py --cpu-type=naxriscv --with-fpu --with-rvc --build
real    40m10.748s
user    113m37.792s
sys     1m12.900s
+----------------------------------------------------------------------------------------------+
; Fitter Resource Usage Summary                                                                ;
+-------------------------------------------------------------+------------------------+-------+
; Resource                                                    ; Usage                  ; %     ;
+-------------------------------------------------------------+------------------------+-------+
; Usage Report Generated after: Place                         ;                        ;       ;
;                                                             ;                        ;       ;
; Logic utilization (ALMs needed / total ALMs on device)      ; 30,153 / 222,400       ; 14 %  ;
; ALMs needed [=A-B+C]                                        ; 30,153                 ;       ;
;     [A] ALMs used in final placement [=a+b+c+d]             ; 34,334 / 222,400       ; 15 %  ;
;         [a] ALMs used for LUT logic and register circuitry  ; 6,155                  ;       ;
;         [b] ALMs used for LUT logic                         ; 12,787                 ;       ;
;         [c] ALMs used for register circuitry                ; 4,752                  ;       ;
;         [d] ALMs used for memory (up to half of total ALMs) ; 10,640                 ;       ;
;     [B] Estimate of ALMs recoverable by dense packing       ; 4,342 / 222,400        ; 2 %   ;
;     [C] Estimate of ALMs unavailable [=a+b+c+d]             ; 161 / 222,400          ; < 1 % ;
;         [a] Due to location constrained logic               ; 0                      ;       ;
;         [b] Due to LAB-wide signal conflicts                ; 30                     ;       ;
;         [c] Due to LAB input limits                         ; 131                    ;       ;
;         [d] Due to virtual I/Os                             ; 0                      ;       ;
;                                                             ;                        ;       ;
; Difficulty packing design                                   ; Low                    ;       ;
;                                                             ;                        ;       ;
; Total LABs:  partially or completely used                   ; 4,073 / 22,240         ; 18 %  ;
;     -- Logic LABs                                           ; 3,009                  ;       ;
;     -- Memory LABs (up to half of total LABs)               ; 1,064                  ;       ;
;                                                             ;                        ;       ;
; Combinational ALUT usage for logic                          ; 30,003                 ;       ;
;     -- 8 input functions                                    ; 1,299                  ;       ;
;     -- 7 input functions                                    ; 131                    ;       ;
;     -- 6 input functions                                    ; 2,429                  ;       ;
;     -- 5 input functions                                    ; 6,582                  ;       ;
;     -- 4 input functions                                    ; 6,888                  ;       ;
;     -- <=3 input functions                                  ; 12,674                 ;       ;
; Combinational ALUT usage for route-throughs                 ; 2,582                  ;       ;
; Memory ALUT usage                                           ; 6,675                  ;       ;
;     -- 64-address deep                                      ; 0                      ;       ;
;     -- 32-address deep                                      ; 0                      ;       ;
;                                                             ;                        ;       ;
;                                                             ;                        ;       ;
; Dedicated logic registers                                   ; 24,900                 ;       ;
;     -- By type:                                             ;                        ;       ;
;         -- LAB logic registers:                             ;                        ;       ;
;             -- Primary logic registers                      ; 21,813 / 444,800       ; 5 %   ;
;             -- Secondary logic registers                    ; 1,361 / 444,800        ; < 1 % ;
;         -- Hyper-Registers:                                 ; 1,726                  ;       ;
;                                                             ;                        ;       ;
; Register control circuitry for power estimation             ; 20,935                 ;       ;
;                                                             ;                        ;       ;
; ALMs adjustment for power estimation                        ; 3,081                  ;       ;
;                                                             ;                        ;       ;
; I/O pins                                                    ; 63 / 624               ; 10 %  ;
;     -- Clock pins                                           ; 1 / 44                 ; 2 %   ;
;     -- Dedicated input pins                                 ; 0 / 54                 ; 0 %   ;
;                                                             ;                        ;       ;
; M20K blocks                                                 ; 129 / 1,611            ; 8 %   ;
; Total MLAB memory bits                                      ; 197,064                ;       ;
; Total block memory bits                                     ; 1,663,332 / 32,993,280 ; 5 %   ;
; Total block memory implementation bits                      ; 2,641,920 / 32,993,280 ; 8 %   ;
;                                                             ;                        ;       ;
; DSP Blocks Needed [=A+B+C-D]                                ; 7 / 846                ; < 1 % ;
;     [A] Total Fixed Point DSP Blocks                        ; 13                     ;       ;
;     [B] Total Floating Point DSP Blocks                     ; 0                      ;       ;
;     [C] Total DSP_PRIME Blocks                              ; 0                      ;       ;
;     [D] Estimate of DSP Blocks recoverable by dense merging ; 6                      ;       ;
;                                                             ;                        ;       ;
; IOPLLs                                                      ; 1 / 15                 ; 7 %   ;
; Global signals                                              ; 2                      ;       ;
; LVDS_RX blocks                                              ; 0 / 192                ; 0 %   ;
; HMC blocks                                                  ; 1 / 8                  ; 13 %  ;
; Maximum fan-out                                             ; 34130                  ;       ;
; Highest non-global fan-out                                  ; 1259                   ;       ;
; Total fan-out                                               ; 298517                 ;       ;
; Average fan-out                                             ; 4.98                   ;       ;
+-------------------------------------------------------------+------------------------+-------+

So ALM usage is reduced by 53%, but something is still wrong. I will take a look.

Dolu1990 commented 3 months ago

Updated https://github.com/enjoy-digital/litex/pull/2011

NaxRiscv lost some weight :


+----------------------------------------------------------------------------------------------+
; Fitter Resource Usage Summary                                                                ;
+-------------------------------------------------------------+------------------------+-------+
; Resource                                                    ; Usage                  ; %     ;
+-------------------------------------------------------------+------------------------+-------+
; Usage Report Generated after: Place                         ;                        ;       ;
;                                                             ;                        ;       ;
; Logic utilization (ALMs needed / total ALMs on device)      ; 25,814 / 222,400       ; 12 %  ;
; ALMs needed [=A-B+C]                                        ; 25,814                 ;       ;
;     [A] ALMs used in final placement [=a+b+c+d]             ; 29,880 / 222,400       ; 13 %  ;
;         [a] ALMs used for LUT logic and register circuitry  ; 6,330                  ;       ;
;         [b] ALMs used for LUT logic                         ; 13,393                 ;       ;
;         [c] ALMs used for register circuitry                ; 4,187                  ;       ;
;         [d] ALMs used for memory (up to half of total ALMs) ; 5,970                  ;       ;
;     [B] Estimate of ALMs recoverable by dense packing       ; 4,222 / 222,400        ; 2 %   ;
;     [C] Estimate of ALMs unavailable [=a+b+c+d]             ; 156 / 222,400          ; < 1 % ;
;         [a] Due to location constrained logic               ; 0                      ;       ;
;         [b] Due to LAB-wide signal conflicts                ; 26                     ;       ;
;         [c] Due to LAB input limits                         ; 130                    ;       ;
;         [d] Due to virtual I/Os                             ; 0                      ;       ;
;                                                             ;                        ;       ;
; Difficulty packing design                                   ; Low                    ;       ;
;                                                             ;                        ;       ;
; Total LABs:  partially or completely used                   ; 3,520 / 22,240         ; 16 %  ;
;     -- Logic LABs                                           ; 2,923                  ;       ;
;     -- Memory LABs (up to half of total LABs)               ; 597                    ;       ;
;                                                             ;                        ;       ;
; Combinational ALUT usage for logic                          ; 31,218                 ;       ;
;     -- 8 input functions                                    ; 1,333                  ;       ;
;     -- 7 input functions                                    ; 131                    ;       ;
;     -- 6 input functions                                    ; 2,401                  ;       ;
;     -- 5 input functions                                    ; 7,167                  ;       ;
;     -- 4 input functions                                    ; 7,191                  ;       ;
;     -- <=3 input functions                                  ; 12,995                 ;       ;
; Combinational ALUT usage for route-throughs                 ; 2,520                  ;       ;
; Memory ALUT usage                                           ; 6,208                  ;       ;
;     -- 64-address deep                                      ; 0                      ;       ;
;     -- 32-address deep                                      ; 0                      ;       ;
;                                                             ;                        ;       ;
;                                                             ;                        ;       ;
; Dedicated logic registers                                   ; 24,104                 ;       ;
;     -- By type:                                             ;                        ;       ;
;         -- LAB logic registers:                             ;                        ;       ;
;             -- Primary logic registers                      ; 21,034 / 444,800       ; 5 %   ;
;             -- Secondary logic registers                    ; 1,354 / 444,800        ; < 1 % ;
;         -- Hyper-Registers:                                 ; 1,716                  ;       ;
;                                                             ;                        ;       ;
; Register control circuitry for power estimation             ; 11,595                 ;       ;
;                                                             ;                        ;       ;
; ALMs adjustment for power estimation                        ; 2,797                  ;       ;
;                                                             ;                        ;       ;
; I/O pins                                                    ; 63 / 624               ; 10 %  ;
;     -- Clock pins                                           ; 1 / 44                 ; 2 %   ;
;     -- Dedicated input pins                                 ; 0 / 54                 ; 0 %   ;
;                                                             ;                        ;       ;
; M20K blocks                                                 ; 129 / 1,611            ; 8 %   ;
; Total MLAB memory bits                                      ; 182,092                ;       ;
; Total block memory bits                                     ; 1,663,336 / 32,993,280 ; 5 %   ;
; Total block memory implementation bits                      ; 2,641,920 / 32,993,280 ; 8 %   ;
;                                                             ;                        ;       ;
; DSP Blocks Needed [=A+B+C-D]                                ; 7 / 846                ; < 1 % ;
;     [A] Total Fixed Point DSP Blocks                        ; 13                     ;       ;
;     [B] Total Floating Point DSP Blocks                     ; 0                      ;       ;
;     [C] Total DSP_PRIME Blocks                              ; 0                      ;       ;
;     [D] Estimate of DSP Blocks recoverable by dense merging ; 6                      ;       ;
;                                                             ;                        ;       ;
; IOPLLs                                                      ; 1 / 15                 ; 7 %   ;
; Global signals                                              ; 2                      ;       ;
; LVDS_RX blocks                                              ; 0 / 192                ; 0 %   ;
; HMC blocks                                                  ; 1 / 8                  ; 13 %  ;
; Maximum fan-out                                             ; 32869                  ;       ;
; Highest non-global fan-out                                  ; 1259                   ;       ;
; Total fan-out                                               ; 296698                 ;       ;
; Average fan-out                                             ; 4.97                   ;       ;
+-------------------------------------------------------------+------------------------+-------+

gsteiert commented 2 months ago

Found a few issues :

  • Memories with asynchronous reads aren't being inferred as MLAB (ramstyle = "MLAB, no_rw_check" fixes it)
  • By default, Quartus seems to assume that read-during-write on an MLAB will not happen (to improve timing), at the risk of creating a metastable design!? That seems crazy to me, if I understand it correctly. There is an option to enable, "MLAB Add Timing Constraints For Mixed-Port Feed-Through Mode Setting Don't Care", to avoid this (advanced fitter settings)
  • Memories with more than 2 asynchronous read ports aren't recognized by Quartus (fixed by automatic blackboxification in SpinalHDL + Ram_1w_1ra_Generic.v)

This reduces the gap quite a lot. Still, with a Debian-capable dual core, very relaxed timings, and retiming disabled, I get 4k more registers used (16% more) than on Artix 7. That is weird, as register usage is a fairly clean metric for comparing things. I'm working on that.

@Dolu1990 This is great insight. Can you summarize what needed to change, or point to the changes you made in the repo for this? I would like to share this with our tools team.

Dolu1990 commented 2 months ago

@gsteiert Related to the lack of MLAB inference, I had to:

That was mostly it. Another "big" change I made was to rework the NaxRiscv register renaming design: https://github.com/SpinalHDL/NaxRiscv/commit/ba63ee6dfb063e0e6b1c8da51071a85dea9f934b

The "MLAB Add Timing Constraints For Mixed-Port Feed-Through Mode Setting Don't Care" setting isn't fixed as far as I know; that would be something to handle in the Quartus project itself.

Dolu1990 commented 2 months ago

Note that the SpinalHDL blackboxification of memories will decompose the 1 write + 3 async read into 3 * blackbox(1 write + 1 async read)
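
For readers who want to see what that decomposition amounts to, here is a minimal hand-written sketch. It is not the SpinalHDL output and not Ram_1w_1ra_Generic.v; the module name, port names, and packed read buses are hypothetical, and it assumes DEPTH is at least 2. It only illustrates the idea: three copies of the memory share one write port, and each copy serves one asynchronous read port.

```verilog
// Hypothetical illustration: 1 write + 3 async reads built from
// three identical (1 write + 1 async read) memories.
module ram_1w_3ra #(
    parameter WIDTH = 32,
    parameter DEPTH = 32   // assumed >= 2
) (
    input  wire                       clk,
    input  wire                       wr_en,
    input  wire [$clog2(DEPTH)-1:0]   wr_addr,
    input  wire [WIDTH-1:0]           wr_data,
    input  wire [3*$clog2(DEPTH)-1:0] rd_addr,  // three read addresses, packed
    output wire [3*WIDTH-1:0]         rd_data   // three read results, packed
);
    localparam AW = $clog2(DEPTH);

    genvar i;
    generate
        for (i = 0; i < 3; i = i + 1) begin : g_copy
            // Every copy sees the same writes, so all copies hold the same data.
            reg [WIDTH-1:0] ram_block [0:DEPTH-1];

            always @(posedge clk) begin
                if (wr_en) begin
                    ram_block[wr_addr] <= wr_data;
                end
            end

            // One asynchronous read port per copy.
            assign rd_data[i*WIDTH +: WIDTH] = ram_block[rd_addr[i*AW +: AW]];
        end
    endgenerate
endmodule
```

The cost is three times the storage, in exchange for each copy having a shape (1 write, 1 async read) that the tool can map to a single memory primitive.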

Dolu1990 commented 2 months ago

More generally, when I have this Verilog:

    always @ (posedge clk) begin
        if(wr_en) begin
            ram_block[wr_addr] <= wr_data;
        end
    end

    assign rd_data = ram_block[rd_addr];

I assume that the synthesis tools will infer it as LUT-based RAM and that:

I did have some issues with the Xilinx tools in the past, where they inferred it as a block RAM (by merging in the register that drives rd_addr), violating the rule:

I have to say, it is nice that NaxRiscv worked on Altera FPGA without having to debug things ^.^

Dolu1990 commented 2 months ago

I may have missed the reason why, without the (ramstyle = "MLAB") tag, Quartus refused to infer these memories as MLAB. Fundamentally, that was the main thing making the area crazy high.
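
As a rough sketch of what the tagged memory could look like, here is the async-read pattern from this thread wrapped in a complete module with the attribute applied. The module name, parameters, and port widths are assumptions added for illustration. Only the attribute string "MLAB, no_rw_check" and the signal names come from the comments above, and the Verilog-2001 attribute placement on the array declaration is the usual Quartus form rather than something confirmed here.

```verilog
// Hypothetical wrapper around the async-read memory discussed in this thread.
module ram_1w_1ra #(
    parameter WIDTH = 32,
    parameter DEPTH = 32   // assumed >= 2
) (
    input  wire                     clk,
    input  wire                     wr_en,
    input  wire [$clog2(DEPTH)-1:0] wr_addr,
    input  wire [WIDTH-1:0]         wr_data,
    input  wire [$clog2(DEPTH)-1:0] rd_addr,
    output wire [WIDTH-1:0]         rd_data
);
    // Synthesis attribute asking Quartus for an MLAB implementation
    // without extra read-during-write checking logic.
    (* ramstyle = "MLAB, no_rw_check" *) reg [WIDTH-1:0] ram_block [0:DEPTH-1];

    // Synchronous write.
    always @(posedge clk) begin
        if (wr_en) begin
            ram_block[wr_addr] <= wr_data;
        end
    end

    // Asynchronous (combinational) read.
    assign rd_data = ram_block[rd_addr];
endmodule
```

Other tools generally ignore attributes they do not recognize, so the same source should stay portable, but that is worth checking per toolchain.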