Open trabucayre opened 3 months ago
./intel_agilex5e_065b_premium_devkit.py --cpu-type=naxriscv --with-fpu --with-rvc --build
real 40m10.748s
user 113m37.792s
sys 1m12.900s
Resource | Usage | % |
---|---|---|
Usage Report Generated after: Place | ||
Logic utilization (ALMs needed / total ALMs on device) | 65,574 / 222,400 | 29 % |
ALMs needed [=A-B+C] | 65,574 | |
[A] ALMs used in final placement [=a+b+c+d] | 80,052 / 222,400 | 36 % |
[a] ALMs used for LUT logic and register circuitry | 7,572 | |
[b] ALMs used for LUT logic | 49,323 | |
[c] ALMs used for register circuitry | 23,027 | |
[d] ALMs used for memory (up to half of total ALMs) | 130 | |
[B] Estimate of ALMs recoverable by dense packing | 14,731 / 222,400 | 7 % |
[C] Estimate of ALMs unavailable [=a+b+c+d] | 253 / 222,400 | < 1 % |
[a] Due to location constrained logic | 0 | |
[b] Due to LAB-wide signal conflicts | 0 | |
[c] Due to LAB input limits | 253 | |
[d] Due to virtual I/Os | 0 | |
Difficulty packing design | Low | |
Total LABs: partially or completely used | 13,842 / 22,240 | 62 % |
-- Logic LABs | 13,829 | |
-- Memory LABs (up to half of total LABs) | 13 | |
Combinational ALUT usage for logic | 70,349 | |
-- 8 input functions | 13,013 | |
-- 7 input functions | 175 | |
-- 6 input functions | 18,159 | |
-- 5 input functions | 18,685 | |
-- 4 input functions | 9,008 | |
-- <=3 input functions | 11,309 | |
Combinational ALUT usage for route-throughs | 7,884 | |
Memory ALUT usage | 73 | |
-- 64-address deep | 0 | |
-- 32-address deep | 0 | |
Dedicated logic registers | 64,853 | |
-- By type: | ||
-- LAB logic registers: | ||
-- Primary logic registers | 61,197 / 444,800 | 14 % |
-- Secondary logic registers | 2,441 / 444,800 | < 1 % |
-- Hyper-Registers: | 1,215 | |
Register control circuitry for power estimation | 260 | |
ALMs adjustment for power estimation | 16,160 | |
I/O pins | 5 / 624 | < 1 % |
-- Clock pins | 0 / 44 | 0 % |
-- Dedicated input pins | 3 / 54 | 6 % |
M20K blocks | 405 / 1,611 | 25 % |
Total MLAB memory bits | 2,160 | |
Total block memory bits | 1,718,140 / 32,993,280 | 5 % |
Total block memory implementation bits | 8,294,400 / 32,993,280 | 25 % |
DSP Blocks Needed [=A+B+C-D] | 7 / 846 | < 1 % |
[A] Total Fixed Point DSP Blocks | 13 | |
[B] Total Floating Point DSP Blocks | 0 | |
[C] Total DSP_PRIME Blocks | 0 | |
[D] Estimate of DSP Blocks recoverable by dense merging | 6 | |
IOPLLs | 0 / 15 | 0 % |
Global signals | 1 | |
LVDS_RX blocks | 0 / 192 | 0 % |
HMC blocks | 0 / 8 | 0 % |
Maximum fan-out | 68871 | |
Highest non-global fan-out | 2174 | |
Total fan-out | 622733 | |
Average fan-out | 4.48 |
time ./intel_agilex5e_065b_premium_devkit.py --cpu-type=naxriscv --xlen=64 --with-fpu --with-rvc --build
real 55m39.381s
user 157m54.672s
sys 2m9.537s
Resource | Usage | % |
---|---|---|
Usage Report Generated after: Place | ||
Logic utilization (ALMs needed / total ALMs on device) | 86,509 / 222,400 | 39 % |
ALMs needed [=A-B+C] | 86,509 | |
[A] ALMs used in final placement [=a+b+c+d] | 104,414 / 222,400 | 47 % |
[a] ALMs used for LUT logic and register circuitry | 8,717 | |
[b] ALMs used for LUT logic | 67,225 | |
[c] ALMs used for register circuitry | 28,342 | |
[d] ALMs used for memory (up to half of total ALMs) | 130 | |
[B] Estimate of ALMs recoverable by dense packing | 18,239 / 222,400 | 8 % |
[C] Estimate of ALMs unavailable [=a+b+c+d] | 334 / 222,400 | < 1 % |
[a] Due to location constrained logic | 0 | |
[b] Due to LAB-wide signal conflicts | 0 | |
[c] Due to LAB input limits | 334 | |
[d] Due to virtual I/Os | 0 | |
Difficulty packing design | Low | |
Total LABs: partially or completely used | 18,538 / 22,240 | 83 % |
-- Logic LABs | 18,525 | |
-- Memory LABs (up to half of total LABs) | 13 | |
Combinational ALUT usage for logic | 91,513 | |
-- 8 input functions | 18,566 | |
-- 7 input functions | 246 | |
-- 6 input functions | 25,000 | |
-- 5 input functions | 24,771 | |
-- 4 input functions | 9,873 | |
-- <=3 input functions | 13,057 | |
Combinational ALUT usage for route-throughs | 9,270 | |
Memory ALUT usage | 73 | |
-- 64-address deep | 0 | |
-- 32-address deep | 0 | |
Dedicated logic registers | 77,750 | |
-- By type: | ||
-- LAB logic registers: | ||
-- Primary logic registers | 74,117 / 444,800 | 17 % |
-- Secondary logic registers | 2,365 / 444,800 | < 1 % |
-- Hyper-Registers: | 1,268 | |
Register control circuitry for power estimation | 260 | |
ALMs adjustment for power estimation | 20,844 | |
I/O pins | 5 / 624 | < 1 % |
-- Clock pins | 0 / 44 | 0 % |
-- Dedicated input pins | 3 / 54 | 6 % |
M20K blocks | 411 / 1,611 | 26 % |
Total MLAB memory bits | 2,160 | |
Total block memory bits | 1,728,444 / 32,993,280 | 5 % |
Total block memory implementation bits | 8,417,280 / 32,993,280 | 26 % |
DSP Blocks Needed [=A+B+C-D] | 13 / 846 | 2 % |
[A] Total Fixed Point DSP Blocks | 25 | |
[B] Total Floating Point DSP Blocks | 0 | |
[C] Total DSP_PRIME Blocks | 0 | |
[D] Estimate of DSP Blocks recoverable by dense merging | 12 | |
IOPLLs | 0 / 15 | 0 % |
Global signals | 1 | |
LVDS_RX blocks | 0 / 192 | 0 % |
HMC blocks | 0 / 8 | 0 % |
Maximum fan-out | 81892 | |
Highest non-global fan-out | 3078 | |
Total fan-out | 787253 | |
Average fan-out | 4.54 |
./sqrl_acorn.py --integrated-main-ram-size=0x100 --cpu-type=naxriscv --with-fpu --with-rvc --build
1. Slice Logic
--------------
+----------------------------+-------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------------------+-------+-------+------------+-----------+-------+
| Slice LUTs | 23597 | 0 | 800 | 133800 | 17.64 |
| LUT as Logic | 18677 | 0 | 800 | 133800 | 13.96 |
| LUT as Memory | 4920 | 0 | 0 | 46200 | 10.65 |
| LUT as Distributed RAM | 4874 | 0 | | | |
| LUT as Shift Register | 46 | 0 | | | |
| Slice Registers | 17373 | 0 | 0 | 269200 | 6.45 |
| Register as Flip Flop | 17373 | 0 | 0 | 269200 | 6.45 |
| Register as Latch | 0 | 0 | 0 | 269200 | 0.00 |
| F7 Muxes | 278 | 0 | 400 | 66900 | 0.42 |
| F8 Muxes | 2 | 0 | 200 | 33450 | <0.01 |
+----------------------------+-------+-------+------------+-----------+-------+
1.1 Summary of Registers by Type
--------------------------------
+-------+--------------+-------------+--------------+
| Total | Clock Enable | Synchronous | Asynchronous |
+-------+--------------+-------------+--------------+
| 0 | _ | - | - |
| 0 | _ | - | Set |
| 0 | _ | - | Reset |
| 0 | _ | Set | - |
| 0 | _ | Reset | - |
| 0 | Yes | - | - |
| 87 | Yes | - | Set |
| 1198 | Yes | - | Reset |
| 90 | Yes | Set | - |
| 15998 | Yes | Reset | - |
+-------+--------------+-------------+--------------+
2. Slice Logic Distribution
---------------------------
+--------------------------------------------+-------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+--------------------------------------------+-------+-------+------------+-----------+-------+
| Slice | 8074 | 0 | 200 | 33450 | 24.14 |
| SLICEL | 4886 | 0 | | | |
| SLICEM | 3188 | 0 | | | |
| LUT as Logic | 18677 | 0 | 800 | 133800 | 13.96 |
| using O5 output only | 3 | | | | |
| using O6 output only | 15109 | | | | |
| using O5 and O6 | 3565 | | | | |
| LUT as Memory | 4920 | 0 | 0 | 46200 | 10.65 |
| LUT as Distributed RAM | 4874 | 0 | | | |
| using O5 output only | 8 | | | | |
| using O6 output only | 3206 | | | | |
| using O5 and O6 | 1660 | | | | |
| LUT as Shift Register | 46 | 0 | | | |
| using O5 output only | 9 | | | | |
| using O6 output only | 12 | | | | |
| using O5 and O6 | 25 | | | | |
| Slice Registers | 17373 | 0 | 0 | 269200 | 6.45 |
| Register driven from within the Slice | 7661 | | | | |
| Register driven from outside the Slice | 9712 | | | | |
| LUT in front of the register is unused | 5072 | | | | |
| LUT in front of the register is used | 4640 | | | | |
| Unique Control Sets | 466 | | 200 | 33450 | 1.39 |
+--------------------------------------------+-------+-------+------------+-----------+-------+
* * Note: Available Control Sets calculated as Slice * 1, Review the Control Sets Report for more information regarding control sets.
3. Memory
---------
+-------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile | 63.5 | 0 | 0 | 365 | 17.40 |
| RAMB36/FIFO* | 57 | 0 | 0 | 365 | 15.62 |
| RAMB36E1 only | 57 | | | | |
| RAMB18 | 13 | 0 | 0 | 730 | 1.78 |
| RAMB18E1 only | 13 | | | | |
+-------------------+------+-------+------------+-----------+-------+
* Note: Each Block RAM Tile only has one FIFO logic available and therefore can accommodate only one FIFO36E1 or one FIFO18E1. However, if a FIFO18E1 occupies a Block RAM Tile, that tile can still accommodate a RAMB18E1
4. DSP
------
+----------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------+------+-------+------------+-----------+-------+
| DSPs | 13 | 0 | 0 | 740 | 1.76 |
| DSP48E1 only | 13 | | | | |
+----------------+------+-------+------------+-----------+-------+
./sqrl_acorn.py --integrated-main-ram-size=0x100 --cpu-type=vexriscv_smp --build --with-rvc --with-fpu --with-wishbone-memory
1. Slice Logic
--------------
+----------------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------------------+------+-------+------------+-----------+-------+
| Slice LUTs | 7755 | 0 | 800 | 133800 | 5.80 |
| LUT as Logic | 7644 | 0 | 800 | 133800 | 5.71 |
| LUT as Memory | 111 | 0 | 0 | 46200 | 0.24 |
| LUT as Distributed RAM | 76 | 0 | | | |
| LUT as Shift Register | 35 | 0 | | | |
| Slice Registers | 7840 | 0 | 0 | 269200 | 2.91 |
| Register as Flip Flop | 7840 | 0 | 0 | 269200 | 2.91 |
| Register as Latch | 0 | 0 | 0 | 269200 | 0.00 |
| F7 Muxes | 48 | 0 | 400 | 66900 | 0.07 |
| F8 Muxes | 5 | 0 | 200 | 33450 | 0.01 |
+----------------------------+------+-------+------------+-----------+-------+
1.1 Summary of Registers by Type
--------------------------------
+-------+--------------+-------------+--------------+
| Total | Clock Enable | Synchronous | Asynchronous |
+-------+--------------+-------------+--------------+
| 0 | _ | - | - |
| 0 | _ | - | Set |
| 0 | _ | - | Reset |
| 0 | _ | Set | - |
| 0 | _ | Reset | - |
| 0 | Yes | - | - |
| 36 | Yes | - | Set |
| 511 | Yes | - | Reset |
| 120 | Yes | Set | - |
| 7173 | Yes | Reset | - |
+-------+--------------+-------------+--------------+
2. Slice Logic Distribution
---------------------------
+--------------------------------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+--------------------------------------------+------+-------+------------+-----------+-------+
| Slice | 3023 | 0 | 200 | 33450 | 9.04 |
| SLICEL | 1986 | 0 | | | |
| SLICEM | 1037 | 0 | | | |
| LUT as Logic | 7644 | 0 | 800 | 133800 | 5.71 |
| using O5 output only | 8 | | | | |
| using O6 output only | 5779 | | | | |
| using O5 and O6 | 1857 | | | | |
| LUT as Memory | 111 | 0 | 0 | 46200 | 0.24 |
| LUT as Distributed RAM | 76 | 0 | | | |
| using O5 output only | 0 | | | | |
| using O6 output only | 20 | | | | |
| using O5 and O6 | 56 | | | | |
| LUT as Shift Register | 35 | 0 | | | |
| using O5 output only | 3 | | | | |
| using O6 output only | 0 | | | | |
| using O5 and O6 | 32 | | | | |
| Slice Registers | 7840 | 0 | 0 | 269200 | 2.91 |
| Register driven from within the Slice | 3411 | | | | |
| Register driven from outside the Slice | 4429 | | | | |
| LUT in front of the register is unused | 2520 | | | | |
| LUT in front of the register is used | 1909 | | | | |
| Unique Control Sets | 172 | | 200 | 33450 | 0.51 |
+--------------------------------------------+------+-------+------------+-----------+-------+
* * Note: Available Control Sets calculated as Slice * 1, Review the Control Sets Report for more information regarding control sets.
3. Memory
---------
+-------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile | 16.5 | 0 | 0 | 365 | 4.52 |
| RAMB36/FIFO* | 15 | 0 | 0 | 365 | 4.11 |
| RAMB36E1 only | 15 | | | | |
| RAMB18 | 3 | 0 | 0 | 730 | 0.41 |
| RAMB18E1 only | 3 | | | | |
+-------------------+------+-------+------------+-----------+-------+
* Note: Each Block RAM Tile only has one FIFO logic available and therefore can accommodate only one FIFO36E1 or one FIFO18E1. However, if a FIFO18E1 occupies a Block RAM Tile, that tile can still accommodate a RAMB18E1
4. DSP
------
+----------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------+------+-------+------------+-----------+-------+
| DSPs | 13 | 0 | 0 | 740 | 1.76 |
| DSP48E1 only | 13 | | | | |
+----------------+------+-------+------------+-----------+-------+
time ./intel_agilex5e_065b_premium_devkit.py --bus-standard=axi-lite \
--cpu-type=vexiiriscv --cpu-variant=linux --vexii-args="--with-rvc --with-rvf" \
--with-coherent-dma --build
Resource | Usage | % |
---|---|---|
Usage Report Generated after: Place | ||
Logic utilization (ALMs needed / total ALMs on device) | 9,609 / 222,400 | 4 % |
ALMs needed [=A-B+C] | 9,609 | |
[A] ALMs used in final placement [=a+b+c+d] | 12,073 / 222,400 | 5 % |
[a] ALMs used for LUT logic and register circuitry | 2,494 | |
[b] ALMs used for LUT logic | 6,137 | |
[c] ALMs used for register circuitry | 3,442 | |
[d] ALMs used for memory (up to half of total ALMs) | 0 | |
[B] Estimate of ALMs recoverable by dense packing | 2,501 / 222,400 | 1 % |
[C] Estimate of ALMs unavailable [=a+b+c+d] | 37 / 222,400 | < 1 % |
[a] Due to location constrained logic | 0 | |
[b] Due to LAB-wide signal conflicts | 0 | |
[c] Due to LAB input limits | 37 | |
[d] Due to virtual I/Os | 0 | |
Difficulty packing design | Low | |
Total LABs: partially or completely used | 1,553 / 22,240 | 7 % |
-- Logic LABs | 1,553 | |
-- Memory LABs (up to half of total LABs) | 0 | |
Combinational ALUT usage for logic | 13,086 | |
-- 8 input functions | 607 | |
-- 7 input functions | 107 | |
-- 6 input functions | 1,710 | |
-- 5 input functions | 3,141 | |
-- 4 input functions | 2,550 | |
-- <=3 input functions | 4,971 | |
Combinational ALUT usage for route-throughs | 1,488 | |
Dedicated logic registers | 12,671 | |
-- By type: | ||
-- LAB logic registers: | ||
-- Primary logic registers | 11,872 / 444,800 | 3 % |
-- Secondary logic registers | 740 / 444,800 | < 1 % |
-- Hyper-Registers: | 59 | |
Register control circuitry for power estimation | 0 | |
ALMs adjustment for power estimation | 1,947 | |
I/O pins | 5 / 624 | < 1 % |
-- Clock pins | 0 / 44 | 0 % |
-- Dedicated input pins | 3 / 54 | 6 % |
M20K blocks | 67 / 1,611 | 4 % |
Total MLAB memory bits | 0 | |
Total block memory bits | 567,952 / 32,993,280 | 2 % |
Total block memory implementation bits | 1,372,160 / 32,993,280 | 4 % |
DSP Blocks Needed [=A+B+C-D] | 2 / 846 | < 1 % |
[A] Total Fixed Point DSP Blocks | 4 | |
[B] Total Floating Point DSP Blocks | 0 | |
[C] Total DSP_PRIME Blocks | 0 | |
[D] Estimate of DSP Blocks recoverable by dense merging | 2 | |
IOPLLs | 0 / 15 | 0 % |
Global signals | 1 | |
LVDS_RX blocks | 0 / 192 | 0 % |
HMC blocks | 0 / 8 | 0 % |
Maximum fan-out | 14107 | |
Highest non-global fan-out | 3342 | |
Total fan-out | 116227 | |
Average fan-out | 4.27 |
Hi ^^
Logic utilization (ALMs needed / total ALMs on device) 65,574 / 222,400
So yes, now I'm sure some memories are not being properly inferred into MLAB / block RAM.
Especially when comparing the number of registers: Dedicated logic registers 64,853 vs Slice Registers | 17373
I will take a look at this.
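To put numbers on that comparison, here is a quick sketch that uses only figures quoted in the two NaxRiscv reports above (the variable names are mine). The two builds target different boards, so this is an indication rather than a strict like-for-like measurement.

```python
# Figures copied from the reports above (NaxRiscv, FPU + RVC, 32-bit).
agilex_registers = 64_853     # Quartus: "Dedicated logic registers"
artix_registers = 17_373      # Vivado:  "Slice Registers"
agilex_memory_aluts = 73      # Quartus: "Memory ALUT usage"
artix_lutram = 4_874          # Vivado:  "LUT as Distributed RAM"

extra_registers = agilex_registers - artix_registers
register_ratio = agilex_registers / artix_registers

print(f"extra registers on Agilex 5: {extra_registers}")
print(f"register ratio Agilex 5 / Artix 7: {register_ratio:.2f}x")
print(f"LUT-based RAM: {agilex_memory_aluts} memory ALUTs vs {artix_lutram} LUTRAM LUTs")
```

The near absence of memory ALUTs on the Agilex side, next to several thousand distributed-RAM LUTs on the Artix 7 side, is consistent with small memories being built out of registers instead.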
./litex_acorn_baseboard_mini.py --sys-clk-freq 100e6 --bus-standard=axi-lite --cpu-type=vexiiriscv --cpu-variant=linux --vexii-args="--with-rvc --with-rvf" --with-coherent-dma --build --integrated-main-ram-size=0x100
1. Slice Logic
--------------
+----------------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------------------+------+-------+------------+-----------+-------+
| Slice LUTs | 8844 | 0 | 800 | 133800 | 6.61 |
| LUT as Logic | 8546 | 0 | 800 | 133800 | 6.39 |
| LUT as Memory | 298 | 0 | 0 | 46200 | 0.65 |
| LUT as Distributed RAM | 226 | 0 | | | |
| LUT as Shift Register | 72 | 0 | | | |
| Slice Registers | 7705 | 0 | 0 | 269200 | 2.86 |
| Register as Flip Flop | 7705 | 0 | 0 | 269200 | 2.86 |
| Register as Latch | 0 | 0 | 0 | 269200 | 0.00 |
| F7 Muxes | 0 | 0 | 400 | 66900 | 0.00 |
| F8 Muxes | 0 | 0 | 200 | 33450 | 0.00 |
+----------------------------+------+-------+------------+-----------+-------+
1.1 Summary of Registers by Type
--------------------------------
+-------+--------------+-------------+--------------+
| Total | Clock Enable | Synchronous | Asynchronous |
+-------+--------------+-------------+--------------+
| 0 | _ | - | - |
| 0 | _ | - | Set |
| 0 | _ | - | Reset |
| 0 | _ | Set | - |
| 0 | _ | Reset | - |
| 0 | Yes | - | - |
| 47 | Yes | - | Set |
| 626 | Yes | - | Reset |
| 81 | Yes | Set | - |
| 6951 | Yes | Reset | - |
+-------+--------------+-------------+--------------+
2. Slice Logic Distribution
---------------------------
+--------------------------------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+--------------------------------------------+------+-------+------------+-----------+-------+
| Slice | 3001 | 0 | 200 | 33450 | 8.97 |
| SLICEL | 1980 | 0 | | | |
| SLICEM | 1021 | 0 | | | |
| LUT as Logic | 8546 | 0 | 800 | 133800 | 6.39 |
| using O5 output only | 2 | | | | |
| using O6 output only | 7105 | | | | |
| using O5 and O6 | 1439 | | | | |
| LUT as Memory | 298 | 0 | 0 | 46200 | 0.65 |
| LUT as Distributed RAM | 226 | 0 | | | |
| using O5 output only | 0 | | | | |
| using O6 output only | 2 | | | | |
| using O5 and O6 | 224 | | | | |
| LUT as Shift Register | 72 | 0 | | | |
| using O5 output only | 29 | | | | |
| using O6 output only | 19 | | | | |
| using O5 and O6 | 24 | | | | |
| Slice Registers | 7705 | 0 | 0 | 269200 | 2.86 |
| Register driven from within the Slice | 3877 | | | | |
| Register driven from outside the Slice | 3828 | | | | |
| LUT in front of the register is unused | 1921 | | | | |
| LUT in front of the register is used | 1907 | | | | |
| Unique Control Sets | 172 | | 200 | 33450 | 0.51 |
+--------------------------------------------+------+-------+------------+-----------+-------+
* * Note: Available Control Sets calculated as Slice * 1, Review the Control Sets Report for more information regarding control sets.
3. Memory
---------
+-------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile | 35.5 | 0 | 0 | 365 | 9.73 |
| RAMB36/FIFO* | 23 | 0 | 0 | 365 | 6.30 |
| RAMB36E1 only | 23 | | | | |
| RAMB18 | 25 | 0 | 0 | 730 | 3.42 |
| RAMB18E1 only | 25 | | | | |
+-------------------+------+-------+------------+-----------+-------+
* Note: Each Block RAM Tile only has one FIFO logic available and therefore can accommodate only one FIFO36E1 or one FIFO18E1. However, if a FIFO18E1 occupies a Block RAM Tile, that tile can still accommodate a RAMB18E1
4. DSP
------
+----------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------+------+-------+------------+-----------+-------+
| DSPs | 4 | 0 | 0 | 740 | 0.54 |
| DSP48E1 only | 4 | | | | |
+----------------+------+-------+------------+-----------+-------+
Hmm, same story for VexiiRiscv. It shows less than for NaxRiscv, but for sure some memories aren't being inferred as memory. I will take a look at that as well.
Found a few issues :
This reduces the gap quite a lot. Still, with a Debian-capable dual core, very relaxed timings, and retiming disabled, I get 4k more registers used (16% more) than on Artix 7. That is weird, as register usage is a fairly clean metric for comparing things. I'm working on that.
Thanks @Dolu1990 for the first analysis!
I pushed the current WIP (https://github.com/enjoy-digital/litex/pull/2011)
Good, thanks!
./intel_agilex5e_065b_premium_devkit.py --cpu-type=naxriscv --with-fpu --with-rvc --build
real 40m10.748s
user 113m37.792s
sys 1m12.900s
+----------------------------------------------------------------------------------------------+
; Fitter Resource Usage Summary ;
+-------------------------------------------------------------+------------------------+-------+
; Resource ; Usage ; % ;
+-------------------------------------------------------------+------------------------+-------+
; Usage Report Generated after: Place ; ; ;
; ; ; ;
; Logic utilization (ALMs needed / total ALMs on device) ; 30,153 / 222,400 ; 14 % ;
; ALMs needed [=A-B+C] ; 30,153 ; ;
; [A] ALMs used in final placement [=a+b+c+d] ; 34,334 / 222,400 ; 15 % ;
; [a] ALMs used for LUT logic and register circuitry ; 6,155 ; ;
; [b] ALMs used for LUT logic ; 12,787 ; ;
; [c] ALMs used for register circuitry ; 4,752 ; ;
; [d] ALMs used for memory (up to half of total ALMs) ; 10,640 ; ;
; [B] Estimate of ALMs recoverable by dense packing ; 4,342 / 222,400 ; 2 % ;
; [C] Estimate of ALMs unavailable [=a+b+c+d] ; 161 / 222,400 ; < 1 % ;
; [a] Due to location constrained logic ; 0 ; ;
; [b] Due to LAB-wide signal conflicts ; 30 ; ;
; [c] Due to LAB input limits ; 131 ; ;
; [d] Due to virtual I/Os ; 0 ; ;
; ; ; ;
; Difficulty packing design ; Low ; ;
; ; ; ;
; Total LABs: partially or completely used ; 4,073 / 22,240 ; 18 % ;
; -- Logic LABs ; 3,009 ; ;
; -- Memory LABs (up to half of total LABs) ; 1,064 ; ;
; ; ; ;
; Combinational ALUT usage for logic ; 30,003 ; ;
; -- 8 input functions ; 1,299 ; ;
; -- 7 input functions ; 131 ; ;
; -- 6 input functions ; 2,429 ; ;
; -- 5 input functions ; 6,582 ; ;
; -- 4 input functions ; 6,888 ; ;
; -- <=3 input functions ; 12,674 ; ;
; Combinational ALUT usage for route-throughs ; 2,582 ; ;
; Memory ALUT usage ; 6,675 ; ;
; -- 64-address deep ; 0 ; ;
; -- 32-address deep ; 0 ; ;
; ; ; ;
; ; ; ;
; Dedicated logic registers ; 24,900 ; ;
; -- By type: ; ; ;
; -- LAB logic registers: ; ; ;
; -- Primary logic registers ; 21,813 / 444,800 ; 5 % ;
; -- Secondary logic registers ; 1,361 / 444,800 ; < 1 % ;
; -- Hyper-Registers: ; 1,726 ; ;
; ; ; ;
; Register control circuitry for power estimation ; 20,935 ; ;
; ; ; ;
; ALMs adjustment for power estimation ; 3,081 ; ;
; ; ; ;
; I/O pins ; 63 / 624 ; 10 % ;
; -- Clock pins ; 1 / 44 ; 2 % ;
; -- Dedicated input pins ; 0 / 54 ; 0 % ;
; ; ; ;
; M20K blocks ; 129 / 1,611 ; 8 % ;
; Total MLAB memory bits ; 197,064 ; ;
; Total block memory bits ; 1,663,332 / 32,993,280 ; 5 % ;
; Total block memory implementation bits ; 2,641,920 / 32,993,280 ; 8 % ;
; ; ; ;
; DSP Blocks Needed [=A+B+C-D] ; 7 / 846 ; < 1 % ;
; [A] Total Fixed Point DSP Blocks ; 13 ; ;
; [B] Total Floating Point DSP Blocks ; 0 ; ;
; [C] Total DSP_PRIME Blocks ; 0 ; ;
; [D] Estimate of DSP Blocks recoverable by dense merging ; 6 ; ;
; ; ; ;
; IOPLLs ; 1 / 15 ; 7 % ;
; Global signals ; 2 ; ;
; LVDS_RX blocks ; 0 / 192 ; 0 % ;
; HMC blocks ; 1 / 8 ; 13 % ;
; Maximum fan-out ; 34130 ; ;
; Highest non-global fan-out ; 1259 ; ;
; Total fan-out ; 298517 ; ;
; Average fan-out ; 4.98 ; ;
+-------------------------------------------------------------+------------------------+-------+
So, ALM usage is reduced by roughly 54% (65,574 to 30,153 ALMs needed). But there is still something wrong. I will take a look.
Updated https://github.com/enjoy-digital/litex/pull/2011
NaxRiscv lost some weight :
+----------------------------------------------------------------------------------------------+
; Fitter Resource Usage Summary ;
+-------------------------------------------------------------+------------------------+-------+
; Resource ; Usage ; % ;
+-------------------------------------------------------------+------------------------+-------+
; Usage Report Generated after: Place ; ; ;
; ; ; ;
; Logic utilization (ALMs needed / total ALMs on device) ; 25,814 / 222,400 ; 12 % ;
; ALMs needed [=A-B+C] ; 25,814 ; ;
; [A] ALMs used in final placement [=a+b+c+d] ; 29,880 / 222,400 ; 13 % ;
; [a] ALMs used for LUT logic and register circuitry ; 6,330 ; ;
; [b] ALMs used for LUT logic ; 13,393 ; ;
; [c] ALMs used for register circuitry ; 4,187 ; ;
; [d] ALMs used for memory (up to half of total ALMs) ; 5,970 ; ;
; [B] Estimate of ALMs recoverable by dense packing ; 4,222 / 222,400 ; 2 % ;
; [C] Estimate of ALMs unavailable [=a+b+c+d] ; 156 / 222,400 ; < 1 % ;
; [a] Due to location constrained logic ; 0 ; ;
; [b] Due to LAB-wide signal conflicts ; 26 ; ;
; [c] Due to LAB input limits ; 130 ; ;
; [d] Due to virtual I/Os ; 0 ; ;
; ; ; ;
; Difficulty packing design ; Low ; ;
; ; ; ;
; Total LABs: partially or completely used ; 3,520 / 22,240 ; 16 % ;
; -- Logic LABs ; 2,923 ; ;
; -- Memory LABs (up to half of total LABs) ; 597 ; ;
; ; ; ;
; Combinational ALUT usage for logic ; 31,218 ; ;
; -- 8 input functions ; 1,333 ; ;
; -- 7 input functions ; 131 ; ;
; -- 6 input functions ; 2,401 ; ;
; -- 5 input functions ; 7,167 ; ;
; -- 4 input functions ; 7,191 ; ;
; -- <=3 input functions ; 12,995 ; ;
; Combinational ALUT usage for route-throughs ; 2,520 ; ;
; Memory ALUT usage ; 6,208 ; ;
; -- 64-address deep ; 0 ; ;
; -- 32-address deep ; 0 ; ;
; ; ; ;
; ; ; ;
; Dedicated logic registers ; 24,104 ; ;
; -- By type: ; ; ;
; -- LAB logic registers: ; ; ;
; -- Primary logic registers ; 21,034 / 444,800 ; 5 % ;
; -- Secondary logic registers ; 1,354 / 444,800 ; < 1 % ;
; -- Hyper-Registers: ; 1,716 ; ;
; ; ; ;
; Register control circuitry for power estimation ; 11,595 ; ;
; ; ; ;
; ALMs adjustment for power estimation ; 2,797 ; ;
; ; ; ;
; I/O pins ; 63 / 624 ; 10 % ;
; -- Clock pins ; 1 / 44 ; 2 % ;
; -- Dedicated input pins ; 0 / 54 ; 0 % ;
; ; ; ;
; M20K blocks ; 129 / 1,611 ; 8 % ;
; Total MLAB memory bits ; 182,092 ; ;
; Total block memory bits ; 1,663,336 / 32,993,280 ; 5 % ;
; Total block memory implementation bits ; 2,641,920 / 32,993,280 ; 8 % ;
; ; ; ;
; DSP Blocks Needed [=A+B+C-D] ; 7 / 846 ; < 1 % ;
; [A] Total Fixed Point DSP Blocks ; 13 ; ;
; [B] Total Floating Point DSP Blocks ; 0 ; ;
; [C] Total DSP_PRIME Blocks ; 0 ; ;
; [D] Estimate of DSP Blocks recoverable by dense merging ; 6 ; ;
; ; ; ;
; IOPLLs ; 1 / 15 ; 7 % ;
; Global signals ; 2 ; ;
; LVDS_RX blocks ; 0 / 192 ; 0 % ;
; HMC blocks ; 1 / 8 ; 13 % ;
; Maximum fan-out ; 32869 ; ;
; Highest non-global fan-out ; 1259 ; ;
; Total fan-out ; 296698 ; ;
; Average fan-out ; 4.97 ; ;
+-------------------------------------------------------------+------------------------+-------+
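For reference, the progression across the three NaxRiscv Agilex 5 reports in this thread can be tabulated with a small sketch. The numbers are copied from the "ALMs needed" and "ALMs used for memory" rows; the labels and the `reduction` helper are mine. The later builds also show more I/O pins (63 vs 5), so the designs are not strictly identical and the percentages are indicative only.

```python
def reduction(before: int, after: int) -> float:
    """Fraction of `before` that was saved when going to `after`."""
    return (before - after) / before

# (label, ALMs needed, ALMs used for memory), copied from the reports.
reports = [
    ("initial build",      65_574,    130),
    ("PR 2011, first run", 30_153, 10_640),
    ("PR 2011, updated",   25_814,  5_970),
]

baseline_alms = reports[0][1]
for label, alms, memory_alms in reports:
    saved = reduction(baseline_alms, alms)
    print(f"{label:20s} ALMs={alms:6d}  memory ALMs={memory_alms:6d}  saved={saved:6.1%}")
```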
Found a few issues :
- Memories with asynchronous read aren't being inferred as MLAB (ramstyle = "MLAB, no_rw_check" fixes it)
- By default Quartus seems to assume that read-during-write on MLAB will not happen (to improve timing), at the risk of creating a metastable design!? That seems crazy to me, if I understand it correctly. There is an option to enable, "MLAB Add Timing Constraints For Mixed-Port Feed-Through Mode Setting Don't Care", to avoid this (advanced fitter settings)
- Memories with more than 2 asynchronous read ports aren't recognized by Quartus (fixed by automatic blackboxification in SpinalHDL + Ram_1w_1ra_Generic.v)
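As an illustration of the first and third points, here is a minimal sketch of a Python helper that emits a 1-write / 1-async-read Verilog memory carrying the `ramstyle` attribute quoted above. The helper name, module name, and port names are hypothetical; this is not the `Ram_1w_1ra_Generic.v` from the PR, only a guess at what such a wrapper could look like.

```python
def ram_1w_1ra_verilog(name: str, width: int, depth: int) -> str:
    """Return Verilog text for a 1-write / 1-async-read memory (illustrative)."""
    addr_msb = max(1, (depth - 1).bit_length()) - 1
    data_msb = width - 1
    return f"""\
module {name} (
    input  wire clk,
    input  wire wr_en,
    input  wire [{addr_msb}:0] wr_addr,
    input  wire [{data_msb}:0] wr_data,
    input  wire [{addr_msb}:0] rd_addr,
    output wire [{data_msb}:0] rd_data
);
    // Attribute string taken from the thread; it asks Quartus for an MLAB.
    (* ramstyle = "MLAB, no_rw_check" *) reg [{data_msb}:0] ram_block [0:{depth - 1}];

    // Synchronous write port.
    always @(posedge clk) begin
        if (wr_en) begin
            ram_block[wr_addr] <= wr_data;
        end
    end

    // Asynchronous (combinational) read port.
    assign rd_data = ram_block[rd_addr];
endmodule
"""

if __name__ == "__main__":
    print(ram_1w_1ra_verilog("ram_1w_1ra_example", width=32, depth=32))
```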
This reduces the gap quite a lot. Still, with a Debian-capable dual core, very relaxed timings, and retiming disabled, I get 4k more registers used (16% more) than on Artix 7. That is weird, as register usage is a fairly clean metric for comparing things. I'm working on that.
@Dolu1990 This is great insight. Can you summarize what needed to change, or point to the changes you made in the repo for this? I would like to share this with our tools team.
@gsteiert Regarding the lack of MLAB inference, I had to :
That was mostly it. Another "big" change I made was to rework the NaxRiscv register renaming design : https://github.com/SpinalHDL/NaxRiscv/commit/ba63ee6dfb063e0e6b1c8da51071a85dea9f934b
The "MLAB Add Timing Constraints For Mixed-Port Feed-Through Mode Setting Don't Care" setting isn't fixed as far as I know; that would be something to handle in the Quartus project itself.
Note that the SpinalHDL blackboxification of memories will decompose the 1 write + 3 async read into 3 * blackbox(1 write + 1 async read)
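A behavioral sketch of that decomposition, assuming (as is usual for this technique) that every copy receives the same write so all read ports see identical contents. The class names are hypothetical and this models behavior only, not the generated hardware.

```python
class Ram1w1ra:
    """One write port, one asynchronous read port."""

    def __init__(self, depth: int) -> None:
        self.mem = [0] * depth

    def write(self, addr: int, data: int) -> None:
        self.mem[addr] = data

    def read(self, addr: int) -> int:
        # Asynchronous read: returns the current content immediately.
        return self.mem[addr]


class Ram1wNraSplit:
    """1 write + N async reads, built from N copies of Ram1w1ra."""

    def __init__(self, depth: int, read_ports: int = 3) -> None:
        self.banks = [Ram1w1ra(depth) for _ in range(read_ports)]

    def write(self, addr: int, data: int) -> None:
        # The single write port is broadcast to every copy.
        for bank in self.banks:
            bank.write(addr, data)

    def read(self, port: int, addr: int) -> int:
        # Each read port owns its private copy.
        return self.banks[port].read(addr)


ram = Ram1wNraSplit(depth=32, read_ports=3)
ram.write(5, 0xCAFE)
ram.write(7, 0xBEEF)
values = [ram.read(0, 5), ram.read(1, 7), ram.read(2, 5)]
print([hex(v) for v in values])
```

The cost of this approach is storage: three read ports mean three copies of the data, which is the trade made to stay within what a single small memory primitive can offer.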
More generally, when I have this Verilog :
    always @ (posedge clk) begin
        if (wr_en) begin
            ram_block[wr_addr] <= wr_data;
        end
    end
    assign rd_data = ram_block[rd_addr];
I assume that the synthesis tools will infer this as LUT-based RAM and that :
I did have some issues with the Xilinx tools in the past, where they inferred it as a block RAM (by merging the register that drives rd_addr), violating the rule :
I have to say, it is nice that NaxRiscv worked on Altera FPGA without having to debug things ^.^
I may have missed the reason why, without the ramstyle = "MLAB" tag, Quartus refused to infer things as MLAB. Fundamentally, that was the main thing making the area crazy high.
VexRiscv SMP
Intel Agilex5E board