SpinalHDL / VexRiscv

A FPGA friendly 32 bit RISC-V CPU implementation
MIT License

Can VexRISCV based SoC be made to fit on an iCE40 1k? #28

Open mithro opened 6 years ago

mithro commented 6 years ago

So, this is a pretty big challenge I think....

The iCE40 1K has the following resources;

However, there might be hope, your stats seem to indicate the following;

VexRiscv smallest (RV32I, 0.52 DMIPS/Mhz, no datapath bypass, no interrupt) ->
  iCE40      -> 81 Mhz 1130 LC

VexRiscv smallest (RV32I, 0.52 DMIPS/Mhz, no datapath bypass) ->
  iCE40      -> 71 Mhz 1278 LC

But that is obviously only the CPU.

The SoC seems to use about double the LC;

Murax interlocked stages (0.45 DMIPS/Mhz) ->
  ICE40-HX   ->  51 Mhz 2402 LC (icestorm)

MuraxFast bypassed stages (0.65 DMIPS/Mhz) ->
  ICE40-HX   ->  50 Mhz, 2787 LC  (icestorm)
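To put the numbers above side by side, here is a rough sketch of the arithmetic. The 1280 LC budget for the iCE40 1K is an assumption taken from the device datasheet and is not stated in this thread; the helper names are made up for illustration.

```python
# LC counts quoted in this thread (iCE40 synthesis results).
CPU_LC = {"smallest, no interrupt": 1130, "smallest": 1278}
SOC_LC = {"Murax": 2402, "MuraxFast": 2787}

# Assumption: 1280 logic cells on an iCE40 1K device (datasheet figure,
# not stated in the text above).
ICE40_1K_LC = 1280


def headroom(lc, budget=ICE40_1K_LC):
    """Logic cells left over (negative means the design does not fit)."""
    return budget - lc


def soc_to_cpu_ratio(soc_lc, cpu_lc):
    """How many times larger the SoC is than the bare CPU."""
    return soc_lc / cpu_lc


if __name__ == "__main__":
    for name, lc in {**CPU_LC, **SOC_LC}.items():
        print(f"{name:24s} {lc:5d} LC, headroom {headroom(lc):6d}")
    for soc, soc_lc in SOC_LC.items():
        for cpu, cpu_lc in CPU_LC.items():
            print(f"{soc} / {cpu}: {soc_to_cpu_ratio(soc_lc, cpu_lc):.2f}x")
```

Under that assumed budget, both CPU-only configurations fit (barely, for the one with interrupts), while both Murax variants are roughly twice too large.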

Thoughts?

Dolu1990 commented 6 years ago

Actually, your VexRiscv smallest no-interrupt configuration could be a little smaller if you set the earlyInjection option of the DBusSimplePlugin. It writes the data from a read access back into the pipeline in the memory stage instead of the writeBack stage. That reduces the size to 1064 LC.

Beyond that, some other optimisations only have an impact on non-iCE40 FPGAs, because on the iCE40, if you use only the LUT of an LC you lose its FF (and the reverse too).

Hope it can help ^^

Dolu1990 commented 6 years ago

It would maybe be possible to have a 4-stage RISC-V (no memory stage) and to move the register file read into the execute stage, to avoid having to pipeline the register file values to the execute stage.

It would surely benefit area on regular FPGAs, but for the iCE40, the fact that the LUT can't be reused if the FF of the LE is used nullifies many possible optimisations.
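A minimal sketch of why the LUT/FF packing constraint described above blunts register-saving optimisations. It assumes one LUT and one FF per logic cell and only gives bounds; the real count depends on how the place-and-route tool packs cells. The function name and the example numbers are hypothetical, not measurements.

```python
from typing import Tuple


def lc_bounds(luts: int, ffs: int) -> Tuple[int, int]:
    """Return (best_case, worst_case) logic cell counts.

    Best case: every FF shares a cell with a LUT, so the larger of the
    two counts decides the size.
    Worst case: no LUT/FF pair can share a cell, so the counts add up.
    """
    if luts < 0 or ffs < 0:
        raise ValueError("counts must be non-negative")
    return max(luts, ffs), luts + ffs


if __name__ == "__main__":
    # Hypothetical design: 600 LUTs and 400 FFs.
    print(lc_bounds(600, 400))
    # Removing 100 pipeline registers leaves the best case unchanged,
    # because the LUT count already dominates.
    print(lc_bounds(600, 300))
```

On an FPGA where LUT and FF usage are reported separately, dropping 100 FFs is a visible saving; in this model it only helps if those FFs were occupying cells of their own.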

Dolu1990 commented 6 years ago

Doing some experiments, by removing 1 fetch stage and moving the register file read into the execute stage, I got:

MicroNoCsr ->
  iCE40      -> 57 Mhz 999 LC   <- 999 is actually smaller than 1K XD
  Artix 7    -> 330 Mhz 537 LUT 389 FF
  Cyclone V  -> 144 Mhz 366 ALMs
  Cyclone IV -> 120 Mhz 714 LUT 314 FF

mithro commented 6 years ago

@Dolu1990 That is pretty awesome! Just need to figure out how to shove the rest of a SoC into the 1k too :-P

Dolu1990 commented 6 years ago

I made more experiments (I have not tested the RTL in simulation, but the synthesis numbers look realistic):

GenSmallestNoCsr (the previous smallest configuration) ->
  iCE40      -> 60 Mhz 1050 LC
  Artix 7    -> 355 Mhz 518 LUT 567 FF
  Cyclone V  -> 199 Mhz 331 ALMs
  Cyclone IV -> 183 Mhz 656 LUT 492 FF

MicroNoCsr is based on GenSmallestNoCsr.

MicroNoCsr removed 1 fetch stage (R1FS) ->
  iCE40      -> 57 Mhz 999 LC
  Artix 7    -> 330 Mhz 537 LUT 389 FF
  Cyclone V  -> 144 Mhz 366 ALMs
  Cyclone IV -> 120 Mhz 714 LUT 314 FF

MicroNoCsr R1FS + removed memory stage (RMS) ->
  iCE40      -> 58 Mhz 937 LC
  Artix 7    -> 349 Mhz 547 LUT 319 FF
  Cyclone V  -> 145 Mhz 369 ALMs
  Cyclone IV -> 114 Mhz 714 LUT 244 FF

MicroNoCsr R1FS + RMS + iBus externally keep last responses (IEKLR) ->
  iCE40      -> 54 Mhz 856 LC
  Artix 7    -> 343 Mhz 509 LUT 288 FF
  Cyclone V  -> 136 Mhz 343 ALMs
  Cyclone IV -> 118 Mhz 677 LUT 213 FF

MicroNoCsr R1FS + RMS + IEKLR + Only load word (OLW) ->
  iCE40      -> 59 Mhz 823 LC
  Artix 7    -> 353 Mhz 493 LUT 283 FF
  Cyclone V  -> 148 Mhz 328 ALMs
  Cyclone IV -> 112 Mhz 648 LUT 208 FF

MicroNoCsr R1FS + RMS + IEKLR + OLW + No shift instruction (NSI) ->
  iCE40      -> 61 Mhz 722 LC
  Artix 7    -> 387 Mhz 444 LUT 275 FF
  Cyclone V  -> 134 Mhz 253 ALMs
  Cyclone IV -> 116 Mhz 556 LUT 200 FF

MicroNoCsr R1FS + RMS + IEKLR + OLW + NSI + Pessimistic hazard (PH) ->
  iCE40      -> 59 Mhz 689 LC
  Artix 7    -> 418 Mhz 427 LUT 265 FF
  Cyclone V  -> 147 Mhz 241 ALMs
  Cyclone IV -> 119 Mhz 528 LUT 190 FF
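For readers who want the per-step effect on the iCE40, the LC figures above can be differenced. This is only arithmetic on the quoted synthesis numbers; the helper name is made up for illustration.

```python
# iCE40 LC counts from the results above, in the order the
# optimisations were stacked.
ICE40_STEPS = [
    ("GenSmallestNoCsr", 1050),
    ("R1FS", 999),
    ("RMS", 937),
    ("IEKLR", 856),
    ("OLW", 823),
    ("NSI", 722),
    ("PH", 689),
]


def step_savings(steps):
    """Return [(step_name, lc_saved_by_that_step), ...]."""
    return [
        (name, prev_lc - lc)
        for (_, prev_lc), (name, lc) in zip(steps, steps[1:])
    ]


if __name__ == "__main__":
    for name, saved in step_savings(ICE40_STEPS):
        print(f"{name:6s} saves {saved:4d} LC")
    total = ICE40_STEPS[0][1] - ICE40_STEPS[-1][1]
    print(f"total: {total} LC ({100 * total / ICE40_STEPS[0][1]:.1f}% of the baseline)")
```

Removing the shift instructions (NSI) and keeping the last iBus response externally (IEKLR) give the largest individual savings on the iCE40.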

Dolu1990 commented 6 years ago

Only the PH optimisation reduces the IPC.

regymm commented 3 years ago

I've just fitted the GenSmallestNoCsr into iCE40-HX1K with 2KB ROM, 4KB RAM, UART RX, and some memory-mapped GPIO for processing VT100 sequences -- VexRISCV is very promising!

The most difficult part for me was the block RAM to iBus/dBus interface, as I didn't find any detailed documentation about the buses. So it would be nice to have some examples of directly using block RAM for instruction/data (Murax's arbiter seems like too much for the smallest design).

Dolu1990 commented 3 years ago

Hmm, using a true dual-port RAM would produce the best results, I guess.

For the ibus => https://github.com/SpinalHDL/VexRiscv#ibussimpleplugin

Stream and Flow are specified in the SpinalDoc.

For the dbus => https://github.com/SpinalHDL/VexRiscv#dbussimpleplugin

It is nearly the same, except for the rsp arbitration.
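As a rough behavioural illustration of the iBus side (a Python cycle model, not RTL and not code from the VexRiscv repository): cmd is treated as a Stream (valid/ready plus a pc payload) and rsp as a Flow (valid plus an inst payload). The class name, the always-ready policy, the one-cycle read latency and the address wrap-around are all assumptions made for this sketch.

```python
class BlockRamIBusModel:
    """Cycle-level model of a synchronous-read block RAM behind an iBus.

    Assumptions (not taken from the VexRiscv sources):
      * cmd.ready is always high, since the RAM accepts one read per cycle;
      * the response becomes valid one clock after the command fires;
      * word addresses wrap around the RAM size.
    """

    def __init__(self, words):
        self.words = list(words)
        self.rsp_valid = False
        self.rsp_inst = 0

    def cmd_ready(self):
        # A block RAM with a dedicated port never needs to stall the CPU.
        return True

    def tick(self, cmd_valid, cmd_pc):
        """Advance one clock edge; return (rsp_valid, rsp_inst) after it."""
        fire = cmd_valid and self.cmd_ready()  # Stream transfer condition
        if fire:
            index = (cmd_pc >> 2) % len(self.words)  # byte to word address
            self.rsp_inst = self.words[index]        # synchronous read
        self.rsp_valid = fire                        # Flow: valid, no ready
        return self.rsp_valid, self.rsp_inst


if __name__ == "__main__":
    rom = BlockRamIBusModel([0x00000013, 0x00100093])
    print(rom.tick(True, 0))   # fetch word 0
    print(rom.tick(True, 4))   # fetch word 1
    print(rom.tick(False, 0))  # no command, so no valid response
```

Because rsp is a Flow, the CPU side cannot stall the response, which is why this model simply registers the fire condition as rsp.valid. A dBus model would additionally need to handle writes and the byte/halfword sizes.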