Congratulations and Ideas

Mecrisp commented 4 years ago

Dear Bruno,

my congratulations for squeezing an RV32I core into the Icestick !

I read your Verilog files with joy and I wish to share an idea on how to save a few more LUTs for more peripherals: Try a "one-hot" IO address decoder. You have only a few IO registers, so you can reserve one address line for each of your peripheral registers and save LUTs on comparisons with the full IO address. This also allows setting multiple IO registers at once.
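
A minimal sketch of such a one-hot decoder, assuming a two-register IO space. The module name, port names and the two registers are hypothetical and not taken from FemtoRV32 or Mecrisp-Ice:

```verilog
// Sketch of a one-hot IO write decoder (all names are hypothetical).
module io_onehot_decoder (
    input  wire       clk,
    input  wire       io_wstrb,      // high for one cycle on an IO write
    input  wire [1:0] io_addr,       // one address line per register
    input  wire [7:0] io_wdata,
    output reg  [7:0] leds,
    output reg  [7:0] uart_tx_data,
    output reg        uart_tx_start
);
    initial begin
        leds          = 8'd0;
        uart_tx_data  = 8'd0;
        uart_tx_start = 1'b0;
    end

    always @(posedge clk) begin
        uart_tx_start <= 1'b0;
        // Each select is a single AND of the write strobe with one
        // address bit, instead of a comparison with the full address.
        if (io_wstrb & io_addr[0]) leds <= io_wdata;
        if (io_wstrb & io_addr[1]) begin
            uart_tx_data  <= io_wdata;
            uart_tx_start <= 1'b1;
        end
    end
endmodule
```

A write with both address bits set updates both registers in the same cycle, which is the "multiple IO registers at once" effect mentioned above.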

You can also insert a hardware random number generator by using a ring oscillator.

Maybe you wish to check out Mecrisp-Ice from mecrisp.sourceforge.net in file mecrisp-ice-1.8/hx1k/icestorm/j1a.v for my peripheral set in use on the Icestick. Mecrisp-Ice is a Forth compiler running on a stack processor, which is a descendant of Swapforth and the J1a CPU by James Bowman. I think you can borrow a few of the ideas !

If you manage to map the SPI flash into the memory bus within the available LUTs, similar to the memory interface in Picosoc, I would be happy to officially port Mecrisp-Quintus (a RISC-V Forth which needs about 24kb flash and 4 kb RAM) to your FemtoRV32 on the Icestick.

Hats off and best wishes from Germany, Matthias

PS: Completely removing the rdRAM wire in your memory design somehow saved 20 LUTs.

BrunoLevy commented 4 years ago

Dear Matthias,

Thank you very much for your comments and ideas, I'm very glad to have some feedback.

I'll try your "one-hot" IO address : I have up to ten peripherals and 8 IO address lines only, but I can probably reorganize them, or change a bit the memory layout so that everything fits in there.
Thank you very much for the link to Mecrisp: I was aware of J1 (it was one of my starting points ! great source of inspiration, showing that CPU on IceStick is possible. I also borrowed their UART). I'm fascinated by the designs that pack so much functionality into so tiny devices.
Yes, mapping the SPI flash would be a must ! I still need to learn a lot (I was not aware that there was so much available space in it).
About rdRAM, I thought that unused signals were optimized out, good to know, thanks !

Best wishes, -- Bruno

BrunoLevy commented 4 years ago

Hi Matthias, I just switched to "one-hot" addressing mode, and yes, it saved a lot of LUTs ! Thank you very much for your comment. I'm now working on squeezing the IO space a bit by merging things (so that largest offset remains <= 1024, to be able to write to an IO using a single SW instr.).

Mecrisp commented 4 years ago

Dear Bruno,

you are welcome, I am glad these ideas were useful for you.

If you do loads and stores with an offset relative to the zero-register x0, you get quick access to two "zero pages", which split into the very low addresses (positive offset) and the very high addresses (negative offset). A nice place in the memory map for RAM and IO.

The Icestick is a rewarding target, squeezing designs into it feels like the FPGA equivalent of a sizecoding contest. You are really pushing the envelope here !

Matthias

BrunoLevy commented 4 years ago

"zero-page" sounds very 6502-ish to me :-) To me, there is still a lot of mystery about what eats-up LUTs / what saves LUTs, sometimes the behavior is very counter-intuitive, any hint / general rules for that ?

Mecrisp commented 4 years ago

Same observation here, it's difficult to predict, and details sometimes change when updating to a newer Yosys release.

Try to imagine how you would implement something in TTL logic gates with a soldering iron. The "one hot" address decoder is just "io-write and address_line[x]", one gate, one LUT. HX1K has LUTs with 4 inputs, which is a good fit. If you have comparisons with multiple bits, you likely will need more LUTs. Greater/less comparisons require a carry chain, which usually requires more logic than equal/unequal.

The document "iCE 2017-08 Technology Library" (www.latticesemi.com/view_document?document_id=52206) will give you an overview of the functions directly available on the FPGA. If you can map your logic directly to these primitives, you'll get a good resource mileage.

Additionally, always specify the widths of everything. If not specified, Verilog mandates using the widest width involved in the expression, which for example results in a logic operation carried out with full 32 bits internally, consuming LUTs, with the result only truncated afterwards to the desired output width.
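
A small sketch of the width rule, with a hypothetical module and hypothetical signal names. Both outputs compute the same value; the difference is only the width at which the expression is evaluated. Whether the synthesis tool manages to prune the unused upper bits in the first case depends on the tool and its version:

```verilog
// Sketch: unsized versus sized constants (names are hypothetical).
module width_demo (
    input  wire [3:0] a,
    input  wire [3:0] b,
    output wire [3:0] loose,
    output wire [3:0] tight
);
    // The unsized literal "1" is at least 32 bits wide, so the whole
    // expression is evaluated at that width and then truncated to 4 bits.
    assign loose = a + b + 1;

    // With a sized literal every operand is 4 bits wide, and so is the adder.
    assign tight = a + b + 4'd1;
endmodule
```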

Yosys usually removes unused parts of the logic, but I assume the memory read wire somehow was pattern matched into a standard block RAM implementation and constant folding optimisation failed therefore. Completely unused blocks are optimised away; but try removing unused logic which is connected at one end to logic which is in use.

Reordering of source lines, especially in CASEZ constructs, sometimes yields mysterious results in terms of LUT usage. I think this is because the internal ordering affects further optimisation steps during synthesis. Specify "don't care" values with "?".
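
A minimal sketch of "don't care" bits in a casez decoder. The 5-bit opcode field and its encoding are made up for illustration:

```verilog
// Sketch: casez with "?" as don't-care bits (encoding is hypothetical).
module casez_demo (
    input  wire [4:0] opcode,
    output reg  [1:0] kind
);
    always @(*) begin
        casez (opcode)
            5'b00???: kind = 2'd0;  // only the two top bits are examined
            5'b01?0?: kind = 2'd1;  // bits 2 and 0 are ignored
            5'b01?1?: kind = 2'd2;
            default:  kind = 2'd3;
        endcase
    end
endmodule
```

Every bit marked "?" is one input the decoder does not have to look at, which can matter with 4-input LUTs.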

Matthias

PS: Yes, it's the same trick as on 6502.

BrunoLevy commented 4 years ago

@Mecrisp , do you know where I can find some documentation about the SPI flash used in the IceStick ? (I tried interfacing a design from: https://github.com/smunaut/ice40-playground/blob/master/cores/spi_flash/rtl/spi_flash_reader.v without success, but I must admit I do not understand what I'm doing)

Thank you in advance, -- Bruno.

Mecrisp commented 4 years ago

Yes, it is a vanilla SPI flash chip, part number N25Q032A.

https://www.micron.com/-/media/client/global/Documents/Products/Data%20Sheet/NOR%20Flash/Serial%20NOR/N25Q/n25q_32mb_1_8v_65nm.pdf

If you are fluent in Forth, have a look at mecrisp-ice-1.8/hx1k/nucleus.fs for how to read a sector from this chip.

BrunoLevy commented 4 years ago

Thanks ! I have seen the Forth functions in mecrisp, but I do not understand Forth ! (but I'll try, looks like my Hewlett Packard calculator, stack based, push operands then operation), Any reference with a good introduction to Forth ? (I have a feeling that it will be easier than digging in micron's datasheet :-)

Mecrisp commented 4 years ago

A small intro to give you an idea and the classic introductory text:

https://jeelabs.org/article/1612b/ https://www.forth.com/starting-forth/

But I think it would be much easier for you to search for Arduino code to interface vanilla SPI flash memory chips, as they have a standard interface.

Mecrisp commented 4 years ago

Here is a better datasheet. You need the "read data" command 03 (and usually the "release power-down" command AB, which you can omit on Icestick).

https://www.winbond.com/resource-files/w25q128jv%20spi%20revc%2011162016.pdf

BrunoLevy commented 4 years ago

Thank you very much for all these links, it helped a lot ! Now we have mapped IO to read the SPI flash. Coming next: memory interface with address valid<->RAM ready handshaking, to be able to directly execute code from there (hope it won't eat up too many LUTs...)

Mecrisp commented 4 years ago

I found another place to save a few LUTs:

        3'b100: out = ($signed(in1) < $signed(in2));  // BLT
        3'b101: out = ($signed(in1) >= $signed(in2)); // BGE
        3'b110: out = (in1 < in2);                    // BLTU
        3'b111: out = (in1 >= in2);                   // BGEU

Each of these comparisons requires a 32/33 bit subtraction, but all conditions can be generated by using one subtraction only:

  wire [16:0] minus = {1'b1, ~st0} + st1 + 1;

  wire signedless = st0[15] ^ st1[15] ? st1[15] : minus[16];
  wire unsignedless = minus[16];
  wire zeroflag = minus[15:0] == 0;

      9'b0_011_00111: st0N = {16{zeroflag}};                        //  =
      9'b0_011_01000: st0N = {16{signedless}};                      //  <
      9'b0_011_01100: st0N = minus[15:0];                           //  -
      9'b0_011_01111: st0N = {16{unsignedless}};                    // u<

You get the idea :-)
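
The snippet above is the 16-bit J1a version. A sketch of how the same idea could look for the 32-bit branch predicates, assuming the in1/in2 names from the BLT/BGE code and the standard RISC-V funct3 branch encoding; the module name is hypothetical and this is not necessarily how FemtoRV32 ended up doing it:

```verilog
// Sketch: all six branch conditions from one 33-bit subtraction.
module branch_predicates (
    input  wire [31:0] in1,
    input  wire [31:0] in2,
    input  wire [2:0]  funct3,
    output reg         out
);
    // in1 - in2 computed on 33 bits; bit 32 is set when in1 < in2 (unsigned).
    wire [32:0] minus = {1'b1, ~in2} + {1'b0, in1} + 33'd1;

    wire LTU = minus[32];
    // If the signs differ, the negative operand is the smaller one,
    // otherwise the unsigned result is also valid for signed values.
    wire LT  = (in1[31] ^ in2[31]) ? in1[31] : minus[32];
    wire EQ  = (minus[31:0] == 32'd0);

    always @(*) begin
        case (funct3)
            3'b000:  out =  EQ;   // BEQ
            3'b001:  out = !EQ;   // BNE
            3'b100:  out =  LT;   // BLT
            3'b101:  out = !LT;   // BGE
            3'b110:  out =  LTU;  // BLTU
            3'b111:  out = !LTU;  // BGEU
            default: out = 1'b0;
        endcase
    end
endmodule
```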

BrunoLevy commented 4 years ago

Let's try that ! (smart and crazy at the same time, I love it !). I'm pretty sure I won't get it right the first time though... (I'm always confused when handling signed quantities)

Mecrisp commented 4 years ago

Hey, thanks ! Have fun !

Mecrisp commented 4 years ago

I usually need a few tries for properly handling signed values, too. But there are maps into these mostly uncharted territories:

If you like tricks like these, I wish to recommend you this

https://graphics.stanford.edu/~seander/bithacks.html

and the book "Hacker's Delight" by Warren. It's full of small tricks which are very useful for compiler writers and processor designers.

https://en.wikipedia.org/wiki/Hacker's_Delight

BrunoLevy commented 4 years ago

Just tried your elegant trick for the branch predicates, and made it work, however it uses 37 more LUTs (???). LUT golfing is something between art and sorcery it seems ! Still looking for subtracts that I could "factor" in the design, it seems that there is a couple of them...

(BTW, thank you for the two links, excellent !!)

mithro commented 4 years ago

I also want to say congratulations on making a RISC-V that fits on the icestick! That is super impressive. If you would like to port your work to run on the Fomu (https://fomu.im) which has an iCE40UP5K, send me an email to me@mith.ro and I'll send you some!

Have you tried playing with Yosys settings? There are a lot of options you can tell Yosys to give to ABC to change the area versus frequency trade-offs.

Have you looked at serv from @olofk -- it is a bit serial based RISC-V implementation and @olofk has been slowly trimming the core down -- see http://corescore.store/

It might also be interesting to integrate your RISC-V core into the LiteX environment. It already supports quite a number of different RISC-V (and other architecture) cores. See

BrunoLevy commented 4 years ago

@mecrisp, I am completely confused with what takes LUT and what does not: Cleaning up a bit the implementation, I wanted to use parameters and 'generate' statement instead of macros, and just adding this parameter to the ALU and without changing anything else eats up 100 LUTs !?!??

module NrvALU #( parameter [0:0] TWOSTAGE_SHIFTER = 0
) ( input clk, input [31:0] in1, input [31: ...

Mecrisp commented 4 years ago

@mithro

"lot of options you can tell Yosys to give to ABC"

I tried adding "-abc2 -relut" to the synth_ice40 command, according to this documentation:

http://www.clifford.at/yosys/cmd_synth_ice40.html

On my project, Mecrisp-Ice 1.8, it improved from 1273 to 1224 ICESTORM_LC for HX1K.

But this is not "lot of options". I am surely missing something. Could you please point me to a configuration for synthesis with aggressive optimisation for size ?

Mecrisp commented 4 years ago

Hi Bruno,

I am sorry I cannot give more guidance. It's a quite erratic random walk for me also. Could you please add your code with the changed branch predicates for me to try a few things ?

Matthias

BrunoLevy commented 4 years ago

Hi Matthias,

I've pushed the code, you can activate it by uncommenting the following line in femtosoc.v:

`define NRV_TRY_COMPACT_PREDICATES

If this works, we can probably play the same trick in the ALU, that computes in1-in2, signed comparison and unsigned comparison. (on my side, I'll try playing with abc flags, thank you for the link !)

-- B

Mecrisp commented 4 years ago

The original version as-is weighs in at 1332 LUTs here and doesn't fit on Icestick. Then I activated the NRV_TRY_COMPACT_PREDICATES and LUT usage dropped to 1262. Further adding "-abc2 -relut" to Yosys gave 1265 LUTs.

I am currently using

yosys --version
Yosys 0.9+2406 (git sha1 UNKNOWN, clang 7.0.1-8 -fPIC -Os)

It seems as if this varies a lot with Yosys revisions.

Can you use the same COMPACT_PREDICATES wires for sub, slt and sltu opcodes, too ? I am not sure what the aluInSel1 and aluInSel2 wires do when executing branches.

BrunoLevy commented 4 years ago

Weird, on my side it always increases LUT count, and I have the same version of YOSYS (but compiled with a different CLANG):

yosys --version
Yosys 0.9+2406 (git sha1 UNKNOWN, clang 9.0.1-12 -fPIC -Os)

Note: sometimes the order of things / name of things change the LUT count, and different compilers may order things differently (C++ std::map, std::set etc...), we observed that already. I'm keeping the option for now, and will add a similar option for the ALU.

-- B

P.S. Which devices did you activate ? Did you activate NRV_TWOSTAGE_SHIFTER as well ?

Mecrisp commented 4 years ago

I took your code as-is and activated NRV_TRY_COMPACT_PREDICATES only.

Mecrisp commented 4 years ago

With NRV_TWOSTAGE_SHIFTER activated as well along with NRV_TRY_COMPACT_PREDICATES, I get 1246 LUTs (with -abc2 -relut) or 1261 LUTs (without).

BrunoLevy commented 4 years ago

On my side here is what it gives (so much difference !)

NRV_TRY_COMPACT_PREDICATES   NRV_TWOSTAGE_SHIFTER   LUTs
OFF                          OFF                    1227
OFF                          ON                     1285
ON                           OFF                    1264
ON                           ON                     1339

Mecrisp commented 4 years ago

Oh no ! I hope "mithro" can comment on this.

One more idea to try might be to merge branches & alu opcodes and use a dedicated adder for PC instead. This would save three subtractions for sub, slt and sltu and add one addition for PC handling. But I have no clue on the total effect on LUT usage, as multiplexers need gates, too.

Mecrisp commented 4 years ago

I am quite sure we are using different minor revisions of Yosys, even though they report the same 0.9+2406 version string. Using a different clang to compile Yosys should not alter its algorithms.

Mecrisp commented 4 years ago

@mithro I am missing a human-readable output which tells how resources are distributed over the design, to give better feedback for manual optimisation.

BrunoLevy commented 4 years ago

"Using a different clang to compile YOSYS should not alter its algorithms" It should not, but I'm pretty sure it does ! Some explanations: if you use a C++ std::set, depending on the version of the compiler and libc++, the order of the elements in the set may be different, and it seems that YOSYS is quite sensitive to that. I observed that changing the names of some regs and wires gives a completely different LUT count !

Mecrisp commented 4 years ago

Perhaps file an issue for Yosys ? When changing names of signals changes the LUT count, this is at least "unexpected behaviour". I have not observed this before and tried to exchange a few names in femtorv32.v.

On my side, I replaced "imm" by "dergrosselangezwischenwertdergleichallesaendernwird" and "writeBackEn" by "wbe". I found that changing the names of wires did slightly change the reported time necessary for compilation, but the md5sum of the resulting bitstream stayed the same.

BrunoLevy commented 4 years ago

@Mecrisp,

Just pushed a new version with your trick applied to both the ALU and branch predicates, it goes from 1227 to 1159 LUTs (this time, when the optimization for the ALU is present, the optimization for the branch predicates saves LUTs as well). Many mysteries ... But now, core + SPI Flash controller + oled controller + two-level shifter fit for the first time, then it takes 1242 LUTs, maybe enough room for implementing new memory interface / execution from the SPI .
The subtract can probably be factorized between the ALU and branch predicate unit. This may require some datapath redesign though, probably merging the branch predicate unit into the ALU, and pulling-out more signals from the ALU, I'll think about it.
About filing a bug to YOSYS, yes, I need to create a minimal example with the behavior clearly identified. I also wanted to make something cleaner than my macros, and started to have #( parameter [0:0] xxxx ) in the design, but when I do that it increases LUT count ! Need also to create a minimal example.

mithro commented 4 years ago

Couple of tips about Yosys.

You should always be using Yosys from git -- It is a fast moving target and gets better all the time.
The #yosys IRC channel on irc.freenode.net is probably a good place to ask questions about Yosys behaviour and tips for reducing LUT count.

mithro commented 4 years ago

BTW - Have you discovered how to get Yosys to write out images of the netlist? Matthew Venn has a lot of good tips at https://www.mattvenn.net/2017/11/05/beginning-fpga/ and some youtube videos around too.

mithro commented 4 years ago

http://www.clifford.at/yosys/files/yosys_appnote_011_design_investigation.pdf

BrunoLevy commented 4 years ago

@mithro Yes I updated my YOSYS a couple of days ago (before then I was using the one bundled with Debian, but it was too old to talk to my ECP5 so I recompiled it from the sources). About images of the netlist, I'm afraid they will be too large to be understandable (at least by me !), but I'll give it a try. Best, -- B

BrunoLevy commented 4 years ago

@Mecrisp

My 'hi-score' (or I should say 'lo-score') is 1111 LUTs (with the same config. as this afternoon). Except for the two COMPACT_ALU and COMPACT_PREDICATES options, my other changes should not have impacted the LUT count (but they did).
Finally I could not merge the two subtractions in ALU and predicates. It would not be impossible, but would make the design too messy (and one of my goal is to have a design that is easy to understand). My plan was to merge them, but I need them simultaneously (for branches, to compute the branch condition and branch target).

Mecrisp commented 4 years ago

At the weekend I was busy with picking cherries and cooking jam, but now I am back: My congratulations for your current low-score !

For your quest to execute from SPI flash, you'll need to also signal read on instruction fetch, and to add a wait input line to the CPU while SPI is busy fetching the requested word.
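
A rough sketch of what such a wait input could look like on the fetch side. All names here are hypothetical, and it assumes that the memory raises mem_busy in the cycle after the read strobe and keeps it high until mem_rdata is valid; the real FemtoRV32 state machine has more states than this:

```verilog
// Sketch: instruction fetch that stalls while the memory is busy.
module fetch_wait_sketch (
    input  wire        clk,
    input  wire        reset,
    input  wire        mem_busy,    // high while the SPI flash is still reading
    input  wire [31:0] mem_rdata,
    output wire        mem_rstrb,   // read strobe, raised on instruction fetch
    output reg  [31:0] instr,
    output reg         instr_valid  // one-cycle pulse when instr is updated
);
    localparam FETCH = 1'b0, WAIT = 1'b1;
    reg state;

    assign mem_rstrb = !reset && (state == FETCH);

    always @(posedge clk) begin
        if (reset) begin
            state       <= FETCH;
            instr_valid <= 1'b0;
        end else begin
            instr_valid <= 1'b0;
            case (state)
                FETCH: state <= WAIT;      // request issued, go wait for it
                WAIT: if (!mem_busy) begin // stay here as long as SPI is busy
                    instr       <= mem_rdata;
                    instr_valid <= 1'b1;
                    state       <= FETCH;
                end
            endcase
        end
    end
endmodule
```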

BrunoLevy commented 4 years ago

Working on that ... (requires a full redesign of the FSM, will take me a while).

julian1 commented 3 years ago

Hi, I just saw this project, which looks amazing!.

I did a little bit of verilog on the icestick, using yosys and icestorm a few years ago, but never got as far as softcores.

   always @(*) begin
      (* parallel_case, full_case *)
      case(op)
    3'b000: out = opqual ? minus[31:0] : in1 + in2;  // ADD/SUB
    3'b010: out = LT ;                               // SLT

Do I understand correctly, that the ALU addition is non-pipelined, and runs in a single clock cycle (less instruction decoding). And that it runs at 80MHz on the icestick when meeting timing constraints?

I have a very simple assembly level program, that only requires a fast add. So I am wondering if perhaps femtoRV32 could work, and without the extra complexity and overhead of a traditional mcu?

Mecrisp commented 3 years ago

I am glad to see this project flourish, and the tutorial section is really nice. Is the idea of allowing execution from SPI flash still on your radar ?

BrunoLevy commented 3 years ago

Yes, it is clearly one of the big items in my todo list (it is mentioned at the end of the design tutorial). It is still not done because it requires some drastic changes in both the FSM and the memory interface, with a memready signal. I'm still hesitating about the memory protocol to use, probably Claire Wolf's in PicoRV32, but the problem is that it needs a comparator to see when the required address changes. I know it would be a great thing to have, it would give us virtually unlimited program storage ! Once this works, another thing to do will be to activate faster 4-wires SPI modes, but I'm unsure I'll have enough LUTs for that...

BrunoLevy commented 3 years ago

@julian1, yes it is reasonable to do a 32 bits addition in one cycle on the IceStick. From the datasheet on Lattice website, it is said that HX devices (such as the ICE40HX1K of the IceStick) can do 16-bits adds at 220 MHz ! (it is not indicated in the datasheet, but I'd naively guess that this drops to 100 MHz for 32-bits adds). About maxfreq, it is validated at 50MHz (80MHz is "overclocking", but it is experimentally stable). I guess that by inserting some registers at the right location I could validate it somewhere between 50MHz and 80MHz (I tried without success).

Mecrisp commented 3 years ago

I found you mentioned me in the tutorial for the trick to unify subtraction/comparison. Thank you for that :-)

I am not sure if you like the idea, but I think you could save LUTs in Icestick if you remove the "error" condition, and assume opcodes always being valid.

7'b0110111: begin // LUI
7'b0010111: begin // AUIPC
7'b1101111: begin // JAL
7'b1100111: begin // JALR
7'b1100011: begin // Branch
7'b0010011: begin // ALU operation: Register,Immediate
7'b0110011: begin // ALU operation: Register,Register
7'b0000011: begin // Load
7'b0100011: begin // Store

You could completely remove the comparison for bit 0 and 1 (Long/Compressed Opcode), always 2'b11 in your case, and perhaps use bit 3 to distinguish between JAL and JALR only, which are similar.

7'b1101111: begin // JAL
   inRegId1Sel = 1'bx; // reg 1 Id : don't care (we use PC)
   aluInSel1 = 1'b1;   // ALU source 1 = PC
   imm = Jimm;         // imm format = J
end

7'b1100111: begin // JALR
   aluInSel1 = 1'b0;   // ALU source 1 = reg
   imm = Iimm;         // imm format = I
end

The differences boil down to aluInSel1 = instr[3] and imm = instr[3] ? Jimm : Iimm;

BrunoLevy commented 3 years ago

@mecrisp Yes great idea ! It makes perfect sense to me ! (reminds me of the 6502 design with hand written optimized instruction decoder with its pair of undefined instructions) I think I'll add that as an option, and implement it as an alternative version of the instruction decoder, instanced depending on the value of the parameter. For now I like to have the 'safe' one, that triggers a blinky each time I have a hardware or software bug (or both !!)

BrunoLevy commented 3 years ago

I've just tried, did not obtain any gain (but unsure whether I tried right !). My interpretation (may be completely wrong): ICE40 LUTs have 4 inputs, so 7 bits instr decoder uses two of them per instructions, but 5 bits instr decoder also needs two of them ! ... or maybe YOSYS already makes optimizations...

Mecrisp commented 3 years ago

Yosys might be fantastically clever. Could you please post your new instruction decoder snippet, for me to experiment with ?

olofk commented 3 years ago

Sometimes the optimizations just don't turn out as good as you hoped. But there are always other things to try :)

BrunoLevy commented 3 years ago

I've just pushed it (FemtoRV/RTL/mini_decoder.v). To test it, you need to edit FemtoRV/RTL/femtorv32.v, Line 43, replace `include "decoder.v" with `include "mini_decoder.v" (note that all the verilog files are now in a RTL subdirectory, I did that because I split femtorv32.v into several files, to ease testing drop-in replacements of some of the components).

mithro commented 3 years ago

You might find http://www.fpgacpu.org/papers/xsoc-series-drafts.pdf interesting?

BrunoLevy / learn-fpga

Congratulations and Ideas #1