BrunoLevy / learn-fpga

Learning FPGA, yosys, nextpnr, and RISC-V
BSD 3-Clause "New" or "Revised" License

Congratulations and Ideas #1

Open Mecrisp opened 4 years ago

Mecrisp commented 4 years ago

Dear Bruno,

my congratulations on squeezing an RV32I core into the Icestick !

I read your Verilog files with joy and I wish to share an idea on how to save a few more LUTs for more peripherals: Try a "one-hot" IO address decoder. You have only a few IO registers, so you can reserve one address line for each of your peripheral registers and save LUTs on comparisons with the full IO address. This also allows setting multiple IO registers at once.
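A minimal sketch of the one-hot idea, written as a small Python behavioral model rather than Verilog. The register offsets (IO_LEDS = 4, IO_OLED_CMD = 16, IO_UART_DATA = 128) are borrowed from the assembler listings further down in this thread; the function names are made up for illustration and are not part of FemtoRV32.

```python
# Behavioral model of a one-hot IO address decoder (illustrative only,
# not the FemtoRV32 RTL). Each register owns one address bit, so the
# "is this register selected?" test looks at a single bit instead of
# comparing the full IO address.

IO_REGS = {            # offsets as used in the assembler listings of this thread
    "IO_LEDS": 4,
    "IO_OLED_CMD": 16,
    "IO_UART_DATA": 128,
}

def selected_one_hot(offset):
    """Return the registers whose select bit is set in the IO offset."""
    return sorted(name for name, bit in IO_REGS.items() if offset & bit)

def selected_full_compare(offset):
    """Conventional decoder for comparison: the whole offset must match."""
    return sorted(name for name, addr in IO_REGS.items() if offset == addr)

if __name__ == "__main__":
    print(selected_one_hot(4))             # one register
    print(selected_one_hot(4 | 128))       # two registers hit by one store
    print(selected_full_compare(4 | 128))  # full compare matches nothing
```

With the one-hot scheme, a store to offset 4 | 128 reaches both registers at once, which is the "set multiple IO registers" side effect mentioned above.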

You can also insert a hardware random number generator by using a ring oscillator.

Maybe you wish to check out Mecrisp-Ice from mecrisp.sourceforge.net in file mecrisp-ice-1.8/hx1k/icestorm/j1a.v for my peripheral set in use on the Icestick. Mecrisp-Ice is a Forth compiler running on a stack processor, which is a descendant of Swapforth and the J1a CPU by James Bowman. I think you can borrow a few of the ideas !

If you manage to map the SPI flash into the memory bus within the available LUTs, similar to the memory interface in Picosoc, I would be happy to officially port Mecrisp-Quintus (a RISC-V Forth which needs about 24 kb flash and 4 kb RAM) to your FemtoRV32 on the Icestick.

Hats off and best wishes from Germany, Matthias

PS: Completely removing the rdRAM wire in your memory design somehow saved 20 LUTs.

BrunoLevy commented 3 years ago

Thanks !

BrunoLevy commented 3 years ago

It is a great article ! Very inspiring.

Mecrisp commented 3 years ago

I tried to compare LUTs for decoder.v with mini_decoder.v as-is and found it to save 21 LUTs with my (old) version of Yosys. I changed it a little bit more, using casez and ? as attached with a result of 25 LUTs saved.


/********************* Instruction decoder *******************************/
/* A drop-in replacement of the instruction decoder, meant to further    */
/* reduce LUT count by not checking for errors (but no success for now)  */

module NrvDecoder(
    input wire [31:0] instr,
    output wire [4:0] writeBackRegId,
    output reg        writeBackEn,
    output reg [3:0]  writeBackSel, // 0001: ALU  0010: PC+4  0100: RAM 1000: counters
                            // (could use 2 wires instead, but using 4 wires (1-hot encoding)
                            //  reduces both LUT count and critical path in the end !)
    output wire [4:0] inRegId1,
    output wire [4:0] inRegId2,
    output reg        aluSel, // 0: force aluOp,aluQual to zero (ADD)  1: use aluOp,aluQual from instr field
    output reg        aluInSel1, // 0: reg  1: pc
    output reg        aluInSel2, // 0: reg  1: imm
    output [2:0]      aluOp,
    output reg        aluQual,
    output wire       aluM, // Asserted if operation is an RV32M operation
    output reg        isLoad,
    output reg        isStore,
    output reg        needWaitALU,
    output reg [2:0]  nextPCSel, // 001: PC+4  010: ALU  100: (predicate ? ALU : PC+4)
                         // (same as writeBackSel, 1-hot encoding)
    output reg [31:0] imm,
    output wire       error
);

   assign error = 1'b0; // We do not check for errors in the MiniDecoder.
   assign aluM  = 1'b0; // MiniDecoder only works for RV32I

   reg inRegId1Sel; // 0: force inRegId1 to zero 1: use inRegId1 instr field

   assign writeBackRegId = instr[11:7];
   assign inRegId1       = instr[19:15] & {5{inRegId1Sel}}; // Internal sig InRegId1Sel used to force zero in reg1
   assign inRegId2       = instr[24:20];             // (because I'm making maximum reuse of the adder of the ALU)
   assign aluOp          = instr[14:12];

   wire [31:0] Iimm = {{21{instr[31]}}, instr[30:20]};
   wire [31:0] Simm = {{21{instr[31]}}, instr[30:25], instr[11:7]};
   wire [31:0] Bimm = {{20{instr[31]}}, instr[7], instr[30:25], instr[11:8], 1'b0};
   wire [31:0] Jimm = {{12{instr[31]}}, instr[19:12], instr[20], instr[30:21], 1'b0};
   wire [31:0] Uimm = {instr[31], instr[30:12], {12{1'b0}}};

   // The rest of instruction decoding, for the following signals:
   // writeBackEn
   // writeBackSel   0001: ALU  0010: PC+4 0100: RAM 1000: counters
   // inRegId1Sel    0: zero   1: regId
   // aluInSel1      0: reg    1: PC
   // aluInSel2      0: reg    1: imm
   // aluQual        +/- SRLI/SRAI
   // aluM           1 if instr is RV32M
   // aluSel         0: force aluOp,aluQual=00  1: use aluOp/aluQual
   // nextPCSel      001: PC+4  010: ALU   100: (pred ? ALU : PC+4)
   // imm (select one of Iimm,Simm,Bimm,Jimm,Uimm)

   // We need to distinguish shifts for two reasons:
   //  - We need to wait for ALU when it is a shift
   //  - For ALU ops with immediates, aluQual is 0, except
   //    for shifts (then it is instr[30]).
   wire aluOpIsShift = (aluOp == 3'b001) || (aluOp == 3'b101);

   always @(*) begin

       nextPCSel = 3'b001; // default: PC <- PC+4
       inRegId1Sel = 1'b1; // reg 1 Id from instr
       isLoad = 1'b0;
       isStore = 1'b0;
       aluQual = 1'b0;
       needWaitALU = 1'b0;

       (* parallel_case, full_case *)
       casez(instr[6:2])
       5'b011?1: begin // LUI
          writeBackEn  = 1'b1;    // enable write back
          writeBackSel = 4'b0001; // write back source = ALU
          inRegId1Sel = 1'b0;     // reg 1 Id = 0
          aluInSel1 = 1'b0;       // ALU source 1 = reg
          aluInSel2 = 1'b1;       // ALU source 2 = imm
          aluSel = 1'b0;          // ALU op = ADD
          imm = Uimm;             // imm format = U
       end

       5'b001?1: begin // AUIPC
          writeBackEn  = 1'b1;    // enable write back
          writeBackSel = 4'b0001; // write back source = ALU
          inRegId1Sel = 1'bx;     // reg 1 Id : don't care (we use PC)
          aluInSel1 = 1'b1;       // ALU source 1 = PC
          aluInSel2 = 1'b1;       // ALU source 2 = imm
          aluSel = 1'b0;          // ALU op = ADD
          imm = Uimm;             // imm format = U
       end

       5'b11011: begin // JAL
          writeBackEn  = 1'b1;    // enable write back
          writeBackSel = 4'b0010; // write back source = PC+4
          inRegId1Sel = 1'bx;     // reg 1 Id : don't care (we use PC)
          aluInSel1 = 1'b1;       // ALU source 1 = PC
          aluInSel2 = 1'b1;       // ALU source 2 = imm
          aluSel = 1'b0;          // ALU op = ADD
          nextPCSel = 3'b010;     // PC <- ALU
          imm = Jimm;             // imm format = J
       end

       5'b11001: begin // JALR
          writeBackEn  = 1'b1;    // enable write back
          writeBackSel = 4'b0010; // write back source = PC+4
          aluInSel1 = 1'b0;       // ALU source 1 = reg
          aluInSel2 = 1'b1;       // ALU source 2 = imm
          aluSel = 1'b0;          // ALU op = ADD
          nextPCSel = 3'b010;     // PC <- ALU
          imm = Iimm;             // imm format = I
       end

       5'b110?0: begin // Branch
          writeBackEn = 1'b0;     // disable write back
          writeBackSel = 4'bxxxx; // write back source = don't care
          aluInSel1 = 1'b1;       // ALU source 1 : PC
          aluInSel2 = 1'b1;       // ALU source 2 : imm
          aluSel = 1'b0;          // ALU op = ADD
          nextPCSel = 3'b100;     // PC <- pred ? ALU : PC+4
          imm = Bimm;             // imm format = B
       end

       5'b001?0: begin // ALU operation: Register,Immediate
          writeBackEn = 1'b1;     // enable write back
          writeBackSel = 4'b0001; // write back source = ALU
          aluInSel1 = 1'b0;       // ALU source 1 : reg
          aluInSel2 = 1'b1;       // ALU source 2 : imm
                                  // Qualifier for ALU op: SRLI/SRAI
          aluQual = aluOpIsShift ? instr[30] : 1'b0;
          needWaitALU = aluOpIsShift;
          aluSel = 1'b1;         // ALU op : from instr
          imm = Iimm;            // imm format = I
       end

       5'b011?0: begin // ALU operation: Register,Register
          writeBackEn = 1'b1;     // enable write back
          writeBackSel = 4'b0001; // write back source = ALU
          aluInSel1 = 1'b0;       // ALU source 1 : reg
          aluInSel2 = 1'b0;       // ALU source 2 : reg
          aluQual = instr[30];    // Qualifier for ALU op: +/- SRL/SRA
          aluSel = 1'b1;          // ALU op : from instr
          needWaitALU = aluOpIsShift;
          imm = 32'bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx; // don't care
       end

       5'b000?0: begin // Load
          writeBackEn = 1'b1;     // enable write back
          writeBackSel = 4'b0100; // write back source = RAM
          aluInSel1 = 1'b0;       // ALU source 1 = reg
          aluInSel2 = 1'b1;       // ALU source 2 = imm
          aluSel = 1'b0;          // ALU op = ADD
          imm = Iimm;             // imm format = I
          isLoad = 1'b1;
       end

       5'b010?0: begin // Store
          writeBackEn = 1'b0;     // disable write back
          writeBackSel = 4'bxxxx; // write back sel = don't care
          aluInSel1 = 1'b0;       // ALU source 1 = reg
          aluInSel2 = 1'b1;       // ALU source 2 = imm
          aluSel = 1'b0;          // ALU op = ADD
          imm = Simm;             // imm format = S
          isStore = 1'b1;
       end

       default: begin
          writeBackEn = 1'b0;
          writeBackSel = 4'bxxxx;
          inRegId1Sel = 1'bx;
          aluInSel1 = 1'bx;
          aluInSel2 = 1'bx;
          aluSel = 1'bx;
          nextPCSel = 3'bxxx;
          imm = 32'bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx;
       end
       endcase
   end

endmodule
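A quick way to sanity-check the casez patterns is to model them in Python. This is a sketch under the assumption that only instr[6:2] is inspected, as in the listing above; the pattern table is copied from the case labels, and the helper names are hypothetical.

```python
# Model of the casez patterns applied to instr[6:2] in the decoder above.
# '?' is a don't-care bit, as in Verilog casez.

PATTERNS = [
    ("011?1", "LUI"),
    ("001?1", "AUIPC"),
    ("11011", "JAL"),
    ("11001", "JALR"),
    ("110?0", "Branch"),
    ("001?0", "ALU reg,imm"),
    ("011?0", "ALU reg,reg"),
    ("000?0", "Load"),
    ("010?0", "Store"),
]

def matches(pattern, value):
    """True if the 5-bit value matches the pattern, ignoring '?' positions."""
    bits = format(value, "05b")
    return all(p in ("?", b) for p, b in zip(pattern, bits))

def decode(instr):
    """Return the names of all patterns that match instr[6:2]."""
    field = (instr >> 2) & 0x1F
    return [name for pattern, name in PATTERNS if matches(pattern, field)]

if __name__ == "__main__":
    # Standard RV32I major opcodes (low 7 bits of the instruction word).
    for opcode in (0x37, 0x17, 0x6F, 0x67, 0x63, 0x13, 0x33, 0x03, 0x23):
        print(hex(opcode), decode(opcode))
```

In this model each of the nine RV32I major opcodes hits exactly one pattern, and no value of instr[6:2] hits two, which is consistent with the parallel_case attribute on the case statement.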

I can build the Verilog part easily, but today I tried to build the Firmware for the first time, and have a few observations:

My own attempt, however, faults immediately (with decoder.v):

.section .text

#################################################################################

# Mapped IO constants

.equ IO_BASE,      0x400000  # Base address of memory-mapped IO
.equ IO_LEDS,      4         # 4 LSBs mapped to D1,D2,D3,D4
.equ IO_OLED_CNTL, 8         # OLED display control.
                             #  wr: 01: reset low 11: reset high 00: normal operation
                             #  rd:  0: ready  1: busy
.equ IO_OLED_CMD,  16        # OLED display command. Only 8 LSBs used.
.equ IO_OLED_DATA, 32        # OLED display data. Only 8 LSBs used.
.equ IO_UART_CNTL, 64        # USB UART control. busy (bit 9), data ready (bit 8)
.equ IO_UART_DATA, 128       # USB UART data (read/write)
.equ IO_LEDMTX_CNTL, 256     # LED matrix control. read: LSB bit 1 if busy
.equ IO_LEDMTX_DATA, 512     # LED matrix data (write)

################################################################################

   li x1, IO_BASE
   li x2, 0

1: sw x2, IO_LEDS(x1)
   addi x2, x2, 1

   li x3, 0xFFFFF
2: addi x3, x3, -1
   bne x3, zero, 2b

   j 1b

Memmap:


MEMORY
{
   rom(RX)   : ORIGIN = 0x00000000, LENGTH = 0x400
}

SECTIONS
{
   .text : { *(.text*) } > rom
}

Commands to assemble:

riscv64-linux-gnu-as blinky.s -o blinky.o -march=rv32i
riscv64-linux-gnu-ld -o blinky.elf -T memmap blinky.o -m elf32lriscv
riscv64-linux-gnu-objdump -Mnumeric -D blinky.elf > blinky.list
riscv64-linux-gnu-objcopy blinky.elf blinky.hex -O verilog

Mecrisp commented 3 years ago

Wow ! Nice progress ! On the peripheral side, I recommend adding a simple GPIO port with IN, OUT and DIR registers to the mix for the tutorial.

BrunoLevy commented 3 years ago

Mecrisp commented 3 years ago

I am looking forward to that ! Yes, your new modular design is much more understandable, a very good implementation for experiments and teaching the fundamentals. Good luck for the next steps, and: Joyeux Noël !

BrunoLevy commented 3 years ago

Hi Matthias, Joyeux Noël to you too ! Mapped memory interface for the SPI flash is functional, execute from SPI will come next (need to insert a couple of 'wait for SPI' states in the FSM).

whatnick commented 3 years ago

Merry Christmas and Happy new year. I am starting to take notes on running the femtorv32 on the ice40-feather. Will make a tutorial PR when ready. Adopting the feather eco-system gives access to lots of tried and tested peripherals and has many more users at hobby level than the PMOD ecosystem. Hopefully I can get somewhere useful.

BrunoLevy commented 3 years ago

Hi Matthias, Run from SPI flash seems to work ! To test it:

1) edit RTL/femtosoc_config.v and uncomment the following lines:

`define NRV_MAPPED_SPI_FLASH (but NRV_IO_SPI_FLASH should be commented out)
`define NRV_RUN_FROM_SPI
`define NRV_MINIRV32 (for now, run from SPI is only implemented for the new mini-femtorv32 core, which has a simpler FSM)

2) the SPI flash starting at address 1M is mapped at address 0x80000, so to test it:

3) compute a firmware that jumps to the mapped SPI flash

4) let's rock'n'roll !

Notes: The way I'm generating the .bin file is not correct ! (crt0.S is copied one more time, and the linker does not know it is going to go at address 0x80000). It is OK because with a blinky, the code is relocatable, but for compiling the Forth interpreter, we will need a correct linker script that puts the code at address 0x80000 and leaves the rest in the RAM starting at address 0 (but maybe you already have something like that for J1).

BrunoLevy commented 3 years ago

Oops, wait a minute, it seems I made some mistakes (it is 0x800000, not 0x80000), but I'm jumping to 0x80000 and it blinks. That is not normal, it should not ! Need to understand what's going on... Will come back shortly with more news.

BrunoLevy commented 3 years ago

Works also with 0x800000, I pushed the files, so that you can test if you want (now I need to understand why it worked also with 0x80000, maybe my 1-hot address encoding makes it possible, need to understand).

Mecrisp commented 3 years ago

Hi Bruno,

thank you for the large effort to get this up and running !

Now on to try your achievement:

I put a "blank" memory image into firmware.hex and activated both

`define NRV_MINIRV32
`define NRV_MAPPED_SPI_FLASH

Synthesis is fine by using

make ICESTICK.synth

But now a few questions:

How is the memory map ?

Given

   wire mem_address_is_ram       = (mem_address[23:22] == 2'b00);   
   wire mem_address_is_io        = (mem_address[23:22] == 2'b01);
   wire mem_address_is_spi_flash = (mem_address[23:22] == 2'b10);

I think it is this way, correct ?

0x00000000 to 0x000017FF Block RAM, 6 kb
0x00800000 to 0x00BFFFFF Mapped SPI memory
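The address ranges follow directly from the three comparisons on mem_address[23:22]. A small Python sketch of that decode is given here; the function name is made up, and the "unmapped" result for 2'b11 is an assumption, since the quoted lines do not say what happens there.

```python
# Model of the address decode quoted above: bits [23:22] select the region.

def region(address):
    """Classify an address using only bits 23 and 22, as in the quoted wires."""
    sel = (address >> 22) & 0b11
    # 2'b11 is not covered by the quoted comparisons; "unmapped" is an assumption.
    return {0b00: "ram", 0b01: "io", 0b10: "spi_flash"}.get(sel, "unmapped")

if __name__ == "__main__":
    for addr in (0x000000, 0x0017FF, 0x400000, 0x800000, 0xBFFFFF, 0x080000):
        print(f"0x{addr:06X} -> {region(addr)}")
```

Under these three comparisons, 0x800000 to 0xBFFFFF is the SPI flash window and 0x400000 is the IO base. The address 0x80000 mentioned earlier in the thread decodes as RAM in this model.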

With this, do I get the bitstream starting at 0x00800000, or is there an offset, shifting the bitstream out of "mapped view" ?

How do I configure the Reset address of FemtoRV to start executing from 0x00800000 or 0x00800000 + Offset-to-the-end-of-the-bitstream ?

I assume changing this piece will do the trick:

   always @(posedge clk) begin
      if(!reset) begin
         state <= INITIAL;
         addressReg <= 0;
         PC <= 0;
      end else

By the way, I think no precompiled firmware.hex should be necessary when using the mapped SPI memory feature.

Matthias

BrunoLevy commented 3 years ago

Hi Matthias,

Best, -- B

Mecrisp commented 3 years ago

Hooray ! It works ! Blinky in assembler written to -o 1M is up and running !

PS: When using

`define NRV_RESET_ADDR 0x800000

it gives the error

RTL/PROCESSOR/mini_femtorv32.v:290: ERROR: syntax error, unexpected TOK_ID

I changed this to

`define NRV_RESET_ADDR 32'h00800000

and it synthesises nicely.

Mecrisp commented 3 years ago

The total quantity of RAM can be queried at address IO_BASE + IO_RAM (or you can also hardwire 6K)

Hardwired. Better save the LUTs for a GPIO port.

BrunoLevy commented 3 years ago

Great ! Very happy it starts working, looking forward to seeing Forth running on it ! I pushed a new version with:

Mecrisp commented 3 years ago

I am not sure how to use the busy flag of the UART. When adding a delay, it transmits correctly, but without the delay, this code transmits garbage to the terminal.


.section .text

#################################################################################

# Mapped IO constants

.equ IO_BASE,      0x400000  # Base address of memory-mapped IO
.equ IO_LEDS,      4         # 4 LSBs mapped to D1,D2,D3,D4
.equ IO_OLED_CNTL, 8         # OLED display control.
                             #  wr: 01: reset low 11: reset high 00: normal operation
                             #  rd:  0: ready  1: busy
.equ IO_OLED_CMD,  16        # OLED display command. Only 8 LSBs used.
.equ IO_OLED_DATA, 32        # OLED display data. Only 8 LSBs used.
.equ IO_UART_CNTL, 64        # USB UART control. busy (bit 9), data ready (bit 8)
.equ IO_UART_DATA, 128       # USB UART data (read/write)
.equ IO_LEDMTX_CNTL, 256     # LED matrix control. read: LSB bit 1 if busy
.equ IO_LEDMTX_DATA, 512     # LED matrix data (write)

################################################################################

                 # x1: Link register
   li x2, 0x1800 # x2: Stack pointer, at the end of 6 kb

   li x3, IO_BASE
   li x4, 0

1: sw x4, IO_LEDS(x3)

   # Wait for busy flag being cleared
2: lw x5, IO_UART_CNTL(x3)
   andi x5, x5, 0x200 # Bit 9: Busy
   bne x5, zero, 2b

   sw x4, IO_UART_DATA(x3)

   # Small delay
   li x5, 0x4000
3: addi x5, x5, -1
   bne x5, zero, 3b

   # Next character
   addi x4, x4, 1
   j 1b

Mecrisp commented 3 years ago

The valid flag does not work as expected either:

.section .text

#################################################################################

# Mapped IO constants

.equ IO_BASE,      0x400000  # Base address of memory-mapped IO
.equ IO_LEDS,      4         # 4 LSBs mapped to D1,D2,D3,D4
.equ IO_OLED_CNTL, 8         # OLED display control.
                             #  wr: 01: reset low 11: reset high 00: normal operation
                             #  rd:  0: ready  1: busy
.equ IO_OLED_CMD,  16        # OLED display command. Only 8 LSBs used.
.equ IO_OLED_DATA, 32        # OLED display data. Only 8 LSBs used.
.equ IO_UART_CNTL, 64        # USB UART control. busy (bit 9), data ready (bit 8)
.equ IO_UART_DATA, 128       # USB UART data (read/write)
.equ IO_LEDMTX_CNTL, 256     # LED matrix control. read: LSB bit 1 if busy
.equ IO_LEDMTX_DATA, 512     # LED matrix data (write)

################################################################################

                 # x1: Link register
   li x2, 0x1800 # x2: Stack pointer, at the end of 6 kb

   li x3, IO_BASE

   li x4, 42 # Emit a * on first loop run

1: sw x4, IO_LEDS(x3)
   sw x4, IO_UART_DATA(x3)

2: # Wait for valid flag being set
   lw x5, IO_UART_CNTL(x3)
   andi x5, x5, 0x100 # Bit 8: Valid

     addi x6, x6, 1 # Spin LEDs as indicator
     sw x6, IO_LEDS(x3)

   beq x5, zero, 2b

   lw x4, IO_UART_DATA(x3)
   addi x4, x4, 1 # Echo back a different character

   j 1b

Mecrisp commented 3 years ago

By the way, I use the PicoSoC-UART by Claire Wolf in Mecrisp-Ice, which is smaller than the one of James Bowman, at least for me:


/*
 *  PicoSoC - A simple example SoC using PicoRV32
 *
 *  Copyright (C) 2017  Clifford Wolf <clifford@clifford.at>
 *
 *  Permission to use, copy, modify, and/or distribute this software for any
 *  purpose with or without fee is hereby granted, provided that the above
 *  copyright notice and this permission notice appear in all copies.
 *
 *  THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
 *  WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
 *  MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
 *  ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
 *  WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
 *  ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
 *  OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 *
 */

 // October 2019, Matthias Koch: Renamed wires

module buart (
    input clk,
    input resetq,

    output tx,
    input  rx,

    input  wr,
    input  rd,
    input  [7:0] tx_data,
    output [7:0] rx_data,

    output busy,
    output valid
);

    reg [3:0] recv_state;
    reg [$clog2(`cfg_divider)-1:0] recv_divcnt;   // Counts to cfg_divider. Reserve enough bits !
    reg [7:0] recv_pattern;
    reg [7:0] recv_buf_data;
    reg recv_buf_valid;

    reg [9:0] send_pattern;
    reg [3:0] send_bitcnt;
    reg [$clog2(`cfg_divider)-1:0] send_divcnt;   // Counts to cfg_divider. Reserve enough bits !
    reg send_dummy;

    assign rx_data = recv_buf_data;
    assign valid = recv_buf_valid;
    assign busy = (send_bitcnt || send_dummy);

    always @(posedge clk) begin
        if (!resetq) begin

            recv_state <= 0;
            recv_divcnt <= 0;
            recv_pattern <= 0;
            recv_buf_data <= 0;
            recv_buf_valid <= 0;

        end else begin
            recv_divcnt <= recv_divcnt + 1;

            if (rd) recv_buf_valid <= 0;

            case (recv_state)
                0: begin
                    if (!rx)
                        recv_state <= 1;
                    recv_divcnt <= 0;
                end
                1: begin
                    if (2*recv_divcnt > `cfg_divider) begin
                        recv_state <= 2;
                        recv_divcnt <= 0;
                    end
                end
                10: begin
                    if (recv_divcnt > `cfg_divider) begin
                        recv_buf_data <= recv_pattern;
                        recv_buf_valid <= 1;
                        recv_state <= 0;
                    end
                end
                default: begin
                    if (recv_divcnt > `cfg_divider) begin
                        recv_pattern <= {rx, recv_pattern[7:1]};
                        recv_state <= recv_state + 1;
                        recv_divcnt <= 0;
                    end
                end
            endcase
        end
    end

    assign tx = send_pattern[0];

    always @(posedge clk) begin
        send_divcnt <= send_divcnt + 1;
        if (!resetq) begin
            send_pattern <= ~0;
            send_bitcnt <= 0;
            send_divcnt <= 0;
            send_dummy <= 1;
        end else begin
            if (send_dummy && !send_bitcnt) begin
                send_pattern <= ~0;
                send_bitcnt <= 15;
                send_divcnt <= 0;
                send_dummy <= 0;
            end else
            if (wr && !send_bitcnt) begin
                send_pattern <= {1'b1, tx_data[7:0], 1'b0};
                send_bitcnt <= 10;
                send_divcnt <= 0;
            end else
            if (send_divcnt > `cfg_divider && send_bitcnt) begin
                send_pattern <= {1'b1, send_pattern[9:1]};
                send_bitcnt <= send_bitcnt - 1;
                send_divcnt <= 0;
            end
        end
    end
endmodule

Mecrisp commented 3 years ago

I changed the wire names to be a drop-in replacement.

Mecrisp commented 3 years ago

Oh, I just see it: You changed the UART interface. But how do I read the valid/busy flags without actually fetching the character ? In Forth, there are traditionally four routines for the terminal: EMIT? EMIT KEY? KEY, and it's important to be able to check the flags without actually transmitting/receiving something. Maybe the other UART will give you enough LUTs to re-insert the UART flag register.

Mecrisp commented 3 years ago

You could use different write strobes for that. Address strobe +0 should have fetch/transmit side effects, address strobe +1 for the flags should not. Then both behaviours are available for the software side depending on using lb/sb and lh/sh.

BrunoLevy commented 3 years ago

Hi Matthias,

olofk commented 3 years ago

If you want to save some more LUTs you can always do like I did in the SERVant SoC and bitbang the UART with a single GPIO instead and drive it like this https://github.com/olofk/serv/blob/master/zephyr/drivers/serial/uart_bitbang.c#L21 with the correct amount of NOPs for your CPU speed :)

Mecrisp commented 3 years ago

@olofk Hey, thanks, nice to read you here ! We know SERV and I am very impressed with it, but we'll just try a drop-in exchange with a more traditional UART :-)

BrunoLevy commented 3 years ago

Hi @olofk, very happy to hear from you, and thanks a lot for the pointer to your UART bitbanging code ! For now what I'm trying to do is to balance speed/LUT count/number of Verilog lines/legibility (the goal is to transform the material into a course). Clearly bitbanging can be sometimes a good option ! (I'm doing that to talk to the SDCard), it will depend on how many LUTs remain in the end !

olofk commented 3 years ago

A proper UART is definitely the better choice unless the main goal is to minimize resource usage. Wasn't sure how much extra space you had on the small iCE40 devices.

BrunoLevy commented 3 years ago

@Mecrisp, how do you configure `cfg_divider, is it simply (clock freq / bauds) or is it something more subtle ?

Mecrisp commented 3 years ago

Exactly that. Nothing special.
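As a worked example of that rule, here is a minimal Python sketch using the 48 MHz figures quoted elsewhere in this thread. The function names are hypothetical. clog2 mirrors Verilog's $clog2, which sizes the divcnt counters in the UART listing above.

```python
# cfg_divider = clock frequency / baud rate, plus the counter width that
# $clog2(`cfg_divider) gives in the UART listing. Illustrative helper names.

def uart_divider(clock_hz, baud):
    """cfg_divider as described above: clock frequency / baud, truncated."""
    return clock_hz // baud

def clog2(n):
    """Ceiling of log2(n) for n >= 1, like Verilog's $clog2."""
    return (n - 1).bit_length()

def counter_can_exceed(divider):
    """True if a $clog2(divider)-bit counter can hold a value > divider,
    which the 'divcnt > cfg_divider' comparisons in the UART rely on."""
    max_value = (1 << clog2(divider)) - 1
    return max_value > divider

if __name__ == "__main__":
    for baud in (115200, 230400):
        d = uart_divider(48_000_000, baud)
        print(baud, d, clog2(d), counter_can_exceed(d))
```

One thing this model suggests: for a divider that is an exact power of two, a $clog2-wide counter cannot reach a value above the divider, so the `>` comparison could never become true. That may be the reason for the "Reserve enough" remark in the UART comments, but this is an inference, not something stated in the thread.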

BrunoLevy commented 3 years ago

Claire's UART interfaced.

BrunoLevy commented 3 years ago

... also trying some LUT-golfing in Claire's code. -> 1225 LUTs so far... (not stellar, trying other things...)

Mecrisp commented 3 years ago

Try another baudrate. You may get surprising results. My idea was that a faster baudrate results in a smaller divider and hence in less logic for counter and comparison, but I just tried it on Mecrisp-Ice 1.8c for HX1K and got:

1273 LUTs with a divider of

`define cfg_divider 208 // 48 MHz / 230400

and 1227 LUTs with

`define cfg_divider 416 // 48 MHz / 115200

Matthias

BrunoLevy commented 3 years ago

It is very difficult to forecast which configuration will give what ! Well for now I'm stuck around 1220 LUTs, I have pushed the new version. It is still possible to use the UART from J1 (there is a toggle in RTL/DEVICES/uart.v)

There is also something I do not understand. To generate the "half baud" clock for receive, there is this test:

wire recv_half_baudclk = recv_divcnt > divider/2;

Normally, it is possible (and less costly in terms of LUTs) to replace it with this one, since recv_divcnt is reset to zero at the next cycle:

wire recv_half_baudclk = (recv_divcnt == divider/2 + 1);

But when I do that, things become unstable (sometimes I receive random characters).

Many mysteries to be investigated ! -- Bruno

P.S. Now working on reviving the control register.

BrunoLevy commented 3 years ago

Pushed new version with control register. Current LUT golfing par:

UART, LEDS, 90 MHz, MINIRV32 => 1084 LUTs
UART, LEDS, Mapped SPI, 90 MHz, MINIRV32 => 1232 LUTs

-- B

BrunoLevy commented 3 years ago

Merged the baud generator for send and receive, gained 30 LUTs or so... ... now trying to merge bitcount (LUT golfing is so addictive...)

olofk commented 3 years ago

... now trying to merge bitcount (LUT golfing is so addictive...)

Ain't going to argue with that :)

LUT golfing tip for n-bit counters that count to k: Make it an n+1-bit downcounter, load it with k-1 and check when msb is set (wraparound). Costs an extra adder bit but saves an n-bit comparison
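A behavioral sketch of this trick in Python, to make the cycle counting visible. It is a model written for this note, not olofk's code. It assumes the reload is decided from the registered MSB, so the wrapped state itself occupies one cycle. In that variant a load value of L gives a period of L + 2 cycles, so the exact load value (k-1 or otherwise) depends on how the original counter was counted and is worth checking in simulation.

```python
# Model of an (n_bits + 1)-bit down-counter that reloads when its MSB is set.
# The MSB test replaces an n-bit equality/greater-than comparison.

def downcounter_ticks(load, n_bits, cycles):
    """Return the cycle numbers at which the MSB (wraparound) is observed."""
    width = n_bits + 1
    mask = (1 << width) - 1
    msb = 1 << n_bits
    counter = load                     # load must be < 2**n_bits (MSB clear)
    ticks = []
    for cycle in range(cycles):
        if counter & msb:              # counter wrapped below zero
            ticks.append(cycle)
            counter = load             # reload on the wraparound cycle
        else:
            counter = (counter - 1) & mask
    return ticks

if __name__ == "__main__":
    ticks = downcounter_ticks(load=6, n_bits=3, cycles=24)
    print(ticks)
    print([b - a for a, b in zip(ticks, ticks[1:])])  # period between ticks
```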

Mecrisp commented 3 years ago

I confirm UART flags are working. Now on to port Mecrisp-Quintus !


.section .text

#################################################################################

# Mapped IO constants

.equ IO_BASE,         0x400000  # Base address of memory-mapped IO
.equ IO_LEDS,         4         # 4 LSBs mapped to D1,D2,D3,D4
.equ IO_OLED_CNTL,    8         # OLED display control.
                                #  wr: 01: reset low 11: reset high 00: normal operation
                                #  rd:  0: ready  1: busy
.equ IO_OLED_CMD,       16      # OLED display command. Only 8 LSBs used.
.equ IO_OLED_DATA,      32      # OLED display data. Only 8 LSBs used.
.equ IO_DEVICES_FREQ,   64      # HW config: devices and frequency
.equ IO_UART_CNTL,    8192      # USB UART control. busy (bit 9), data ready (bit 8)
.equ IO_UART_DATA,     128      # USB UART data (read/write)
.equ IO_RAM,           256      # HW config: Installed amount of RAM
.equ IO_LEDMTX_DATA,   512      # LED matrix data (write)

################################################################################

                 # x1: Link register
   li x2, 0x1800 # x2: Stack pointer, at the end of 6 kb

   li x3, IO_BASE
   li x9, IO_BASE + IO_UART_CNTL

   li x4, 0

1: # sw x4, IO_LEDS(x3)

   # Wait for busy flag being cleared
2: lw x5, 0(x9)

     srli x10, x5, 8
     sw x10, IO_LEDS(x3)

   andi x5, x5, 0x200 # Bit 9: Busy
   bne x5, zero, 2b

   sw x4, IO_UART_DATA(x3)

   # Small delay
   li x5, 0x4
3: addi x5, x5, -1
   bne x5, zero, 3b

   # Next character
   addi x4, x4, 1
   j 1b

Mecrisp commented 3 years ago

Oh, wait: This one does not work !

.section .text

#################################################################################

# Mapped IO constants

.equ IO_BASE,         0x400000  # Base address of memory-mapped IO
.equ IO_LEDS,         4         # 4 LSBs mapped to D1,D2,D3,D4
.equ IO_OLED_CNTL,    8         # OLED display control.
                                #  wr: 01: reset low 11: reset high 00: normal operation
                                #  rd:  0: ready  1: busy
.equ IO_OLED_CMD,       16      # OLED display command. Only 8 LSBs used.
.equ IO_OLED_DATA,      32      # OLED display data. Only 8 LSBs used.
.equ IO_DEVICES_FREQ,   64      # HW config: devices and frequency
.equ IO_UART_CNTL,    8192      # USB UART control. busy (bit 9), data ready (bit 8)
.equ IO_UART_DATA,     128      # USB UART data (read/write)
.equ IO_RAM,           256      # HW config: Installed amount of RAM
.equ IO_LEDMTX_DATA,   512      # LED matrix data (write)

################################################################################

                 # x1: Link register
   li x2, 0x1800 # x2: Stack pointer, at the end of 6 kb

   li x3, IO_BASE
   li x9, IO_BASE + IO_UART_CNTL

   li x4, 0

1: # sw x4, IO_LEDS(x3)

   # Wait for busy flag being cleared
2: lw x5, 0(x9)

     srli x10, x5, 8
     sw x10, IO_LEDS(x3)

   andi x5, x5, 0x200 # Bit 9: Busy
   bne x5, zero, 2b

   sw x4, IO_UART_DATA(x3)

   # Next character
   addi x4, x4, 1
   j 1b

Mecrisp commented 3 years ago

When commenting out the instruction "bne x5, zero, 3b" the program stops working. Maybe a fault in CPU/fetch logic ?


.section .text

#################################################################################

# Mapped IO constants

.equ IO_BASE,         0x400000  # Base address of memory-mapped IO
.equ IO_LEDS,         4         # 4 LSBs mapped to D1,D2,D3,D4
.equ IO_OLED_CNTL,    8         # OLED display control.
                                #  wr: 01: reset low 11: reset high 00: normal operation
                                #  rd:  0: ready  1: busy
.equ IO_OLED_CMD,       16      # OLED display command. Only 8 LSBs used.
.equ IO_OLED_DATA,      32      # OLED display data. Only 8 LSBs used.
.equ IO_DEVICES_FREQ,   64      # HW config: devices and frequency
.equ IO_UART_CNTL,    8192      # USB UART control. busy (bit 9), data ready (bit 8)
.equ IO_UART_DATA,     128      # USB UART data (read/write)
.equ IO_RAM,           256      # HW config: Installed amount of RAM
.equ IO_LEDMTX_DATA,   512      # LED matrix data (write)

################################################################################

                 # x1: Link register
   li x2, 0x1800 # x2: Stack pointer, at the end of 6 kb

   li x3, IO_BASE
   li x9, IO_BASE + IO_UART_CNTL

   li x4, 0

1: # sw x4, IO_LEDS(x3)

   # Wait for busy flag being cleared
2: lw x5, 0(x9)

     srli x10, x5, 8
     sw x10, IO_LEDS(x3)

   andi x5, x5, 0x200 # Bit 9: Busy
   bne x5, zero, 2b

   sw x4, IO_UART_DATA(x3)

   # Small delay
   li x5, 0x1
3: addi x5, x5, -1
   bne x5, zero, 3b # When commenting out this jump, the program stops working.

   # Next character
   addi x4, x4, 1
   j 1b

Mecrisp commented 3 years ago

How I build it:

memmap:


MEMORY
{
   rom(RX)   : ORIGIN = 0x00800000, LENGTH = 0x400
}

SECTIONS
{
   .text : { *(.text*) } > rom
}

Assemble:


riscv64-linux-gnu-as blinky.s -o blinky.o -march=rv32i
riscv64-linux-gnu-ld -o blinky.elf -T memmap blinky.o -m elf32lriscv
riscv64-linux-gnu-objdump -Mnumeric -D blinky.elf > blinky.list
riscv64-linux-gnu-objcopy blinky.elf blinky.bin -O binary

BrunoLevy commented 3 years ago

Thank you very much for the update. Yes, I haven't tested exec from SPI very much yet, so there are probably a couple of bugs remaining (I fixed a big one yesterday). I will probably need to write a simulator for the SPI flash to be able to see what's going on. -- B

BrunoLevy commented 3 years ago

@olofk thank you very much for this trick, I love it !

BrunoLevy commented 3 years ago

@olofk Wonderful, this saved me an additional 20 LUTs !

Mecrisp commented 3 years ago

Hi Bruno,

I wish you an enjoyable new year !

The bug with the blinky code above is fixed in your latest commit, and I started porting Forth to FemtoRV. To see what happens, I inserted LED patterns at some locations, and it seems as if it hangs forever in this loop which is designed to skip a routine, searching for the ret opcode at the end:

  li x14, 0x00008067 # Ret-Opcode

1:lw x15, 0(x8)
  addi x8, x8, 4
  bne x15, x14, 1b

I am, however, not sure if the loop fails, or if the initial value in x8 is already wrong at the beginning. As the codebase of Mecrisp-Quintus is running on other RISC-V processors nicely, I suspect there is still a bug in execution from Flash memory.
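To make the intent of that loop explicit, here is a small Python model of it. It is only a sketch: the memory is a plain dictionary of word-aligned addresses, the helper names are hypothetical, and the example addresses are made up. The encoder shows that 0x00008067 is jalr x0, 0(x1), i.e. ret.

```python
# Model of the "skip a routine" loop: read words until the ret opcode is seen.

RET = 0x00008067  # the value loaded into x14 in the listing above

def encode_jalr(rd, rs1, imm):
    """Encode JALR (I-type, opcode 0x67, funct3 = 0)."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (0 << 12) | (rd << 7) | 0x67

def skip_routine(memory, x8):
    """Mirror of the loop: lw x15, 0(x8); addi x8, x8, 4; repeat until x15 == RET.
    Returns x8, which then points just past the ret instruction.
    Raises KeyError if it runs off the modelled memory (the real loop would spin)."""
    while True:
        x15 = memory[x8]
        x8 += 4
        if x15 == RET:
            return x8

if __name__ == "__main__":
    NOP = 0x00000013  # addi x0, x0, 0
    memory = {0x800000: NOP, 0x800004: RET, 0x800008: NOP}
    print(hex(encode_jalr(rd=0, rs1=1, imm=0)))
    print(hex(skip_routine(memory, 0x800000)))
```

The loop terminates only when a word exactly equal to 0x00008067 is read, so either a wrong starting x8 or a wrongly fetched word would keep it running, which matches the two possibilities described above.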

Are you able to compile your C demos to run directly from SPI flash, are there known issues yet ?

Matthias

BrunoLevy commented 3 years ago

Hi Matthias, After a big pass of reorg of the IO-space and HDMI for the ULX3S, I will work again on exec-from-spi and keep you updated. Best wishes (and happy new year :-) -- B

Mecrisp commented 3 years ago

Hi Bruno,

something you might find interesting: Here is an SDRAM controller specifically made for the ULX3S and with a bus interface designed for PicoRV32.

https://github.com/rxrbln/picorv32/blob/master/picosoc/sdram.v

Some files are missing in the repository, the project cannot be synthesised as-is and the bugtracker is deactivated, but I already contacted the author via E-Mail.

Have fun with clocks and gates :-) Matthias

Mecrisp commented 3 years ago

Reply from author:

Hi,

done:

https://www.youtube.com/watch?v=YoILfUAmwjU

https://github.com/rxrbln/picorv32

Kind regards, René Rebe

BrunoLevy commented 3 years ago

Hi Matthias, I'm now working on exec from SPI. I have tried a couple of simple programs, it seems to work (but it does not prove that there is no bug !) Different things that need care:

Then one problem remains: sdata segments (initialized RW). Do you have some in Mecrisp ?

BrunoLevy commented 3 years ago

P.S. since I do not use fast SPI modes, it is super slow (maybe 32 times slower than exec from BRAM). We'll probably need a small instr cache...

BrunoLevy commented 3 years ago

Something else that I noted: starting execution from NRV_RESET_ADDR does not always work (I probably still have a bug in the processor), so for now I'm using FIRMWARE/ASM_EXAMPLES/jump_to_spi_flash.S