Open jeras opened 2 years ago
Hi @jeras, thank you very much for the feedback, I'll double check and test. Best, -- Bruno
Internal state register aluShamt
is missing a reset, causing simulation to fail on WAIT_ALU_OR_MEM
state.
I did not analyze what a random value in aluShamt
would do.
On FPGA this would not affect power-on reset, since all registers are initialized to 0
, but it could cause issues if the CPU is reset in the middle of a shift.
ASIC would be affected.
x0
is not initialized to zero, which again would be fine for FPGA, but not for an ASIC.
Some FPGA tools allow you to provide the initial value of a register, which would work for both synthesis and simulation.
An ASIC would need either a register reset, or another mechanism.
The instr
register does not have a reset, which probably should not be an issue.
A possible issue is mixing registers with and without reset in the same always
statement.
This mixing means, the reset
signal will be used to implement the instr
logic.
I am not sure what the effect would be with a synchronous reset, on an ASIC with an asynchronous reset, this would add a multiplexer at the input into instr
with reset as the select signal and a loop from instr
output to one of the inputs into the mux. This would assure instr
would keep its value while reset is active.
So by splitting the always statement into one with reset for state
and PC
and one for the other registers might reduce logic consumption a bit.
A related issue is descibed in this document section "3.1 Synchronous reset flip-flops with non reset follower flip-flops": http://www.sunburst-design.com/papers/CummingsSNUG2003Boston_Resets.pdf
Thank you for the observations and hints! Are you going to produce an ASIC with the Quark in it? The Quark is designed around using BRAMs as its register set, which would be implemented with actual registers and muxes that gives hardwired zero for x0. Also absolutely go for a barrel shifter in ASIC! In terms of size, the Quark-with-barrel-shifter still fits in HX1K FPGA, and it is a real gain. The official Quark does not have it because we needed every possible LUT for adding peripherals, but beyond the constraints of the tiny HX1K FPGA, much better performance/area ratio can be achieved.
By the way, if I were about to design an ASIC, I would shoot for some very cool analog chip with special optical features :-) Digital design can be done with a vanilla FPGA next to it.
First I must say I noticed some initialization code at the end of the RTL under a BENCH ifdef
I did not notice before, this should handle many simulation issues I mentioned above.
I still think the reset value of state
is problematic.
aluShamt
seems to relay on the reset value of state
being WAIT_ALU_OR_MEM
and multiple clock cycles of active reset to clear to zero. This would at least have to be documented, but would be much easier to just add a reset.
I do not plan to make an ASIC with quark
, although I have seen some synthesis attempts:
https://bitlog.it/20220118_asic_roundup_of_open_source_riscv_cpu_cores.html
I needed a very small RISC-V implementation to compare against mine, since I have some issues making my instruction decoder small due to an experimental coding style.
Before you continue reading please note, I did not go through the episodic design notes yet. I will try to get through before asking too many further questions. An open a separate issue if I will have further questions or suggestions.
@Mecrisp, you mentioned quark was designed to use BRAM on FPGA (if I understood the sentence correctly). But the RTL is written like a FlipFlop implementation with a write port and 2 asynchronous read ports (data multiplexers). A short name for this port configuration is 2R1W. Yesterday I compiled it with Vivado for the Arty board and apparently a BRAM block was used for the register file.
I am perplexed how BRAM was inferred from the given RTL. The Artix BRAM has true dual port support, 2 independent read/write ports in short 2RW. So one port would have to be shared between a read and a write, which could not be done simultaneously. Since this sharing is not explicit in the RTL, I am not sure how the synthesis tool was able to infer a BRAM. In my experience inference is a very delicate process often resulting in suboptimal implementations.
I only tried to synthesize for Lattice HX1K using your makefiles (with yosys), and there it is even less clear to me if and how a BRAM was used. The HX1K ports are 1W1R and the largest data width configuration is 16, so at least 4 blocks would be needed to implement a 1W2R 32-bit register file. On the other hand if the register file was synthesized into Flip-Flops, this would be 1024 FF or about 80% of the available flops on the smallest device.
If (when) I would start a CPU design for HX1K, I would just store the register file in the main memory (probably the end), this would result in the smallest logic consumption. I think a similar approach might be used in 8-bit AVR, since load/store instructions can access the GPRs. I was also looking at GOWIN FPGA devices, which have distributed RAM with asynchronous read support, which is a better fit for my current CPU design, for now I am synthesizing for Artix and Cyclone V.
When aluShamt has a non-zero value after coming fresh from a short reset pulse, shifting is active, but as the processor starts in WAIT_ALU_OR_MEM state, it will perform up to 31 dummy shift cycles and then resume operation. I know, this is not the most elegant solution, but we were aggressively optimising for minimum LUT usage, and therefore we added reset initialisations only when absolutely necesssary. BENCH section will cover your needs.
Most of the paths we explored during design are described in https://github.com/BrunoLevy/learn-fpga/issues/1
https://bitlog.it/20220118_asic_roundup_of_open_source_riscv_cpu_cores.html
Cool!
There is no asynchronous read of the register file, reads are actually clocked as well. See this snipplet of the relevant lines copied from Quark source:
reg [31:0] rs1;
reg [31:0] rs2;
reg [31:0] registerFile [31:0];
always @(posedge clk) begin
if (writeBack)
if (rdId != 0)
registerFile[rdId] <= writeBackData;
...
state[WAIT_INSTR_bit]: begin
if(!mem_rbusy) begin // may be high when executing from SPI flash
rs1 <= registerFile[mem_rdata[19:15]];
rs2 <= registerFile[mem_rdata[24:20]];
@Mecrisp, you mentioned quark was designed to use BRAM on FPGA (if I understood the sentence correctly).
Correct.
We are specifying a 2R1W implementation on a FPGA that only offers 1R1W block RAMs in hardware. In order to achieve 32 bit wide access, actually 4 blocks of memory are used on HX1K for register file.
Like you, we were totally surprised by the marvellous RAM interference capabilities Yosys has to offer.
Keeping the register file in memory is possible; the SERV core is working that way, but it decreases performance when compared to a separate register file. The Quark as-is needs 3 cycles for regular and 5 cycles for load/store instructions (up to 34 cycles for shifts); Quark-Bicycle achieves 2 cycles for regular instructions and 4 cycles load/store with a barrel shifter and additional address bus multiplexers in execute state. Cycles given when running from RAM, no memory busy.
Thanks for all the precious details.
I still need to read through the SERV documentation, I would like to understand how much is serialized, since probably it is not everything (instruction decoder?).
There is probably still some middle ground between SERV and quark which is worth exploring. I think it should be possible to get a CPI around 3~5 while storing instruction/data and registers in the same memory, which is close to the current quark. But memory utilization would be much better, so some memory could be available to peripherals, or there would just be more memory for CPU instruction/data.
I had a look at the Quark-Bicycle source code. I will definitely borrow the idea of sharing a single shifter for left/right.
I apologize in advance for the following rant, my social skills are not good enough for me to be educational but not condescending.
Code where every bit is coded individually rubs me the wrong way (it took me 15 min just to find an appropriate phrase): https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark_bicycle.v#L131-L137 While SystemVerilog provides "11.4.14 Streaming operators (pack/unpack)" which can be used for bit reversal, I have never encountered them used in RTL. A good Verilog-2005 alternative is to use a function with a for loop, something like the Verilog equivalent of this SystemVerilog code:
function logic [XLEN-1:0] reverse (logic [XLEN-1:0] value);
for (int i=0; i<32; i++) reverse[i] = value[XLEN-i-1];
endfunction: reverse
This function can be then used for both reversal operations. The same approach is great for Gray encoder/decoder.
I have a special disdain for generated CRC code. When the same can be achieved with totally generic XOR and a shift on a vector inside a loop. https://bues.ch/cms/hacking/crcgen I wanted to give you and example, I have written many, but never for an open source project, so I have nothing public yet. The following is an example I found online, but it has an extra loop compared to my code which applies XOR directly to vectors (data and polynomial): https://github.com/hellgate202/crc_calc/blob/master/src/crc_calc.sv
How will you handle memory busy signals when you aim for similiar CPI but with registers in main memory?
Yesterday I wrote a draft spec for a single memory CPU: https://github.com/jeras/rp32/tree/master/hdl/rtl/r5p_1mem At the end is a short comment on handling memory busy with a bus based on tightly integrated memory interfaces (also in early draft state): https://github.com/jeras/rp32/blob/master/doc/Sysbus.md There are some waveforms available, but they are still in wavedrom json format. Read (1 cycle fixed delay for read data): Write:
I find all standard low resource buses inadequate. Actually the only option is APB and it is really bad, it requires 2 clock cycles for each read/write without any wait states. The basic idea of my proposal is to handle busy in the first clock period of a transfer for both read and write. Instead of providing busy for a read in the same cycle as read data. Read data is than available after a fixed delay, for simple systems (like the proposed processor) this delay is 1 clock cycle. APB peripherals would have to be rewritten, for now I have only written a GPIO.
You could try a combinatorial piece of logic for short-circuiting the wait for busy/ready signaling depending on the address range. Bruno inserted a short-circuit for sw/sh/sb working this way; if you have asynchronous read of the register file, you might do similiar for fetch. Quark is currently limited to a minimum of 2 CPI per instruction because the register file needs one cycle after the opcode arrives.
I already took advantage off all possible combinational bypasses in the draft design. The longest phase sequence is 4 (fetch, rs1 read, rs2 read, write-back for most instructions), and all are memory accesses. So at least for 1RW memories this sequence is optimal. For dual port memories it would be possible to combine at least some memory accesses and have a shorter sequence.
I made this design with ASIC in mind. On an ASIC you can usually choose what kind of memories you would like to use, and single port memories are about 30% smaller than dual port. So on ASIC a processor without a GPR register file would not make much sense with dual port memories.
It would probably be possible to create some king of pipeline for quark with at least some instructions executing in a single clock cycle, but I am not sure it would be worth the effort.
Do all processors in the FemtoRV32 family use a synchronous GPR register file?
My original R5P processor design has a CPI of 1 for all instructions. This requires an asynchronous GPR register file. So my focus was on FPGA devices with 1RW1R distributed memories where the 1R port can be asynchronous. Some of this FPGA are Artix, EPS5, Cyclone 5, GOWIN, ... I only started thinking about a CPU without a register file after talking to you and thinking, it could be very small in an ASIC.
As I understand cost was the main reason for focusing on Lattice HX1K, did you consider low cost solutions from GOWIN and other low cost vendors? https://www.dialog-semiconductor.com/products/greenpak/low-power-low-cost-forgefpga https://www.quicklogic.com/products/fpga/fpgas-sram/
I personally prefer the major vendors due to mature tools, but with improvements in open source tools low cost FPGA might become more attractive.
Great! I'll keep an eye on your ongoing development efforts.
Do all processors in the FemtoRV32 family use a synchronous GPR register file?
At this point in time, yes.
As I understand cost was the main reason for focusing on Lattice HX1K, did you consider low cost solutions from GOWIN and other low cost vendors?
We are strong FOSS advocates here, and Project Apicula for Gowin synthesis wasn't available back then. It all started with Lattice HX1K.
I personally prefer the major vendors due to mature tools, but with improvements in open source tools low cost FPGA might become more attractive.
I do FPGA firmware development for aerospace as my main job, using commercial tools, and I still strongly prefer Yosys.
I tried to run a simulation of
femtorv32_quark.v
using the Vivado simulator, because I my SoC gets past synthesis well, but gets minimized to nothing during implementation, I do not know what is going on.So I noticed the memory bus address is
x
after reset is removed. This would not be a problem with Verilator, but in a 4-state simulatorx
is likely to be propagated through the design. In my case the CPU does not change state with clock. I went looking at what is on the address bus at reset: https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v#L217-L218 https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v#L335 And the state is initialized toWAIT_ALU_OR_MEM
while stateFETCH_INSTR
sounds like the right choice to have PC on the address bus at reset.So I made this change, without checking if there are any other non obvious expectations. Verilator simulations got the same results as before and in Vivado simulation the CPU actually started fetching instructions.
Could you please review this change with your insight and maybe modify similar issues in other implementations.
The morale of the story is, Verilator 2-state simulation can sometimes hide some issues, expecially if don't care values are often used in RTL. My RTL is full of them, so I have to simulate with
--x-assign
argument set to both0
and1
.