efabless / caravel_mgmt_soc_litex

https://caravel-mgmt-soc-litex.readthedocs.io/en/latest/
Apache License 2.0

The CPU is slow #129

Open mo-hosni opened 1 year ago

mo-hosni commented 1 year ago

While working on an AES accelerator integrated into the user_project_wrapper of Caravel, we noticed that the CPU takes around 9000 clock cycles to perform a single read or write of a 32-bit register in the user_project_wrapper. The project can be found here: https://github.com/mo-hosni/caravel_aes_example/tree/main

Even this simple C code, https://github.com/mo-hosni/caravel_aes_example/blob/main/verilog/dv/cocotb/encipher_test/encipher_test.c, takes a fraction of a second at a 40 MHz clock (real time, not simulation time).

I think there is something wrong here. Or is that normal?

RTimothyEdwards commented 1 year ago

What part of it is slow? Where are the routines like read_user_double_word and write_user_double_word defined? The wishbone interface is blocking, so if a process does not return an acknowledge, it can hold the processor indefinitely until it does. I assume the write_user_ and read_user_ routines are driving the wishbone bus to the user project, so you need to figure out when the user project is returning an acknowledge signal for the write or read, and whether it is spending the time in between doing expected computation.
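
For context, a minimal sketch of what such routines typically reduce to, assuming the usual Caravel convention of a user-project Wishbone window at 0x30000000 addressed by word offsets (the function names here are placeholders, not the project's actual code):

#include <stdint.h>

/* Caravel user project Wishbone window base address. */
#define USER_PROJECT_BASE 0x30000000UL

/* Each access is a single blocking Wishbone transaction: the store/load
 * only completes once the user project raises wbs_ack_o. */
static inline void write_user_word(uint32_t value, uint32_t word_offset)
{
    *(volatile uint32_t *)(USER_PROJECT_BASE + (word_offset << 2)) = value;
}

static inline uint32_t read_user_word(uint32_t word_offset)
{
    return *(volatile uint32_t *)(USER_PROJECT_BASE + (word_offset << 2));
}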

M0stafaRady commented 1 year ago

I debugged this example with @mo-hosni. We found that the wishbone interface isn't blocking the simulation: the ack for any user write or read is asserted after one cycle. Instead, instruction fetch is the biggest blocker. Instructions are read over the SPI interface, and since the SPI flash clock is 4x slower than the core clock, we estimated the best case for reading one instruction as 32 (bits) * 4 (core clocks) = 128 core clock cycles, ignoring the time spent sending commands and addresses and any delay between clocks. Also, the CPU doesn't have a big cache, so it reads some instructions more than once. From the simulation, I can see the CPU is blocked while fetching instructions.
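
As a back-of-envelope sketch of that estimate (the command/address overhead constants below are rough assumptions, not measured values):

/* Rough fetch-cost estimate, assuming plain 1-bit SPI and flash_clk = core_clk/4. */
enum {
    CORE_CLKS_PER_FLASH_CLK = 4,
    BITS_PER_INSN           = 32,
    /* Data phase only: 32 bits * 4 core clocks per bit. */
    FETCH_DATA_CYCLES  = BITS_PER_INSN * CORE_CLKS_PER_FLASH_CLK,  /* = 128 */
    /* Each new read sequence also spends roughly 8 command + 24 address bits
     * before any data returns (dummy cycles ignored). */
    FETCH_SETUP_CYCLES = (8 + 24) * CORE_CLKS_PER_FLASH_CLK        /* = 128 */
};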

I added a new test called encipher_test_faster, which is exactly the same as encipher_test except that it makes flash_clk only 2x slower than the core clock by changing the clk_div value.

Test name              SIM TIME (ns)
encipher_test          10393850.00
encipher_test_faster    5400650.00

The simulation time was approximately halved, as shown in the table. That's why we think fetching instructions into the CPU over SPI is the biggest contributor to the long simulation time.

CC: @shalan

dlmiles commented 10 months ago

@mo-hosni In my own use of Caravel on an Arty A7, there is an issue with the LiteX VexRiscv CPU iBus & dBus wishbone arbitrator: the iBus (when fetching from a slow resource, i.e. SPI flash) will starve the dBus write-back, because the CPU is always trying to feed itself instructions.

I agree with the comment above from @M0stafaRady, and the number of cycles is in line with my findings as well. However, I think the problem @mo-hosni is experiencing is due to the LiteX component not committing a single write transaction to their user_project_wrapper hardware device in a timely manner.

One resolution is to force a CPU STALL after an important write-back that needs to be committed over WishBone as soon as possible; this can be done with minimal side-effects using the instruction lw zero,(zero). The stall forces the CPU to stop demanding to be fed via the iBus, which is the only way for the arbitrator to re-assess the grant status and switch it to allow the dBus write-back to proceed (followed by the extra lw zero,(zero)) before the CPU starts demanding instructions from the iBus again. This is entirely normal and expected Vex behaviour, but the LiteX adapter provided in Caravel doesn't accommodate it in the embedded environment.

When the iBus is fed by faster memory, such as the single-cycle DFF RAM (a UserManagedCacheArea), the CPU is slower than the instruction memory; this creates natural breaks in iBus demand during things such as register write-back, so you will not observe the problem as much. Copying your code to the DFF memory at 0x0 and executing it from there is therefore another way to alleviate this.

Even with QSPI, the Flash XIP will not reliably keep up with the CPU; the pipeline would remain mostly empty, as it still takes multiple cycles to fetch 32 bits over QSPI clocked at core_clock/2 or worse.

Ideally, with Flash XIP and the VexRiscv CPU, you want to use the independence of the iBus from the dBus so that the data path from the SPI/XIP controller is separate from the WishBone whenever the iBus address is in the expected flash image range: an independent, parallel data path.

// A verilog surgery fix might look like this (working for me on FPGA;
//  it is totally untested in any other environment, and
//  a better solution is to fix the LiteX module directly and rebuild Caravel):
diff --git a/verilog/rtl/mgmt_core.v b/verilog/rtl/mgmt_core.v
index 4a51f5a..b09a476 100644
--- a/verilog/rtl/mgmt_core.v
+++ b/verilog/rtl/mgmt_core.v
@@ -2392,7 +2414,14 @@ always @(*) begin
        endcase
 end
 assign mgmtsoc_litespimmap_spi_dummy_bits = mgmtsoc_litespimmap_storage;
+// | (mgmtsoc_dbus_dbus_cyc & mgmtsoc_dbus_dbus_we & litespi_grant == 1'd0)
+//    litespi_grant == 1'd0 // trying to only activate if this is MMAP mode
+// This will yield the iBus demand on SPI, if there is an outstanding write request issued on dBus
+//  mgmtsoc_dbus_dbus_stb is tied to cyc, maybe done_and_yield works as well as done_or_yield
+wire mgmtsoc_litespimmap_ibus_yield_grant;
+assign mgmtsoc_litespimmap_ibus_yield_grant = (mgmtsoc_dbus_dbus_cyc & mgmtsoc_dbus_dbus_we & litespi_grant == 1'd0);
 assign mgmtsoc_litespimmap_done = (mgmtsoc_litespimmap_count == 1'd0);
+wire mgmtsoc_litespimmap_done_or_yield;
+assign mgmtsoc_litespimmap_done_or_yield = mgmtsoc_litespimmap_done | mgmtsoc_litespimmap_ibus_yield_grant;
 always @(*) begin
        litespi_next_state = 4'd0;
        litespi_next_state = litespi_state;
@@ -2525,7 +2554,7 @@ always @(*) begin
                4'd8: begin
                end
                default: begin
-                       mgmtsoc_litespimmap_burst_cs_litespi_next_value0 = (mgmtsoc_litespimmap_burst_cs & (~mgmtsoc_litespimmap_done));
+                       mgmtsoc_litespimmap_burst_cs_litespi_next_value0 = (mgmtsoc_litespimmap_burst_cs & (~mgmtsoc_litespimmap_done_or_yield));
                end
        endcase
 end
@@ -7135,7 +7244,7 @@ always @(posedge sys_clk) begin
        state <= next_state;
        case (grant)
                1'd0: begin
-                       if ((~request[0])) begin
+                       if ((~request[0] | mgmtsoc_litespimmap_ibus_yield_grant)) begin
                                if (request[1]) begin
                                        grant <= 1'd1;
                                end else begin
// Workaround for such sequences:
//   my_device_register = 0x1234cdef;
//   CPU_STALL_WORKAROUND();
//   for(volatile int i = 0; i < 10000; i++) { ASM_NOP_DELAY(); }
//  /* do something else */

#define CPU_STALL_WORKAROUND()                       \
    /* CPU_STALL_WORKAROUND():                       \
     * Used to induce Caravel Vex RiscV CPU stall to \
     *  enforce dBus write-back in a timely manner.  \
     */                                              \
    __asm__ __volatile__ ("lw zero,(zero)"           \
                  : /* none */                       \
                  : /* none */                       \
                  : /* clobbers: none */ )

mithro commented 10 months ago

FYI - The author of VexRISCV, @Dolu1990, might be interested in this thread.

Dolu1990 commented 10 months ago

Does somebody have a simulation wave of the slow behaviour?

M0stafaRady commented 10 months ago

@Dolu1990 This vexrisc_run.tar.gz has the wave as well as some logs and files that might be helpful.

To extract: tar -xzvf vexrisc_run.tar.gz

Dolu1990 commented 10 months ago

@M0stafaRady Thanks :)

So, looking at it I see nothing wrong, just that the CPU is waiting nearly all the time for the i$ to refill, as the external SPI flash is slow, and it is in the range of those 128 cycles per 32-bit word. Not sure the issue is captured in that wave?

[screenshot attached] soc.gtkw.txt

Dolu1990 commented 10 months ago

What is the instruction cache configuration? (bytes / ways)

RTimothyEdwards commented 10 months ago

@Dolu1990 : FYI, the instruction cache is tiny---just a few words. I've complained before that the VexRISC is a poor choice for this microcontroller because we are trying to minimize the amount of layout area used by the processor. It's not that the VexRISC is a bad processor; it's that we don't have the resources to give it what it needs to run efficiently.

dlmiles commented 10 months ago

From my testing (on an Arty A7) there does not appear to be a cache, only a CPU working buffer at the front end of the instruction decoder.

If you execute a branch/jump backwards of just 2 instructions (targeting a 16-byte-aligned address), the iBus will demand the PC be reloaded (via Flash XIP) with exactly the same data it just retired, even when the branch target address does not leave the current aligned block. It is difficult to call this a cache, as there is no observable cache effect; from my observations it behaves like the instruction decoder's input working buffer.
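
If anyone wants to reproduce that on a waveform, here is a minimal sketch of such a loop (untested here; written as inline assembly so the compiler cannot restructure it, and the disassembly should be checked to confirm the branch target lands on a 16-byte-aligned address):

/* Spin for n iterations using a two-instruction backward branch. With no
 * effective instruction cache, each iteration should trigger a fresh iBus
 * fetch of the same aligned block. */
static void tight_backward_branch(unsigned n)
{
    __asm__ __volatile__ (
        "1: addi %0, %0, -1 \n"
        "   bnez %0, 1b     \n"
        : "+r"(n));
}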

The CPU will speculatively load the next instruction (PC+4) even in a scenario where a branch will occur. Because the pipeline is empty (the CPU is faster than the SPI flash), the iBus speculatively asks for PC+4 in the background while the branch instruction works its way through the pipeline. The Flash XIP services this more slowly than the branch instruction retires. Once the speculative instruction is provided (and discarded), the iBus asks for the branch target, which causes the Flash XIP to reset SPI CS (Chip Select) and issue a new READ command from a new address, all adding to the total time to execute the next instruction.

There is 1.5 KiB of DFF RAM at memory address 0x0 which operates as single-cycle memory and might better be called a UserManagedCacheSpace. It is able to feed the CPU iBus well enough for the internal CPU pipeline to take effect; the CPU is then slower than the single-cycle memory.

I don't disagree with @RTimothyEdwards' comments: a pipelined CPU is great when you can feed that pipeline, and even QSPI flash at core_clock speed would not be able to feed the (5-stage) Vex. So I understand the point being made about spending area on that many pipeline stages when the surrounding architecture cannot feed them.

As stated, the Vex is performing as expected. My previous comment concerns the LiteX-provided mgmt_core.v and the expectation in an embedded environment that CPU write-backs (especially to MMIO hardware) are performed in a timely manner, even if the program enters a NOP-style delay loop immediately after performing that memory write, which is common in low-MHz embedded systems.

The matter is compounded by the lack of WFI support in the chosen CPU configuration (see issue #36). Taken together with the slow flash, this means the iBus can never be idle except at a CPU STALL condition; once the iBus has the WishBone grant, it will not release it until the iBus can be made idle. The LiteX arbitrator (specifically designed for the Vex) just needs to manage things with that knowledge to achieve a better outcome.


I don't think there is any disagreement in any comment on this issue with @M0stafaRady over the speed and timings found.

What the issue is about is the quoted 9000 cycles to perform a single write (CPU store) to memory (to an MMIO register in the user_project area).

I can explain the conditions under which that might occur, and a workaround, either in firmware (if working with the original verilog) or in verilog (which removes the need for a firmware workaround).

I cannot explain how a single read (CPU load) from memory also experiences a 9000-cycle latency; I would be interested in seeing the waveform demonstrating that.

The 9000-cycle figure is arbitrary: it is possible to write to an internal register, then enter a NOP-style delay loop that lasts 10 seconds, then perform a CPU load or store, and only then is the write to the internal register made 10 seconds ago committed over WishBone, immediately followed by the recently requested CPU load or store (because once the dBus has the grant, it gets to exhaust all dBus requests before releasing it). The grant in question is the one indicated in the 3rd hunk of the diff in my previous comment (the verilog net called grant at mgmt_core.v:7136).
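
To make that scenario concrete, here is a sketch of the firmware pattern being described, using the CPU_STALL_WORKAROUND() macro from my earlier comment (reg_mprj_slave is a hypothetical name for a user_project MMIO register in the 0x30000000 Wishbone window):

#include <stdint.h>

/* Hypothetical user_project MMIO register at the Wishbone user base. */
#define reg_mprj_slave (*(volatile uint32_t *)0x30000000)

void example(void)
{
    reg_mprj_slave = 0x1234cdef;   /* dBus write queued behind the iBus grant */
    CPU_STALL_WORKAROUND();        /* stall lets the arbitrator re-assess the
                                      grant so the write commits now, rather
                                      than at the next dBus access after the
                                      delay loop below */
    for (volatile int i = 0; i < 10000; i++)
        ;                          /* NOP-style delay loop */
}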

mithro commented 10 months ago

@dlmiles / @RTimothyEdwards - Be careful here: VexRiscv is an extremely configurable CPU (https://github.com/SpinalHDL/VexRISCV#description) with a huge number of options which can be selected. It offers 2-stage to 5+ stage variants.

LiteX offers 19 variants as standard (https://github.com/enjoy-digital/litex/blob/9854c9f3228b8c3ff0dd8a66c1afc1f3f3c536a2/litex/soc/cores/cpu/vexriscv/core.py#L31-L49) and the ability to provide a custom variant (which is used by the CFU Playground and Fomu).

Do you know the configuration that was selected for use in Caravel?

mithro commented 10 months ago

BTW, there was also a whole bunch of work to create a very fast, very low-resource XIP SPI module for LiteX inside https://github.com/litex-hub/litespi

M0stafaRady commented 10 months ago

@dlmiles I believe this is the wave @mo-hosni was referring to. I uploaded it to Google Drive as its size exceeds GitHub's limit. https://drive.google.com/file/d/1yO5sBN5kjPEXANofh3gG9NtF7DK6hih_/view?usp=sharing

dlmiles commented 10 months ago

@mithro Throwing links and saying "look over there" doesn't seem particularly relevant to the points raised (I am certainly aware of the links you have provided). Those links do not change the issue observed by the reporter (or my own findings) with the specific implementation provided inside this GitHub repository, which is why the issue is logged here. The point of the issue is to collaborate and reach a greater understanding of the matter.

I don't think anyone is claiming any efficiency concerns with the litespi XIP SPI (performance seems good to me here). I'm not entirely sure where that comment is coming from, as the discussion here simply characterises what has been observed with the specific version inside mgmt_core.v provided in this repo.


Just to clarify versions as that is a salient point to raise.

My last comment is about the characteristics of the specific Vex version in this repo: https://github.com/efabless/caravel_mgmt_soc_litex/blob/main/verilog/rtl/VexRiscv_MinDebug.v, last modified in this repo by https://github.com/efabless/caravel_mgmt_soc_litex/commit/6995a07f296332029a93c660b1ea62c763a54995 (2 years ago, Nov 2021).

It is good to be able to characterise the behaviour of the specific version provided in this repository (provided as a good specimen to use) so that others have an informed understanding of it.

I am going to retest the same method with https://github.com/efabless/caravel_mgmt_soc_litex/blob/main/verilog/rtl/VexRiscv_MinDebugCache.v (which does have an ICache: https://github.com/efabless/caravel_mgmt_soc_litex/blob/main/verilog/rtl/VexRiscv_MinDebugCache.v#L4413); this was last modified in this repo in Sep 2022: https://github.com/efabless/caravel_mgmt_soc_litex/commit/f44fbcd3f2e10a5a1b03e3a23d33eb955bf2d434

https://github.com/efabless/caravel_mgmt_soc_litex/blob/main/verilog/rtl/mgmt_core.v was last modified by https://github.com/efabless/caravel_mgmt_soc_litex/commit/1321c51aeedc19405f5a20b68edc5bdc4fab5c57 (10 months ago, Oct 2022).

These are the versions in the main branch of this project at this time; I did not find a more recently maintained branch.


@M0stafaRady Thanks for the VCD; after a quick look at it:

Do you have a time index I should be looking at? What is the hardware device address being accessed that takes 9000 cycles?

There are two main areas where dBus latency is higher than at other times (seen as dBus_CYC becoming active): under the red vertical highlight bar, and towards the right of the image shown, soon after the GPIO 1-bit turns on (top signal). The image only covers around 2/3 of the total VCD, but there are no events this congested later in the timeline.

Picking out the example under the red vertical highlight bar, the latency for a dBus read looks to be approximately ~1479 core_clock cycles (clk): one example is at 1537175 ns, where the delay is 36975 ns (clk has a 25 ns period). PC=0x10002cfc and it looks to be a stack push/pop, not an access to a user_project MMIO device register.

This looks to be due to a blocking iBus load (8 x 32-bit): the 1479 core_clock cycles is on the order of ~11.5 32-bit words of SPI transfer time, with 8 words being the instructions and about 2 words' worth due to SPI command/address overhead (the sequence occurs twice in the timeframe) and turnaround time.

Here is the zoomed-out view of the area I picked out from your Google Drive download files (I would expect a 9000-cycle blocking latency to be visible at this zoom factor).

[Screenshot: 2023-08-30 23-17-17, Caravel_Issue129_M0R]

mithro commented 10 months ago

@dlmiles - Apologies for the noise -- I was skimming the issue and wrongly concluded that "CPU is slow" was referring to "VexRISCV executing from XSPI" (there has been a lot of iteration in the LiteX community around that topic).

M0stafaRady commented 10 months ago

@mo-hosni Do you remember the time index for the longest write?

mo-hosni commented 10 months ago

@M0stafaRady, @dlmiles : I just looked at the waveforms again. For example, for the writes in this for loop (https://github.com/efabless/caravel_aes_example/blob/a1331908121e225b8cd6064d06b0bb2f7e87401a/verilog/dv/cocotb/encipher_test/encipher_test.c#L27), the CPU takes 8120 clock cycles between each write to fill the register uut/chip_core/mprj/mprj/aes/core_key.

[waveform screenshot attached]

Dolu1990 commented 10 months ago

Hi @mo-hosni, where is USER_writeWord defined? Can you send your compiled ELF file?

M0stafaRady commented 10 months ago

@Dolu1990
USER_writeWord is defined here. ELF file: decipher_o.zip

Dolu1990 commented 10 months ago

So, looking at the assembly, I can see:

100026d0 <USER_writeWord>:
100026d0:   fe010113            addi    sp,sp,-32
100026d4:   00812e23            sw  s0,28(sp)
100026d8:   02010413            addi    s0,sp,32
100026dc:   fea42623            sw  a0,-20(s0)
100026e0:   feb42423            sw  a1,-24(s0)
100026e4:   fe842783            lw  a5,-24(s0)
100026e8:   00279713            slli    a4,a5,0x2
100026ec:   300007b7            lui a5,0x30000
100026f0:   00f707b3            add a5,a4,a5
100026f4:   fec42703            lw  a4,-20(s0)
100026f8:   00e7a023            sw  a4,0(a5) # 30000000 <_erodata+0x1fffcee0>
100026fc:   00000013            nop
10002700:   01c12403            lw  s0,28(sp)
10002704:   02010113            addi    sp,sp,32
10002708:   00008067            ret

which uses 3 blocks of instruction cache; on top of that you also have to add the cache blocks used by the loop, not sure how many, likely 1 or 2 (1 block is 32 bytes, aligned). So, let's assume it is 5 blocks overall; as the instruction cache seems to only have 2 blocks, it overflows, giving 1200*5 = 6000 cycles of waiting on the SPI flash for each iteration.

I would say what could be done is just to inline the USER_writeWord function; this should really speed things up. Also, forcing functions to be 32-byte aligned may help.
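
A sketch of what that could look like in the test firmware (the body mirrors the behaviour seen in the disassembly above, i.e. base 0x30000000 plus a word offset; load_key() is a hypothetical caller and the GCC attributes assume the usual riscv-gcc toolchain):

#include <stdint.h>

#define USER_PROJECT_BASE 0x30000000UL

/* Force inlining so each call collapses to a single sw instruction instead
 * of a call, stack frame and return all fetched over SPI. */
static inline __attribute__((always_inline))
void USER_writeWord(uint32_t data, uint32_t word_offset)
{
    *(volatile uint32_t *)(USER_PROJECT_BASE + (word_offset << 2)) = data;
}

/* Optionally keep hot functions 32-byte aligned so they start on an
 * instruction-cache block boundary. */
__attribute__((aligned(32)))
void load_key(const uint32_t *key)
{
    for (int i = 0; i < 8; i++)
        USER_writeWord(key[i], 0x10 + i);
}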

dlmiles commented 9 months ago

So the 9000-cycle measurement is the distance (in cycles) between two writes being presented to the hardware over WishBone. These writes use an iteration loop (https://github.com/efabless/caravel_aes_example/blob/main/verilog/dv/cocotb/encipher_test/encipher_test.c) and a function USER_writeWord() with stack saving, whose purpose is to perform a single SW (StoreWord) instruction to the hardware device at the WishBone user_project address.

From the comments above it has been identified that loading instructions costs between 128 and 144 cycles per instruction, even if the CPU can execute and retire that instruction in just a few cycles. And a lot of instructions are being used to achieve the goal.

The USER_writeWord function is convenient, but the fragment of assembler indicates the version you are using is built with compiler options that produce unoptimized code: a full stack frame, a frame pointer, and every argument spilled to the stack.

# With the above taken into account, it may be possible to have the compiler emit an implementation that looks more like this:
.global USER_writeWord
USER_writeWord:
  slli    a1,a1,0x2   # * sizeof(u_int32_t)
  lui     a2,0x30000  # USER_PROJECT_WISHBONE_ADDRESS_BASE = 0x30000000
  add     a1,a1,a2
  sw      a0, 0(a1)
  ret

This would lend itself to inlining, as all the writes to hardware can use a common base register within RISC-V's 12-bit store offsets.

A version in C that removes the need to use the convenient USER_writeWord() function:

volatile u_int32_t *hardware_base = (volatile u_int32_t *) 0x30000000;

const u_int32_t *data = (const u_int32_t *) key;  // as per the single_block_encipher() argument
volatile u_int32_t *hardware_reg_key = hardware_base + 0x10;  // word offset 0x10, i.e. byte address 0x30000040
for (int i = 0; i < 8; i++)
    hardware_reg_key[i] = data[i];  // USER_writeWord(data[i], 0x10 + i)

Another option on this platform is to copy the program into the 1536 bytes of DFF RAM at address 0x0. There is a setup cost for doing the copy (during startup), as you'd need to read all of the .text data from flash before execution can begin, so this will not speed up the total main() execution time. But if you split init() and run_test(), you could speed up the run_test() part, for example if you wanted to performance-measure your custom hardware.

Once execution starts from the 0x000 to 0x600 range it should run at cached (single-cycle) speed. With optimization such as the example USER_writeWord rewrite above, it should fit in that space with plenty of room to spare.
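
A minimal sketch of that split, assuming the linker script is extended so run_test() is stored in flash but linked to run inside the 0x0 DFF RAM (the section and symbol names .dffram_text, _dffram_text_start, _dffram_text_end and _dffram_text_load are hypothetical and would need real definitions in the linker script):

#include <stdint.h>
#include <string.h>

/* Hypothetical symbols provided by the linker script. */
extern uint32_t _dffram_text_start, _dffram_text_end, _dffram_text_load;

/* Stored in flash, linked at its DFF RAM run address. */
__attribute__((section(".dffram_text"), noinline))
void run_test(void)
{
    /* timed accesses to the user_project hardware go here */
}

void init(void)
{
    /* One-time copy from the flash load address to the DFF RAM run address;
     * after this, run_test() executes from single-cycle memory. */
    memcpy(&_dffram_text_start, &_dffram_text_load,
           (uintptr_t)&_dffram_text_end - (uintptr_t)&_dffram_text_start);
}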


Another option: maybe you can improve the SPI clock divider? I think it is core_clock/4 by default, so 10 MHz if you are running at 40 MHz, but maybe /2 or /1 is possible.
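
For what it's worth, a sketch of the firmware side of such a change (spiflash_clk_divisor is a purely hypothetical handle; the real CSR name and address have to come from the generated csr.h/defs.h for this SoC, and whether the flash meets timing at /2 or /1 is untested):

#include <stdint.h>

/* Hypothetical pointer to the flash clock divider CSR; bind it to the real
 * register address from the generated headers. */
extern volatile uint32_t *spiflash_clk_divisor;

static void set_flash_clk_div(uint32_t div)
{
    *spiflash_clk_divisor = div;  /* e.g. 4 = core_clock/4 (default), 2 = core_clock/2 */
}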