chipsalliance / Cores-VeeR-EH1

VeeR EH1 core
Apache License 2.0

Speculative load observed on LSU AXI #78

Closed pieter3d closed 3 years ago

pieter3d commented 3 years ago

I'm playing with a program on the EH1 that ends up issuing a speculative data load, despite the PRM saying that EH1 does not do this.

Code snippet:

# Function finishing up
lw s11, 12(sp)
lw s10, 16(sp)
# ... snip ...
lw s0, 56(sp)
lw ra, 60(sp)
addi sp, sp, 64
ret
# Some other function
lw a4, 4(a1)

I see the load after the ret go out on the bus. As expected, it doesn't show up in the trace, since it isn't supposed to execute, and a4 is correctly left untouched. The issue is that the load happens on AXI, which is slow and causes a sizable bubble.

agrobman commented 3 years ago

Could you please provide more details:
1) What platform are you running? (Provide a description of the environment where the EH1 is installed. What is the EH1 configuration?)
2) A snapshot of the waves, if simulated.
3) Register values before the snippet: mrac, mfdc values, etc.
4) It would be nice if the problem could be shown on the testbench provided in this repo, plus the test snippet source.

pieter3d commented 3 years ago

1 - Linux, running vcs simulator. EH1 in a relatively small design, bare metal code (no RTOS). Cache disabled, DCCM and ICCM only (both 32kB). IFU AXI interface tied off, code is loaded in through DMA port, then core is reset for boot.
2 - Will work on getting this.
3 - mrac = 0xaaaaaaaa (all regions marked as having side effect). mfdc = 0x8. Branch predict disabled. Note that enabling it has the same behavior.
4 - Will see if I can get this put together.

In general though - is this something that I should expect? Or would you agree something fishy is going on?
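For readers following along, the mrac value quoted above can be decoded with a short sketch. It assumes the layout of two bits per 256 MB region (even bit = cacheable, odd bit = side effect, 16 regions in total); that layout is an assumption here and should be checked against the PRM for the core version in use. The function names are hypothetical.

```python
# Sketch only: decode an mrac value under an ASSUMED layout of
# 16 regions x 2 bits (bit 2n = cacheable, bit 2n+1 = side effect).
REGION_COUNT = 16


def decode_mrac(mrac):
    """Return a list of (cacheable, side_effect) tuples, one per region."""
    regions = []
    for n in range(REGION_COUNT):
        cacheable = bool((mrac >> (2 * n)) & 1)
        side_effect = bool((mrac >> (2 * n + 1)) & 1)
        regions.append((cacheable, side_effect))
    return regions


def region_of(addr):
    """Region index of a 32-bit address: its top four bits (256 MB regions)."""
    return (addr >> 28) & 0xF


if __name__ == "__main__":
    for n, (cacheable, side_effect) in enumerate(decode_mrac(0xAAAAAAAA)):
        print(f"region {n:2d}: cacheable={cacheable} side_effect={side_effect}")
```

Under that assumption, 0xaaaaaaaa sets the side-effect bit and clears the cacheable bit in every region, so any address the bogus load computes falls in a side-effect region.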

pieter3d commented 3 years ago

Here's the wave dump. You can see the large stall caused by the AXI load, even though that's a speculative load that doesn't even show up in the trace. snip.vcd.gz

agrobman commented 3 years ago

This may be an effect of the branch predictor, but we need to take a closer look. At the least, we may need to change the PRM wording.

pieter3d commented 3 years ago

The AXI read is not from the IFU, it's from the LSU. Also this happens with branch prediction disabled too.

It reads with address = a1 + 4, where a1 just has some unrelated data value in it, thus a bogus address. The corresponding memory region is marked as having side effects, so I would have expected that kind of prefetch to be suppressed. If this is expected behavior, it seems to me that you could always get arbitrary and unpredictable reads based on whatever code is running, which could potentially cause lots of problems.

aprnath commented 3 years ago

Hi pieter3d, please indicate which version of SweRV you are using. In version 1.8, there was a bug fix for a speculative load to a region marked non-idempotent. See the release notes for version 1.8.

pieter3d commented 3 years ago

I am using version 1.6. I tried with version 1.8, and indeed this load no longer shows up. Thanks for the pointer!

agrobman commented 3 years ago

The CPU always tries to prefetch code. Without the branch predictor it prefetches forward, placing instructions into the decode stage. Once the CPU executes a taken branch (in a later execution stage), it discards/flushes all following instructions and their effects.
The decode stage may send load instructions to the LSU for execution, and these may eventually go to the bus speculatively. 1.8 fixes this problem for side-effect loads, but 'normal' loads may still go to the bus speculatively with somewhat random addresses. If some memory locations should be protected from CPU accesses, the EH1 'MPU' feature can be used. (DATA_ACCESS_ENABLE, DATA_ACCESS_ADDR, DATA_ACCESS_MASK*)
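As a rough illustration of how such address windows filter LSU accesses, here is a minimal sketch. It assumes each window is an (enable, base address, mask) triple in which mask bits set to 1 are "don't care", and that with no window enabled the whole address space is accessible. Both points are assumptions to verify against the PRM; the function names and the example window values are hypothetical.

```python
# Sketch only: model of compile-time data access windows.
# ASSUMPTIONS: mask bits set to 1 are ignored in the comparison, and
# having no enabled window means no restriction at all.

def in_window(addr, base, mask):
    """True if addr matches base on every bit not covered by mask."""
    return (addr | mask) == (base | mask)


def access_allowed(addr, windows):
    """windows is a list of (enable, base, mask) tuples."""
    enabled = [(base, mask) for (enable, base, mask) in windows if enable]
    if not enabled:
        return True  # assumed default: everything is accessible
    return any(in_window(addr, base, mask) for base, mask in enabled)


if __name__ == "__main__":
    # Hypothetical single 32 kB window; the base address is made up.
    windows = [(True, 0xF0040000, 0x7FFF)]
    print(access_allowed(0xF0040010, windows))  # inside the window
    print(access_allowed(0x12345678, windows))  # bogus speculative address
```

In this model, a speculative load whose address falls outside every enabled window would be rejected before it reaches the bus, which is the protection described above.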

It seems we need to change the PRM wording about LSU speculation.

pieter3d commented 3 years ago

The MPU features are compile-time static though, right? So they couldn't be used to work around the LSU issue in pre-1.8 silicon, for example.

agrobman commented 3 years ago

Yes, these are compile-time defines. They could solve this problem if you are not using the LSU bus, that is, if the system does not have external data memory. They also help if you have external 'normal' data memory and you don't want these speculative accesses to go to some other locations (for example, if those locations won't provide a response). The mrac settings somewhat solve your problem, but they have pretty coarse granularity, and they will make your external memory (if you have some) relatively slow.