Write Stalls in Cheri vs Mainline Piccolo

abukharmeh commented 3 years ago

Hi,

I am evaluating the ability of the CHERI Piccolo core to run unmodified RISC-V code, and noticed that CHERI Piccolo takes 6 cycles to perform any type of store instruction (sb, sw, ...), while the mainline Piccolo core takes 2 cycles.

During these 4 added cycles, pipeline stage one appears to be busy, as reported in the verbose log (e.g. running with -v2) for these cycles: Output_Stage1 BUSY (fetch BUSY)

I am wondering whether this is known, and what the root cause of these added stalls is?

Kind regards, Ibrahim.

abukharmeh commented 3 years ago

I had a bit of time to look into this, and it seems related to the difference between the cache implementation in CHERI Piccolo and the one in mainline Piccolo.

PeterRugg commented 3 years ago

Sorry I missed this until now. This looks very interesting: I'll hopefully have time to look into it tomorrow. Thanks for bringing it to our attention!

PeterRugg commented 3 years ago

Hmm, I've taken a look, and couldn't seem to reproduce this. Is there any chance you could give more info, including the code that triggered it, the configuration you built (RV32ACIMUxCHERI?), the addresses accessed, etc.? Stage1 stalling on particular types of instructions is strange because Stage1 should only stall waiting on instruction fetch, so it shouldn't really mind what the instruction types are. I'm keen to find out more about this!

abukharmeh commented 3 years ago

Hi, yes, I am using RV32ACIMUxCHERI, but I also just tried it on RV32CIMUxCHERI and am seeing the same results.

It looks like the first couple (9-10) of writes happen in 2 cycles, but after that they start taking 6 cycles each. Following is a picture of a sequence that would trigger this!

But the writes are repeated a couple of hundred times...

I looked at the cache implementation and it looks like it's write-around in both cases, so that should not result in any change! I don't think this is writing to any specific location outside the normal memory, as the SoC map reports!

The cache_dw_valid$whas signal appears to be getting delayed after the first couple of writes, if that helps. I tried to track it further, and there is a chain of 2FIFOs that are getting filled.

It looks like the same thing is happening with just any normal program, but this just shows the sequence that triggers it. At the moment I am trying to understand how the tag controller works, and whether it could potentially affect the performance, even for non-CHERI programs!

abukharmeh commented 3 years ago

PS: Also, how do you work on the development of Piccolo? Is there any tool that I am missing? Do you use BDW by any chance? I am using just BSC and traditional Verilog sims to debug this, but it's a pain because of the transpiled code: everything loses its structure, and noinline won't always work on all modules due to BSV restrictions!

abukharmeh commented 3 years ago

Although it looks like these first 9-10 writes happen in 2 cycles, matching mainline, the AW and W FIFOs in AXI are filled in the first two writes, and it then takes a bit of time for the stalls to propagate through the FIFOs and the bus network before the effect becomes observable!

jrtc27 commented 3 years ago

> Hi, yes, I am using RV32ACIMUxCHERI, but I also just tried it on RV32CIMUxCHERI and am seeing the same results.
>
> It looks like the first couple (9-10) of writes happen in 2 cycles, but after that they start taking 6 cycles each. Following is a picture of a sequence that would trigger this!
>
> But the writes are repeated a couple of hundred times...
>
> I looked at the cache implementation and it looks like it's write-around in both cases, so that should not result in any change! I don't think this is writing to any specific location outside the normal memory, as the SoC map reports!
>
> The cache_dw_valid$whas signal appears to be getting delayed after the first couple of writes, if that helps. I tried to track it further, and there is a chain of 2FIFOs that are getting filled.
>
> It looks like the same thing is happening with just any normal program, but this just shows the sequence that triggers it. At the moment I am trying to understand how the tag controller works, and whether it could potentially affect the performance, even for non-CHERI programs!

I've managed to reproduce this locally and believe it to be an artefact of the simulation testbench.

In the testbench there is a deburster sitting in front of the memory controller model, as the memory controller does not handle burst transactions. Bluespec's own deburster, as used in mainline, is able to have multiple outstanding requests (and treats read and write channels independently, though that's not relevant here), so everything is nicely pipelined.

However, our own deburster is simpler and only supports a single outstanding request. That's not obvious from reading it: at first glance it appears to have FIFOs for state, but it's implicitly throttled by the use of a mkSerialiser, which doesn't go to IDLE and accept a new request until the old one has completed. By my count it takes 6 cycles for a request to go through the deburster, reach memory and come back (1 to move to the output AXI master FIFO of the deburster, 1 to move to the input AXI slave FIFO of the memory controller, 1 to move to the memory controller's internal request FIFO, 1 to process the request and put a response in its output AXI slave FIFO, 1 to move to the deburster's input AXI slave FIFO, 1 to get processed). You can see the slow rate at which the back of the tag controller sees memory responses come back with the +tagcontroller plusarg (which turns on a bit of other debug output too). So Piccolo and the tag controller are fine, just thwarted by a slower testbench SoC that quickly asserts backpressure; note that neither deburster is used when synthesising for FPGA.
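As a rough illustration of that throttling (a toy Python model, not the actual testbench or BlueStuff code), the sketch below has a core that tries to issue a store every 2 cycles into a buffer that drains into a deburster with a 6-cycle round trip. The function name, the `fifo_depth` of 8 and the way the intermediate FIFOs are lumped into one queue are all assumptions made for illustration; only the 2-cycle and 6-cycle figures come from the discussion above.

```python
from collections import deque

def simulate(num_stores, round_trip=6, max_outstanding=1,
             issue_interval=2, fifo_depth=8):
    """Return the cycle at which the core manages to issue each store.

    Hypothetical model: `fifo_depth` lumps together all buffering between
    the core and the deburster; `max_outstanding` is how many requests the
    deburster keeps in flight at once.
    """
    queue = deque()      # stores buffered between the core and the deburster
    in_flight = deque()  # completion cycles of requests inside the deburster
    issued = []
    next_issue = 0
    cycle = 0
    while len(issued) < num_stores:
        # Retire requests whose response has come back.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
        # The deburster takes a new request only if it has capacity.
        if queue and len(in_flight) < max_outstanding:
            queue.popleft()
            in_flight.append(cycle + round_trip)
        # The core issues a store when it is ready and the buffer has room.
        if cycle >= next_issue and len(queue) < fifo_depth:
            queue.append(cycle)
            issued.append(cycle)
            next_issue = cycle + issue_interval
        cycle += 1
    return issued

def gaps(cycles):
    """Cycle distance between consecutive issued stores."""
    return [b - a for a, b in zip(cycles, cycles[1:])]

if __name__ == "__main__":
    print("single outstanding:  ", gaps(simulate(20, max_outstanding=1)))
    print("multiple outstanding:", gaps(simulate(20, max_outstanding=8)))
```

With a single outstanding request, the first stores go out every 2 cycles while the buffer fills, and the gap then settles at the round-trip latency. With several requests in flight, the gap stays at 2 throughout. How many stores go fast depends entirely on the assumed buffer depth, so the model is not expected to match the 9-10 seen in the real simulation.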

Doing any kind of performance analysis in simulation is pretty meaningless anyway, since the memory "controller" and "model" are rather crude; it's basically just a single-line writeback cache (that acts as a way to adapt from the 64-bit AXI bus to the 256-bit memory width) stuck in front of a big bank of 256-bit registers. Even on FPGA the mismatch between soft-core clock speeds and DRAM latency means that performance is still a bit dodgy, but at least it's real memory components.
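To make the shape of that memory model concrete, here is a minimal Python sketch of a single-line writeback buffer that adapts 64-bit accesses to 256-bit storage. It is only a reading of the description above, not the testbench source: the class and method names are hypothetical, addresses are assumed to be 8-byte aligned byte addresses, and tags and timing are not modelled.

```python
class LineBufferedMemory:
    """Toy model: one 256-bit line buffer in front of 256-bit storage words."""

    LINE_BITS = 256
    BUS_BITS = 64
    WORDS_PER_LINE = LINE_BITS // BUS_BITS  # four 64-bit words per line
    WORD_MASK = (1 << BUS_BITS) - 1

    def __init__(self, num_lines):
        self.lines = [0] * num_lines  # backing store, one int per 256-bit line
        self.cached_index = None      # which line the buffer currently holds
        self.cached_line = 0
        self.dirty = False
        self.writebacks = 0           # counters, purely for observation
        self.fills = 0

    def _select(self, line_index):
        """Make the buffer hold `line_index`, writing back the old line if dirty."""
        if self.cached_index == line_index:
            return
        if self.dirty:
            self.lines[self.cached_index] = self.cached_line
            self.writebacks += 1
        self.cached_index = line_index
        self.cached_line = self.lines[line_index]
        self.dirty = False
        self.fills += 1

    def write64(self, addr, value):
        """Write a 64-bit value at an 8-byte-aligned byte address."""
        line_index, word = divmod(addr // 8, self.WORDS_PER_LINE)
        self._select(line_index)
        shift = word * self.BUS_BITS
        mask = self.WORD_MASK << shift
        self.cached_line = (self.cached_line & ~mask) | ((value & self.WORD_MASK) << shift)
        self.dirty = True

    def read64(self, addr):
        """Read the 64-bit value at an 8-byte-aligned byte address."""
        line_index, word = divmod(addr // 8, self.WORDS_PER_LINE)
        self._select(line_index)
        return (self.cached_line >> (word * self.BUS_BITS)) & self.WORD_MASK
```

In this sketch, consecutive accesses to the same 256-bit line touch only the buffer, while moving to a different line triggers a writeback (if dirty) and a fill.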

jrtc27 commented 3 years ago

> PS: Also, how do you work on the development of Piccolo? Is there any tool that I am missing? Do you use BDW by any chance? I am using just BSC and traditional Verilog sims to debug this, but it's a pain because of the transpiled code: everything loses its structure, and noinline won't always work on all modules due to BSV restrictions!

We tend to use tactfully-placed prints (and have shims we can insert for things like AXI that log every transfer); there are various debug prints scattered throughout the source, some from us, some from Bluespec, generally guarded behind various verbosity settings. Trying to debug at the Verilog level does indeed tend to suck when you want to look at internal signals, it only really works when debugging signals that cross synthesis boundaries (i.e. inputs or outputs to modules marked with (* synthesize *)), or, if debugging on FPGA, when inserting your own ILA probes. Printing also has the advantage that you can use things like fshow to quickly dump out entire structs rather than having to grab all the fields manually (or, worse, reconstructing them from a bitvector). We also generally prefer using Bluesim over Verilator given all our debugging is done at the BSV level rather than needing the compiled Verilog.

abukharmeh commented 3 years ago

> However, our own deburster is simpler and only supports a single outstanding request

Hi Jessica, thank you very much for tracking this down. Would you please elaborate on why the deburster was changed from the mainline one?

Thanks, Ibrahim.

jrtc27 commented 3 years ago

> > However, our own deburster is simpler and only supports a single outstanding request
>
> Hi Jessica, thank you very much for tracking this down. Would you please elaborate on why the deburster was changed from the mainline one?
>
> Thanks, Ibrahim.

Bluespec's AXI library in Piccolo/Flute/Toooba enforces that all the xUSER fields have the same width (and then their Piccolo sets it to 0). We use the RUSER and WUSER fields to carry tags, but have no use for the ARUSER, AWUSER and BUSER fields, so we replaced their components with our own AXI library (https://github.com/CTSRD-CHERI/BlueStuff/tree/master/AXI), which has them individually controllable and includes our own deburster.

abukharmeh commented 3 years ago

Regarding Bluesim: when we started working with Piccolo we tried it, but it looked like it was not recording all signals when exporting to VCD. Do you know if there is a parameter or an argument that specifies the depth of VCD logging in Bluesim?

jrtc27 commented 3 years ago

Never tried it. But a quick search turns up https://github.com/B-Lang-org/bsc/issues/236 which sounds like it could be what you're seeing.

CTSRD-CHERI / Piccolo

Write Stalls in Cheri vs Mainline Piccolo #18