SpinalHDL / VexRiscv

An FPGA-friendly 32-bit RISC-V CPU implementation
MIT License

WFI: System can't be reset via DebugPlugin #98

Open xobs opened 4 years ago

xobs commented 4 years ago

If you're using the DebugPlugin to debug a core, and the core issues a WFI with no interrupts enabled, then that core will freeze and be unrecoverable via the DebugPlugin.

The only solution is to power-cycle the system.
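
A minimal sketch of the trigger (the full test case appears later in this thread):

    csrw sie, zero   # disable all supervisor interrupt sources
    wfi              # nothing can wake the core; before the fix, even a
                     # DebugPlugin halt request couldn't bring it back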

Dolu1990 commented 4 years ago

Ahh sure, that's right. When a DebugPlugin halt happens, the WFI should be woken up; I think that should be enough to fix the issue. I will fix that tonight, then can you test again?

xobs commented 4 years ago

Yes, that makes sense.

By the way, this was discovered when we were porting Linux to the Hackaday conference badge. For some reason, an atomic instruction was taking two cycles to execute (i.e. stepi halted twice with the same PC), didn't call the trap handler, and was effectively a NOP. This led the kernel startup to think it was CPU core #30000000 and wait to be woken up.

Are atomics implemented, or is there possibly another interaction between the DebugPlugin and invalid instructions? Why would it execute the same PC twice?


Dolu1990 commented 4 years ago

Weird. What was the CPU configuration? Did it happen when you were doing step-by-step?

xobs commented 4 years ago

Same configuration as last time. It's using linuxFull() as the CSR config.

xobs commented 4 years ago

https://github.com/xobs/VexRiscv-verilog/blob/master/src/main/scala/vexriscv/GenHaD.scala

Dolu1990 commented 4 years ago

So:

an atomic instruction was taking two cycles to execute

Basically, when the D$ takes a miss, it refills itself and reschedules the instruction, but in debug mode that rescheduling isn't possible. So when you step over a memory operation that loads something, you have no guarantee that the step will be successful. This is expected behaviour, not a bug/issue: the first, dummy step behaves as a line refill + NOP, and the second step does the real thing.

If you have the MMU enabled, it can be even worse: MMU miss + cache miss => 2 NOPs before a successful LD.
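
In step-by-step terms, the behaviour described above would look something like this (a sketch, not captured output):

    lw a0, 0(a1)   # stepi #1: D$ miss -> line refill, instruction retires as a NOP, $pc unchanged
                   # stepi #2: D$ hit  -> the load actually executes, $pc advances
                   # (with the MMU enabled, a TLB miss adds one more NOP step before the real load)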

As far as I know, when using GDB and doing a step, if the PC didn't move, it redoes the step automatically.

Anyway, step-by-step on atomic instructions was never tested; the best would probably be to run a simulation and capture the waveform to figure out what is going wrong.

xobs commented 4 years ago

GDB will read $pc in order to know where it is. That way it can do things like trace the interrupt handler when an interrupt occurs. With Vex, it seems that the stepIt bit will simply not advance $pc when it stalls like that, which makes sense.

Does Vex currently support atomic instructions? Particularly amoadd.w a3, a2, (a3)? Or is that supposed to be patched and handled by trap_handler?

Dolu1990 commented 4 years ago

Does Vex currently support atomic instructions?

Yes, the D$ supports all of them (https://github.com/xobs/VexRiscv-verilog/blob/master/src/main/scala/vexriscv/GenHaD.scala#L93). What you can do to get a better diagnostic is to reduce the interaction between the debugger and the hardware execution by using breakpoints instead of stepping.

Breakpoints use the normal pipeline behaviour.
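
For example, rather than stepi-ing through an AMO, one could break just past it and continue (a hypothetical session; the address is made up):

    (gdb) break *0x10000028    # the instruction after the amoadd.w
    (gdb) continue             # the AMO executes with normal pipeline behaviour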

mithro commented 4 years ago

FYI -- @pgielda @mwelling @pdp7

Dolu1990 commented 4 years ago

I updated the VexRiscv dev branch to wake up the pipeline when the debugger asks to halt the CPU: https://github.com/SpinalHDL/VexRiscv/commit/7ae218704e0e1849395e9bd64f58f34ce391b88d

@xobs Can you give it a test?

Dolu1990 commented 4 years ago

Just tested the WFI; that all seems fine now.

Now I'm checking atomics.

Dolu1990 commented 4 years ago

@xobs

Tested the atomic stuff. I got that double-step execution, but as said before, that's to be expected. The execution (all in step-by-step) was fine.

.global _start
_start:

    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    la a3, memA
    li a2, 1
    amoadd.w a3, a2, (a3)   # stepped twice in debug: line refill + NOP, then the real AMO
    nop
    nop
    nop
    la a3, memA
    li a2, 1
    amoadd.w a3, a2, (a3)
    nop
    nop
    nop
    nop

memA:
.word 0x10
memB:
.word 0x20

xobs commented 4 years ago

Sorry for the noise -- the other issue was completely unrelated and appears to be a bug in Linux where it assumes memory is cleared before it clears the BSS.

Specifically:

https://github.com/sifive/riscv-linux/blob/master/arch/riscv/kernel/head.S#L41-L45

This uses atomics to increment a value, reading back the old value. If the value read back is 0, then that core considers itself to have "won the lottery" and boots as normal. If it's nonzero, then it assumes that some other core won the lottery and issues a WFI.

But if memory isn't zeroed out prior to running that code, then unless you have ~4 billion cores it's not guaranteed that any core will read a 0.
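
The linked lottery looks approximately like this (paraphrased; see the link above for the exact code):

    la a3, hart_lottery        # shared counter, assumed to start at 0
    li a2, 1
    amoadd.w a3, a2, (a3)      # a3 = old value; counter is atomically incremented
    bnez a3, .Lsecondary_start # nonzero old value -> lost the lottery, go park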

mithro commented 4 years ago

@xobs I'm guessing the clear should be sent upstream to RISC-V Linux?

xobs commented 4 years ago

Testing by having block RAM at address 0x10000000 and loading the following binary. Commands reproduced here so I can remember them in the future:

wfitest.S:

.global _start
_start:
    csrw sie, zero
    ebreak
    li t0, 0
    addi t0, t0, 1
    addi t0, t0, 1
    addi t0, t0, 1
    wfi
    addi t0, t0, 1
    addi t0, t0, 1
    addi t0, t0, 1
    ebreak
iloop:
    j iloop

$ riscv64-unknown-elf-gcc -ggdb3 "-Wl,--section-start=.text=0x10000000" -march=rv32ima -mabi=ilp32 wfitest.S -o wfitest.elf -nostdlib -nostdinc
$ riscv64-unknown-elf-objcopy -O binary wfitest.elf wfitest.bin

Without 7ae218704e0e1849395e9bd64f58f34ce391b88d, the wfi locks up the CPU as described in the bug report. With 7ae218704e0e1849395e9bd64f58f34ce391b88d, it behaves like a NOP, which is as expected.

xobs commented 4 years ago

@xobs I'm guessing the clear should be sent upstream to RISC-V Linux?

I'm not sure how you'd do that. They make the assumption that memory is zeroed, which is how the whole lottery works. The right way might be to place hart_lottery at a known offset so the bootloader knows to zero it. Or to require that memory is zeroed. Or to use some other method for determining which core you're on.
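
A sketch of that last option, assuming machine mode so the mhartid CSR is readable (label names made up):

    csrr a0, mhartid   # read this hart's ID
    bnez a0, park      # only hart 0 continues to boot
    # ... hart 0 boot path ...
park:
    wfi                # secondary harts wait here
    j park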

A bug should definitely be logged; I'm just not sure how to do that, and I haven't had good experiences getting things upstream into Linux.