Networking broken for CheriBSD on Toooba GFE bitfiles

jonwoodruff commented 3 years ago

Toooba GFE bitfiles on the vcu118 are able to boot, perform "dhclient xae0", and then ping a Linux machine, and ssh into that machine. I believe that CheriBSD images from ~2 months ago were able to do the same thing.

Recent CheriBSD images succeed in getting an IP address using "dhclient xae0", but either ping causes a kernel panic. I've previously seen a panic with scp, and I will attempt again with scp and ssh and give the reports later. The ping crash is here:

[PEXPECT\PROMPT]>dhclient xae0 DHCPDISCOVER on xae0 to 255.255.255.255 port 67 interval 4 DHCPOFFER from 192.168.0.1 DHCPREQUEST on xae0 to 255.255.255.255 port 67 DHCPACK from 192.168.0.1

bound to 192.168.0.51 -- renewal in 432000 seconds. [PEXPECT\PROMPT]> [PEXPECT\PROMPT]>ping 192.168.0.111 PING 192.168.0.111 (192.168.0.111): 56 data bytes ra = 0xffffffc0004f2ae6 sp = 0xffffffc0124e2690 [rwxRW,0x0000000000000000-0xffffffffffffffff] gp = 0xffffffc0124e24f0 [rwxRW,0x8000000000000000-0x7fffffffffffffff] (invalid,sealed) tp = 0x0000010433e60c38 [,0x000001011c000000-0x0000010800000000] (invalid) t[0] = 0x8000000000000009 t[1] = 0xffffffc0004ed5f4 [rwxRW,0x0000000000000000-0xffffffffffffffff] t[2] = 0x0000000000006000 t[3] = 0x0000000000000000 t[4] = 0xffffffd00386d080 t[5] = 0x0000000000000000 t[6] = 0xffffffd003d056c0 s[0] = 0xffffffc0124e26b0 s[1] = 0xffffffc003872000 s[2] = 0xffffffc0124e2720 s[3] = 0x0000000000000000 s[4] = 0x0000000000000000 s[5] = 0x0000000000000001 s[6] = 0x0000000000000001 s[7] = 0x0000000000000000 s[8] = 0xffffffd003a0be00 s[9] = 0xffffffd00396ace0 s[10] = 0xffffffd0039b9800 s[11] = 0xffffffd00396acf4 a[0] = 0x0000000000000000 a[1] = 0xffffffc013a00000 a[2] = 0x0000000000201004 a[3] = 0xffffffc013c01004 a[4] = 0xffffffc0004e9380 a[5] = 0x00000000000003e8 a[6] = 0x000000000000000a a[7] = 0x0000000000000433 sepc = 0xffffffc0004e9384 [rwxRW,0x0000000000000000-0xffffffffffffffff] ddc = 0x0000000000000000 [rwxRW,0x0000000000000000-0xffffffffffffffff] sstatus == 0x0000000200004100 stval == 0xffffffc013c01004 panic: Memory access exception at 0xffffffc0004e9384

cpuid = 0 time = 1606046320 KDB: stack backtrace: db_trace_self() at db_trace_self db_fetch_ksymtab() at db_fetch_ksymtab+0x170 kdb_backtrace() at kdb_backtrace+0x2c vpanic() at vpanic+0x146 panic() at panic+0x42 do_trap_supervisor() at do_trap_supervisor+0xdc cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0xe2 --- exception 5, tval = 0xffffffc013c01004 generic_bs_r_4() at generic_bs_r_4+0x4 intr_irq_handler() at intr_irq_handler+0x74 riscv_cpu_intr() at riscv_cpu_intr+0x70 do_trap_supervisor() at do_trap_supervisor+0x70 cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0xe2 --- interrupt 9 spinlock_exit() at spinlock_exit+0x3c mi_switch() at mi_switch+0xfe db_dump_intr_event() at db_dump_intr_event+0x816 fork_exit() at fork_exit+0x60 fork_trampoline() at fork_trampoline+0xa KDB: enter: panic

jonwoodruff commented 3 years ago

Trace for SSH

[PEXPECT\PROMPT]>dhclient xae0 xae0: link state changed to UP DHCPDISCOVER on xae0 to 255.255.255.255 port 67 interval 6 DHCPOFFER from 192.168.0.1 DHCPREQUEST on xae0 to 255.255.255.255 port 67 DHCPACK from 192.168.0.1 bound to 192.168.0.51 -- renewal in 432000 seconds. [PEXPECT\PROMPT]>ssh 192.168.0.111 ra = 0xffffffc0004f2ae6 sp = 0xffffffc0124e2690 [rwxRW,0x0000000000000000-0xffffffffffffffff] gp = 0xffffffffffffffff [,0xfffffffffffffe89-0xffffffffffffff9e] (invalid,sealed) tp = 0xffffffd0039b9600 [rwxRW,0xffffffd0039b96b8-0xffffffd0039bae68] (invalid,sealed) t[0] = 0x8000000000000009 t[1] = 0xffffffc0004ed5f4 [rwxRW,0x0000000000000000-0xffffffffffffffff] t[2] = 0x0000000000006000 t[3] = 0x0000000000000000 t[4] = 0xffffffd00386d080 t[5] = 0x0000000000000000 t[6] = 0xffffffd003d58840 s[0] = 0xffffffc0124e26b0 s[1] = 0xffffffc003872000 s[2] = 0xffffffc0124e2720 s[3] = 0x0000000000000000 s[4] = 0x0000000000000000 s[5] = 0x0000000000000001 s[6] = 0x0000000000000001 s[7] = 0x0000000000000000 s[8] = 0xffffffd003a0be00 s[9] = 0xffffffd00396ace0 s[10] = 0xffffffd0039b9800 s[11] = 0xffffffd00396acf4 a[0] = 0x0000000000000000 a[1] = 0xffffffc013a00000 a[2] = 0x0000000000201004 a[3] = 0xffffffc013c01004 a[4] = 0xffffffc0004e9380 a[5] = 0x00000000000003e8 a[6] = 0x000000000000000a a[7] = 0x00000000000003ee sepc = 0xffffffc0004e9384 [rwxRW,0x0000000000000000-0xffffffffffffffff] ddc = 0x0000000000000000 [rwxRW,0x0000000000000000-0xffffffffffffffff] sstatus == 0x0000000200000100 stval == 0xffffffc013c01004 panic: Memory access exception at 0xffffffc0004e9384

cpuid = 0 time = 1606047900 KDB: stack backtrace: db_trace_self() at db_trace_self db_fetch_ksymtab() at db_fetch_ksymtab+0x170 kdb_backtrace() at kdb_backtrace+0x2c vpanic() at vpanic+0x146 panic() at panic+0x42 do_trap_supervisor() at do_trap_supervisor+0xdc cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0xe2 --- exception 5, tval = 0xffffffc013c01004 generic_bs_r_4() at generic_bs_r_4+0x4 intr_irq_handler() at intr_irq_handler+0x74 riscv_cpu_intr() at riscv_cpu_intr+0x70 do_trap_supervisor() at do_trap_supervisor+0x70 cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0xe2 --- interrupt 9 spinlock_exit() at spinlock_exit+0x3c mi_switch() at mi_switch+0xfe db_dump_intr_event() at db_dump_intr_event+0x816 fork_exit() at fork_exit+0x60 fork_trampoline() at fork_trampoline+0xa KDB: enter: panic

jrtc27 commented 3 years ago

You're getting a load access fault reading from the PLIC (specifically the claim register for this hart). This is likely the first time an external interrupt has fired; have you done anything that might break the PLIC (either in the core or the device tree)?

jonwoodruff commented 3 years ago

@jrtc27 Is there a reason why broken hardware or the device tree would work for riscv64 but not for CheriBSD kernels?

arichardson commented 3 years ago

I seem to also be getting this for a RISC-V non-CHERI kernel:

[PEXPECT\PROMPT]>ifconfig xae0 up
xae0: link state changed to UP
[PEXPECT\PROMPT]>ifconfig
xae0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=80008<VLAN_MTU,LINKSTATE>
    ether 00:0a:35:04:db:5a
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
    options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
    inet 127.0.0.1 netmask 0xff000000
    groups: lo
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
[PEXPECT\PROMPT]>dhclient xae0
DHCPDISCOVER on xae0 to 255.255.255.255 port 67 interval 7
ra = 0xffffffc00053247e
sp = 0xffffffc019b98618
gp = 0x000000000000000b
tp = 0xffffffc0136c4b00
t[0] = 0x8000000000000009
t[1] = 0x0000000000000120
t[2] = 0x0000000000000000
t[3] = 0x0000000000010000
t[4] = 0xffffffd002292e00
t[5] = 0x0000000000000000
t[6] = 0x0000000000000532
s[0] = 0xffffffc019b98638
s[1] = 0xffffffc0026a0000
s[2] = 0xffffffc019b986b8
s[3] = 0xffffffc001b39778
s[4] = 0x0000000000000000
s[5] = 0xffffffc019b989d2
s[6] = 0xffffffd0022b3d00
s[7] = 0x0000000000000000
s[8] = 0x000000000000000e
s[9] = 0x0000000000000000
s[10] = 0x0000000000000000
s[11] = 0x000000000060bf20
a[0] = 0x0000000000000000
a[1] = 0xffffffc012a00000
a[2] = 0x0000000000201004
a[3] = 0xffffffc012c01004
a[4] = 0xffffffc000529b30
a[5] = 0x0000000000000001
a[6] = 0x0000000000000002
a[7] = 0xffffffc001bd1478
sepc = 0xffffffc000529b34
sstatus == 0x8000000200006100
stval == 0xffffffc012c01004
panic: Memory access exception at 0xffffffc000529b34

cpuid = 0
time = 1608126967
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_fetch_ksymtab() at db_fetch_ksymtab+0x170
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x144
panic() at panic+0x42
do_trap_supervisor() at do_trap_supervisor+0xee
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0x68
--- exception 5, tval = 0xffffffc012c01004
generic_bs_r_4() at generic_bs_r_4+0x4
intr_irq_handler() at intr_irq_handler+0x8e
riscv_cpu_intr() at riscv_cpu_intr+0x94
do_trap_supervisor() at do_trap_supervisor+0x70
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0x68
--- interrupt 9
witness_unlock() at witness_unlock+0x234
__mtx_unlock_flags() at __mtx_unlock_flags+0x4a
xdma_queue_submit() at xdma_queue_submit+0x5c
uart_cpu_getdev() at uart_cpu_getdev+0x24b4
uart_cpu_getdev() at uart_cpu_getdev+0x1da0
ether_output_frame() at ether_output_frame+0xc4
ether_output() at ether_output+0x640
bpfdetach() at bpfdetach+0x14d0
dev2udev() at dev2udev+0x298a
user_writev() at user_writev+0x11c
sys_writev() at sys_writev+0x62
do_trap_user() at do_trap_user+0x236
cpu_exception_handler_user() at cpu_exception_handler_user+0x72
--- exception 8, tval = 0
KDB: enter: panic

Kernel:

FreeBSD 13.0-CURRENT #0 r368114+baf17e99fa11-c336125(dev): Wed Dec 16 12:39:25 GMT 2020
    alr48@waltham.cl.cam.ac.uk:/local/scratch/alr48/cheri/build/cheribsd-riscv64-build/local/scratch/alr48/cheri/cheribsd/riscv.riscv64/sys/GFE riscv
clang version 11.0.0 (https://github.com/CTSRD-CHERI/llvm-project 77fcf8f6a601331b226349414e78d8499216e17c)
WARNING: WITNESS option enabled, expect reduced performance.

jrtc27 commented 3 years ago

ra = 0xffffffc0004f2ae6
sp = 0xffffffc01abb6420 [rwxRW,0x0000000000000000-0xffffffffffffffff]
gp = 0xffffffc01abb6280
tp = 0xffffffd0039d9a40
t[0] = 0x8000000000000009
t[1] = 0xffffffc0004ed5f4 [rwxRW,0x0000000000000000-0xffffffffffffffff]
t[2] = 0x0000000000000001
t[3] = 0xffffffd00386d0d0
t[4] = 0xffffffd00386d080
t[5] = 0x0000000000000000
t[6] = 0x0000000000000649
s[0] = 0xffffffc01abb6440
s[1] = 0xffffffc003872000
s[2] = 0xffffffc01abb64b0
s[3] = 0x0000000000000000
s[4] = 0xffffffd0038fe000
s[5] = 0xffffffc01abb6802
s[6] = 0xffffffd003a33c00
s[7] = 0x0000000000000000
s[8] = 0x000000000000000e
s[9] = 0x0000000000000000
s[10] = 0x0000000000000000
s[11] = 0x0000000000000070
a[0] = 0x0000000000000000
a[1] = 0xffffffc013a00000
a[2] = 0x0000000000201004
a[3] = 0xffffffc013c01004
a[4] = 0xffffffc0004e9380
a[5] = 0xffffffc0004e93d0
a[6] = 0xffffffd0039b7a20
a[7] = 0x0000000004000000
sepc = 0xffffffc0004e9384 [rwxRW,0x0000000000000000-0xffffffffffffffff]
ddc = 0x0000000000000000 [rwxRW,0x0000000000000000-0xffffffffffffffff]
sstatus == 0x0000000200000100
stval == 0xffffffc013c01004
panic: Memory access exception at 0xffffffc0004e9384

cpuid = 0
time = 1606046205
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_fetch_ksymtab() at db_fetch_ksymtab+0x170
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x146
panic() at panic+0x42
do_trap_supervisor() at do_trap_supervisor+0xdc
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0xe2
--- exception 5, tval = 0xffffffc013c01004
generic_bs_r_4() at generic_bs_r_4+0x4
intr_irq_handler() at intr_irq_handler+0x74
riscv_cpu_intr() at riscv_cpu_intr+0x70
do_trap_supervisor() at do_trap_supervisor+0x70
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0xe2
--- interrupt 9
uart_cpu_getdev() at uart_cpu_getdev+0x1eaa
ether_output_frame() at ether_output_frame+0xaa
ether_output() at ether_output+0x622
bpfdetach() at bpfdetach+0x13b4
dev2udev() at dev2udev+0x28a0
user_writev() at user_writev+0x120
sys_writev() at sys_writev+0x66
do_trap_user() at do_trap_user+0x19a
cpu_exception_handler_user() at cpu_exception_handler_user+0xde
--- exception 8, tval = 0
KDB: enter: panic

and if I use my GDB scripts to go and translate that stval:

(gdb) vatopa 0xffffffc013c01004
pa 0xc201004 RWAD

which looks fine, that's the claim register in the second context, i.e. hart 0's S-mode claim register (with the first context being hart 0's M-mode context). Something about your bitstream is broken.

jonwoodruff commented 3 years ago

For what it is worth, this kernel works:

FreeBSD 13.0-CURRENT #0 r362121+89083b90c601-c327974(multicompat)-dirty: Thu Sep 3 15:52:35 BST 2020

Something changed in recent kernels to tickle this bug.

jrtc27 commented 3 years ago

Could be the commit that deferred completing to a later point, as previously asynchronous interrupt handlers would cause the complete to happen after scheduling the handler, not after it actually finished. But the PLIC really should not care about that kind of thing.

commit 56fd757880311e4f1c906bac7a83aafbbdf3f222
Merge: 49d26ef9a57d 2c83536931a0
Author:     kp <kp@FreeBSD.org>
AuthorDate: Mon Jul 6 21:29:50 2020 +0000
Commit:     Brooks Davis <brooks@one-eyed-alien.net>
CommitDate: Wed Nov 11 21:51:32 2020 +0000

    riscv plic: Do not complete interrupts until the interrupt handler has run

    We cannot complete the interrupt (i.e. write to the claims/complete register
    until the interrupt handler has actually run. We don't run the interrupt
    handler immediately from intr_isrc_dispatch(), we only schedule it for later
    execution.

    If we immediately complete it (i.e. before the interrupt handler proper has
    run) the interrupt may be triggered again if the interrupt source remains set.
    From RISC-V Instruction Set Manual: Volume II: Priviliged Architecture, 7.4
    Interrupt Gateways:

    "If a level-sensitive interrupt source deasserts the interrupt after the PLIC
    core accepts the request and before the interrupt is serviced, the interrupt
    request remains present in the IP bit of the PLIC core and will be serviced by
    a handler, which will then have to determine that the interrupt device no
    longer requires service."

    In other words, we may receive interrupts twice.

    Avoid that by postponing the completion until after the interrupt handler has
    run.

    If the interrupt is handled by a filter rather than by scheduling an interrupt
    thread we must also complete the interrupt, so set up a post_filter handler
    (which is the same as the post_ithread handler).

    Reviewed by:    mhorne
    Sponsored by:   Axiado
    Differential Revision:  https://reviews.freebsd.org/D25531

(that's it being merged from FreeBSD into CheriBSD)

jrtc27 commented 3 years ago

Hm, except this is a load not a store and it's inside plic_intr so it's doing the claim not the complete, which hasn't changed.

jrtc27 commented 2 years ago

This was fixed by https://github.com/CTSRD-CHERI/Toooba/commit/ac6244181a531b86a8f5f54e5b98ac85e97d64f1 (and its upstreamed equivalent)

CTSRD-CHERI / cheribsd

Networking broken for CheriBSD on Toooba GFE bitfiles #863