eclipse / kuksa.val


databroker crashing on startup on RISC-V #735

Closed g-scott-murray closed 6 months ago

g-scott-murray commented 7 months ago

Since upgrading AGL to Rust 1.70, we're seeing the databroker coredump shortly after startup on RISC-V platforms, e.g.:

[   60.200585] do_trap: 14 callbacks suppressed
[   60.201494] tokio-runtime-w[294]: unhandled signal 11 code 0x1 at 0x0000000000000000 in databroker[2ac6dc8000+42e000]
[   60.204470] CPU: 1 PID: 294 Comm: tokio-runtime-w Not tainted 5.15.124-yocto-standard #1
[   60.205157] Hardware name: riscv-virtio,qemu (DT)
[   60.205595] epc : 0000002ac708aa52 ra : 0000002ac7088bf6 sp : 0000003f81539280
[   60.206199]  gp : 0000002ac72519f8 tp : 0000003f8153a8e0 t0 : ac7b53ea80000000
[   60.231919]  t1 : 0000000000000403 t2 : 00000000001866d6 s0 : 0000000000000000
[   60.232447]  s1 : 0000000000000000 a0 : 0000000000000000 a1 : 0000002ac7263a60
[   60.232940]  a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000
[   60.233475]  a5 : 0000000000000000 a6 : 0000000000000400 a7 : 0000000000000000
[   60.234218]  s2 : 0000003f815393f8 s3 : 0000003f815393f0 s4 : 0000003f815393d0
[   60.234751]  s5 : 0000002ac7263090 s6 : 0000002ac72630d0 s7 : 0000000000000003
[   60.235285]  s8 : 0000000000000001 s9 : 0000002ac7264320 s10: 0000002ac7263a40
[   60.235843]  s11: 0000000000000001 t3 : 403d000000000000 t4 : 000a6d3b0c000000
[   60.236393]  t5 : ffffffffffffffff t6 : 00000000246148bb
[   60.236773] status: 0000000000004020 badaddr: 0000000000000000 cause: 000000000000000d
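
The trap message above carries enough to narrow down the faulting instruction even without a backtrace. The following is a minimal sketch, not part of the original report, that turns the `epc` and `ra` values into offsets relative to the `databroker` mapping printed by the kernel and decodes the `cause` field. The helper names (`mapping`, `offset_in_mapping`) are hypothetical, and the exception-code table is a small subset taken from the RISC-V privileged specification.

```python
import re

# Trap line and register values copied from the kernel log above.
TRAP_LINE = ("tokio-runtime-w[294]: unhandled signal 11 code 0x1 "
             "at 0x0000000000000000 in databroker[2ac6dc8000+42e000]")
EPC = 0x0000002AC708AA52    # program counter at the time of the fault
RA = 0x0000002AC7088BF6     # return address register
CAUSE = 0x0D                # value of the "cause" field

# Subset of synchronous exception codes (RISC-V privileged spec).
SCAUSE = {
    12: "instruction page fault",
    13: "load page fault",
    15: "store/AMO page fault",
}


def mapping(line):
    """Return (start, size) of the mapping named in a trap line (hypothetical helper)."""
    m = re.search(r"\[([0-9a-f]+)\+([0-9a-f]+)\]\s*$", line)
    if m is None:
        raise ValueError("no '[start+size]' mapping found in line")
    return int(m.group(1), 16), int(m.group(2), 16)


def offset_in_mapping(addr, start, size):
    """Return addr relative to the mapping start, checking that it lies inside it."""
    if not start <= addr < start + size:
        raise ValueError(f"{addr:#x} is outside the mapping")
    return addr - start


start, size = mapping(TRAP_LINE)
epc_off = offset_in_mapping(EPC, start, size)
ra_off = offset_in_mapping(RA, start, size)

print(f"epc offset: {epc_off:#x}")
print(f"ra offset:  {ra_off:#x}")
print(f"cause {CAUSE:#x}: {SCAUSE.get(CAUSE, 'other')}")
```

For this log the sketch reports offsets 0x2c2a52 (`epc`) and 0x2c0bf6 (`ra`), and cause 0xd decodes to a load page fault, which together with `badaddr` 0 is consistent with a read through a null pointer. These offsets are relative to the start of the mapping the kernel printed. They equal addresses in the ELF image only under the assumption that this mapping begins at the image's load base, so they may need adjusting by the segment's file offset before being passed to a tool such as `addr2line -e` against an unstripped copy of the same release binary.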

When I do a debug build to try to get a backtrace, it starts working. I tried disabling all clients, disabling TLS, and disabling JWT authorization, and the issue still happens. Setting RUST_LOG=debug, I do not get any messages after the expected startup stuff, i.e.:

...
Feb 13 20:19:32 qemuriscv64 databroker[425]: 2024-02-13T20:19:31.184249Z  WARN databroker: Authorization is not enabled.
Feb 13 20:19:32 qemuriscv64 databroker[425]: 2024-02-13T20:19:31.185514Z  INFO databroker::broker: Starting housekeeping task
Feb 13 20:19:32 qemuriscv64 databroker[425]: 2024-02-13T20:19:31.186361Z  INFO databroker::grpc::server: Listening on 127.0.0.1:55555
Feb 13 20:19:32 qemuriscv64 databroker[425]: 2024-02-13T20:19:31.186695Z  INFO databroker::grpc::server: TLS is not enabled
Feb 13 20:19:32 qemuriscv64 databroker[425]: 2024-02-13T20:19:31.186890Z  INFO databroker::grpc::server: Authorization is not enabled.
[  358.726304] tokio-runtime-w[428]: unhandled signal 11 code 0x1 at 0x0000000000000000 in databroker[2ab2faf000+42e000]
[  358.727901] CPU: 1 PID: 428 Comm: tokio-runtime-w Not tainted 5.15.124-yocto-standard #1
[  358.728559] Hardware name: riscv-virtio,qemu (DT)
...

The crash happens 20-30s after starting on my qemuriscv64 setup, but I'm not sure whether that points to a code path triggered by a timeout or is an artifact of the QEMU emulation. I'll experiment tomorrow with bumping past the 0.4.2 tag commit to pick up the dependency update that was done, but I am opening this issue now at Sebastian's request just in case others are seeing this.

g-scott-murray commented 6 months ago

I can confirm that this seems to have gone away with newer Rust (1.75). Do you want to close, or is there interest in trying to pin down which version fixed it? I'm in the midst of getting ready for Embedded World demos, so that might take a while if desired.

SebastianSchildt commented 6 months ago

Thanks for reporting back.

Rust is moving fast, and I don't think we gain much by trying to pin down the exact issue.

I will close for now; let's just "remember" this. If somebody feels differently or has additional comments, feel free to reopen.