CTSRD-CHERI / cheribsd

FreeBSD adapted for CHERI-RISC-V and Arm Morello.
http://cheribsd.org

Intermittent "limit"-induced kernel panics #273

Open nwf opened 6 years ago

nwf commented 6 years ago

Occasionally I see something like this on my CheriBSD fork. I don't think any of my changes are relevant, so for the moment I will treat this as an upstream issue.

(limits -c 0 /root/tests/cheri/bin/basic-00090-large.exe  ; echo $? > basic-00090-large.out) || true
panic: vm_fault_hold: fault on nofault entry, addr: 0xc000000004ddb000
time = 1533142345
KDB: enter: panic
[ thread pid 3107 tid 100048 ]
Stopped at      kdb_enter+0x94: break   0
db> bt
Tracing pid 1067 tid 100048 td 0x9800000002066640
kdb_enter+0x94 (?,?,?,?) ra ffffffff803e3544 sp c0000000003ae980 sz 32
vpanic+0x1c4 (?,?,?,?) ra ffffffff803e35f4 sp c0000000003ae9a0 sz 64
panic+0x34 (?,?,?,?) ra 0 sp c0000000003ae9e0 sz 80

Running alltrace at the db> prompt reveals that it is the limits thread whose stack ends in panic, and that the test itself has apparently already exited. I'm not sure what to make of this. What other information would be useful?

arichardson commented 6 years ago

I am seeing exactly the same panic when running the libc++ test suite, after a few tests have run.

panic: vm_fault_hold: fault on nofault entry, addr: 0xc00000001090f000
time = 1533308008
KDB: enter: panic
[ thread pid 629 tid 100050 ]
Stopped at      kdb_enter+0x94: break   0
db> bt
Tracing pid 629 tid 100050 td 0x9800000004a65640
kdb_enter+0x94 (?,?,?,?) ra ffffffff803e33c4 sp c0000000002ee980 sz 32
vpanic+0x1c4 (?,?,?,?) ra ffffffff803e3474 sp c0000000002ee9a0 sz 64
panic+0x34 (?,?,?,?) ra 0 sp c0000000002ee9e0 sz 80
db>

bsdjhb commented 6 years ago

If we could fix clang to save RA when calling a [[noreturn]] function, the stack trace would get past panic() and into the actual issue. :( Alternatively, you could add a call to kdb_backtrace() before this explicit panic() call in the source to help debug. (BTW, I view this LLVM "feature" of not saving RA when calling a [[noreturn]] function as a silly optimization, since it breaks things like assert(), whose primary usefulness is letting you use a debugger to get back to the failure.)
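As a rough userspace illustration of the "print a backtrace before the noreturn call" idea, here is a minimal sketch. It is not the kernel change itself: it assumes the `backtrace()` and `backtrace_symbols_fd()` calls from `<execinfo.h>` (available in glibc, and via libexecinfo on FreeBSD) as a stand-in for kdb_backtrace(), and `abort()` as a stand-in for panic(). The names `capture_frames` and `fail_with_trace` are hypothetical.

```c
#include <assert.h>
#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>     /* STDERR_FILENO */

#define MAX_FRAMES 32

/*
 * Hypothetical helper: record up to `max` return addresses from the
 * current call stack and return how many were captured.
 */
int
capture_frames(void **frames, int max)
{
	assert(frames != NULL && max > 0);
	return backtrace(frames, max);
}

/*
 * Hypothetical userspace stand-in for "kdb_backtrace(); panic(...)".
 * The stack is printed first, from an ordinary function that returns
 * normally, so the trace does not depend on the compiler having saved
 * a return address for the noreturn call that follows.
 */
_Noreturn void
fail_with_trace(const char *msg)
{
	void *frames[MAX_FRAMES];
	int n = capture_frames(frames, MAX_FRAMES);

	backtrace_symbols_fd(frames, n, STDERR_FILENO);
	fprintf(stderr, "fatal: %s\n", msg);
	abort();
}
```

In the kernel the equivalent would simply be the kdb_backtrace() call placed on the line before the existing panic() call. On FreeBSD the userspace sketch above would likely need to be linked with -lexecinfo.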

nwf commented 6 years ago

While that did something, I don't think the result is much more informative:

KDB: stack backtrace:
db_trace_self+0x18 (?,?,?,?) ra ffffffff801e7800 sp c0000000003ae6f0 sz 16
db_fetch_ksymtab+0x238 (?,?,?,?) ra ffffffff80443c9c sp c0000000003ae700 sz 800
kdb_backtrace+0x7c (?,?,?,?) ra ffffffff80659794 sp c0000000003aea20 sz 16
vm_fault_hold+0x2dac (?,?,?,?) ra 0 sp c0000000003aea30 sz 0
panic: vm_fault_hold: fault on nofault entry, addr: 0xc000000004ddb000
time = 1533328761
KDB: enter: panic
[ thread pid 997 tid 100048 ]
Stopped at      kdb_enter+0x94: break   0

arichardson commented 6 years ago

I think this could be related to smbfs. I just ran the libc++ test suite without smbfs and it completed successfully, whereas whenever I mount the files with smbfs the job has always failed with this panic.