Open nwf-msr opened 9 months ago
If it's dying in loader that's an upstream bug, we don't turn anything CHERI on there. Can you please verify it's reproducible with 14.0-RELEASE and/or recent 15-CURRENT?
Yes, this is an upstream regression. Bisection (full sequence of tests below) points at 75b7d39e ("stand: efi_fmtdev can be reduced to devformat") being the culprit, though at a glance it's hard to see why.
git bisect bad 525ecfdad597980ea4cd59238e24c8530dbcd31d
git bisect good fc952ac2212b121aa6eefc273f5960ec3e0a466d
git bisect good 8824cbace389c440394bb9ea6c127d0f8f85538b
git bisect good 3fc5f6b0abf5d5e57a7f171a87b1fb7de63b096f
git bisect good 1c915de99108487f58f338ddf1287bda489cb73f
git bisect bad efbf827ae461a66402acb191c036623ce1ee608f
git bisect good 16ccda32623a98cf4a2ab3804b5ae6277c01395e
git bisect bad 03c44354fc9e82f89d6e323d5f20ab953854812d
git bisect good 09a31487c1bdcb5dc8780fdd7e78540d686a3854
git bisect good 9b2453430a8dc49d34be7560ad71ff568b973672
git bisect good 5c505b9d8166d997015d3e841846b5222375a1e7
git bisect good ec6d9023e56aebe0c3fe8fb7d15512097a20df8a
git bisect bad 26e5e3ecbd61d82b369a0ce6a83297e95526a691
git bisect good e9db6f25b5a6bb9ca76351c88720cca2ffcb6604
git bisect bad e36bd9b6cffb13f2ff50bd3551ff20188d5b8099
git bisect bad 75b7d39e116f79563ea3d8d18f9bf3141f89a712
It's a bit odd that you've hit the MFC commit to stable/13 rather than the one to main...
I will say it's not clear those functions are equivalent except in the simplest cases and the commit doesn't explain why only those cases can happen.
devformat should produce exactly the same results. If not, it's a bug in the dev->d_dev formatting routine (which should fall through to the default: case for HTTP booting). devformat uses devsw->dv_fmtdev if it exists, and defaults to
snprintf(name, sizeof(name), "%s%d:", d->d_dev->dv_name, d->d_unit);
if not, which is the same as the default case that was removed in 75b7d3...
- default:
- sprintf(buf, "%s%d:", dev->d_dev->dv_name, dev->d_unit);
- break;
I think only DEVT_NONE type devices are different.
Since only a few devices in the devsw have fmtdev, they should be the same. disk has a different one, and zfs has a different one as well....
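For reference, the dispatch described above can be sketched roughly as follows. This is a minimal stand-alone illustration, not the upstream source: the struct layouts are trimmed to the three fields mentioned in this thread, and `devformat_sketch`, `fake_disk_fmtdev`, and the 32-byte buffer size are hypothetical names and values chosen for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct devdesc;

/* Simplified stand-in for the loader's struct devsw (assumption). */
struct devsw {
	const char *dv_name;
	/* Optional per-device formatter; NULL for most devices. */
	char *(*dv_fmtdev)(struct devdesc *);
};

/* Simplified stand-in for the loader's struct devdesc (assumption). */
struct devdesc {
	struct devsw *d_dev;
	int d_unit;
};

/* Use the device's own formatter if it has one, else "<name><unit>:". */
char *
devformat_sketch(struct devdesc *d)
{
	static char name[32];	/* size is arbitrary for this sketch */

	if (d->d_dev->dv_fmtdev != NULL)
		return (d->d_dev->dv_fmtdev(d));
	snprintf(name, sizeof(name), "%s%d:", d->d_dev->dv_name, d->d_unit);
	return (name);
}

/* Hypothetical custom formatter, standing in for the disk or zfs ones. */
char *
fake_disk_fmtdev(struct devdesc *d)
{
	static char buf[32];

	snprintf(buf, sizeof(buf), "%s%dp1:", d->d_dev->dv_name, d->d_unit);
	return (buf);
}
```

With a NULL `dv_fmtdev` the result matches what the removed `default:` case would have produced, so any difference has to come from a device that supplies its own formatter (or from a devdesc that is not what the caller expects).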
And I know others have network booted.
One could revert that one change, and then unrevert the replaced calls to efi_fmtdev one at a time (there are only 3) to see which one goes south, and what the devdesc that's passed into the devformat() function looks like.
Also, it would be nice to get a symbolic traceback of what's happening. I have some libtraceback code written, but it doesn't quite work so I've not pushed it into upstream FreeBSD... but knowing which of these calls to efi_fmtdev dies might suffice.
Manually symbolizing...
Synchronous Exception at 0x00000000F327C740
PC 0x0000F327C740
PC 0x0000F7D310DC bi_load
PC 0x0000F7D3A498 elf64_exec
PC 0x0000F7D3ACE8 command_boot
PC 0x0000F7D709D0 lua_perform
PC 0x0000F7D4D374 luaD_precall
PC 0x0000F7D6555C luaV_execute
PC 0x0000F7D4D5CC luaD_callnoyield
PC 0x0000F7D4C56C luaD_rawrunprotected
PC 0x0000F7D4DB54 luaD_pcall
PC 0x0000F7D499C8 lua_pcallk
PC 0x0000F7D46770 interp_include
PC 0x0000F7D3C76C command_include
PC 0x0000F7D707C8 lua_command
PC 0x0000F7D4D0A0 luaD_pretailcall
PC 0x0000F7D673A8 luaV_execute
PC 0x0000F7D4D5CC luaD_callnoyield
PC 0x0000F7D4C56C luaD_rawrunprotected
PC 0x0000F7D4DB54 luaD_pcall
PC 0x0000F7D499C8 lua_pcall
PC 0x0000F7D468A0 interp_run
PC 0x0000F7D3C4C0 interact
PC 0x0000F7D34124 main
PC 0x0000F7D320A4 efi_main
PC 0x0000F7D3A550 _start
In particular,
; getrootmount(devformat(rootdev));
10d8: 3c 7c 01 94 bl 0x601c8 <devformat>
10dc: 06 26 00 94 bl 0xa8f4 <getrootmount>
but 0x0000F327C740 looks like it's somewhere else in EFI and so the trail runs pretty cold, though I think it's safe to say that we hit the d->d_dev->dv_fmtdev != NULL case. Oddly, looking at the rest of the EFI debug messages, I don't see anything getting loaded that far south.
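The "somewhere else" observation can be checked with a little address arithmetic. The sketch below assumes that the offsets in the disassembly (such as 0x10dc) are relative to the start of the loaded image, which would put the load base at 0xF7D30000; that base is inferred from the trace, not reported by the firmware. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Runtime address minus image-relative offset gives the load base. */
uint64_t
load_base(uint64_t runtime_addr, uint64_t image_offset)
{
	return (runtime_addr - image_offset);
}

/* True when pc lies between the lowest and highest symbolized frames. */
bool
within_observed_frames(uint64_t pc, uint64_t lowest, uint64_t highest)
{
	return (pc >= lowest && pc <= highest);
}
```

Feeding in the values from the trace, the bi_load frame (0xF7D310DC, offset 0x10dc) gives a base of 0xF7D30000, while the faulting PC 0xF327C740 falls below every symbolized frame (the lowest is 0xF7D310DC and the highest is 0xF7D709D0).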
So the exception is at 0x0000F327C740, which is well outside the range of all the other addresses on the stack. But it has survived the call to devformat...
I'm not sure what the device should be... but one test would be to set vfs.root.mountfrom to what you think it should be before issuing 'boot'. This would let us know if it was the return value (and/or the function itself) that's causing this or something else. Worst case, if you set this to something nonsensical, the kernel will just not be able to find /, so it's safe to test to see if the return value from devformat() changes in a way that runs off the cliff we hit.
I'm not familiar enough with CHERI to know, but are we running with capabilities which trap on buffer overflows, or are we still in mixed mode where they might be possible?
loader is built as a plain AArch64 binary, no capabilities allowed, since the firmware doesn't save/restore them currently on traps so they would be clobbered at arbitrary points. We only enable capability use in the kernel's locore.
OK, so we can't rule out an overflow or something similar....
So the exception is at 0x0000F327C740, which is well outside the range of all the other addresses on the stack. But it has survived the call to devformat...
I don't think that's right; the return address is 0x0000F7D310DC, which means that bi_load is still "in" its call to devformat, which has (possibly transitively) tail-called to 0x0000F327C740.
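The return-address argument can be verified from the instruction bytes quoted earlier. On AArch64, `bl` is encoded as 0b100101 in bits [31:26] followed by a signed 26-bit word offset, and it sets the link register to the address of the following instruction (PC + 4). The sketch below decodes the two `bl` instructions from the disassembly; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble four little-endian bytes into one 32-bit instruction word. */
uint32_t
le32(const uint8_t b[4])
{
	return ((uint32_t)b[0] | (uint32_t)b[1] << 8 |
	    (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24);
}

/* BL has opcode 0b100101 in the top six bits. */
int
is_bl(uint32_t insn)
{
	return ((insn >> 26) == 0x25);
}

/* Branch target: pc plus the sign-extended imm26 scaled by 4. */
uint64_t
bl_target(uint64_t pc, uint32_t insn)
{
	int64_t imm26 = insn & 0x03FFFFFF;

	if (imm26 & 0x02000000)
		imm26 -= 0x04000000;	/* sign-extend the 26-bit field */
	return (pc + (uint64_t)(imm26 * 4));
}

/* BL stores the address of the next instruction in the link register. */
uint64_t
bl_return_address(uint64_t pc)
{
	return (pc + 4);
}
```

The bytes `3c 7c 01 94` at 0x10d8 decode to a `bl` targeting 0x601c8 (devformat) with return address 0x10dc, which matches the low bits of the 0x0000F7D310DC frame. A frame whose saved return address is 0x10dc is therefore one that has called devformat and not yet returned from it.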
With apologies, I hadn't tested netbooting this release on the MSRC cluster. It looks like we've got a regression. With the https://github.com/microsoft/msr-morello-automation tooling and HTTP netboot, 2022.12 boots fine, but 2023.11 (built locally) dies at the very end of loader.efi with the below wad of complaints. I have tested on both the 1.5 and 1.7 releases of Morello firmware, with no apparent difference. Fortunately, loader_lua.efi from 2022.12 continues to function, so that's pretty convenient as workarounds go.