firecracker-microvm / firecracker

Secure and fast microVMs for serverless computing.
http://firecracker-microvm.io
Apache License 2.0

Some 4.14.x (and other kernels) may crash #636

Closed. rn closed this issue 3 years ago.

rn commented 5 years ago

It looks like the lack of ACPI in Firecracker triggers a kernel bug present in 4.13.x through 4.17.x kernels, depending on the kernel config used.

This is not really a Firecracker issue; I only provide the details here as an FYI in case others hit the same problem.

I booted a 4.14 kernel on Firecracker (and crosvm) with this kernel config and got the following crash:

[    0.700071] device-mapper: ioctl: 4.37.0-ioctl (2017-09-20) initialised: dm-devel@redhat.com
[    0.701125] dmi-sysfs: dmi entry is absent.
[    0.701663] ------------[ cut here ]------------
[    0.702068] kernel BUG at drivers/base/driver.c:153!
[    0.702763] invalid opcode: 0000 [#1] SMP PTI
[    0.703255] Modules linked in:
[    0.703581] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.81-linuxkit #1
[    0.703754] task: ffff88001f98edc0 task.stack: ffffc90000028000
[    0.703754] RIP: 0010:driver_register+0x13/0xb4
[    0.703754] RSP: 0000:ffffc9000002beb0 EFLAGS: 00010246
[    0.703754] RAX: ffffffff8230bd60 RBX: ffffffff826a99ea RCX: 0000000000000000
[    0.703754] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: ffffffff8230c100
[    0.703754] RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000000
[    0.703754] R10: ffff88001f83c710 R11: ffff88001dba1165 R12: 00000000ffffffff
[    0.703754] R13: ffffffff826526e8 R14: 0000000000000000 R15: 0000000000000000
[    0.703754] FS:  0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
[    0.703754] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.703754] CR2: 0000000000000000 CR3: 000000000220a001 CR4: 00000000001606b0
[    0.703754] Call Trace:
[    0.703754]  do_one_initcall+0x9f/0x14f
[    0.703754]  ? do_early_param+0x8e/0x8e
[    0.703754]  kernel_init_freeable+0x169/0x1ea
[    0.703754]  ? rest_init+0xa5/0xa5
[    0.703754]  kernel_init+0xa/0xf0
[    0.703754]  ret_from_fork+0x35/0x40
[    0.703754] Code: ff 48 83 c4 18 5b 5d 41 5c c3 0f 1f 44 00 00 48 8b 7f 70 e9 70 90 d1 ff 0f 1f 44 00 00 48 8b 47 08 48 83 b8 90 00 00 00 00 75 02 <0f> 0b 55 53 48 89 fb 48 83 78 40 00 74 07 48 83 7f 38 00 75 1c
[    0.703754] RIP: driver_register+0x13/0xb4 RSP: ffffc9000002beb0
[    0.716864] ---[ end trace ded8b4a83b29af3d ]---
[    0.717272] Kernel panic - not syncing: Fatal exception
[    0.717739] Kernel Offset: disabled

After poking around a little, I found that the crash happens when the wmi-bmof driver gets registered (WMI is part of the ACPI subsystem). An upstream commit, 0dda2bb62423 ("driver-core: return EINVAL error instead of BUG_ON()"), fixes the crash. The fix is in 4.18.x and newer, but the upstream commit does not appear to have been marked for backporting to older stable releases. The wmi-bmof driver, which triggers the bug, was introduced in 4.13.x.
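
For anyone who wants to check a guest kernel before booting it, here is a rough Python sketch of the conditions described above. It only encodes what this report says: a release between 4.13 and 4.17 with wmi-bmof enabled. The function names are made up for illustration. `CONFIG_WMI_BMOF` is assumed to be the config symbol for the driver, and `=m` is treated as affected too, although the crash above was seen with a built-in driver. A vendor kernel may carry the fix as a backport, so treat a positive result as a hint, not a verdict.

```python
import re

# Version window taken from the report: wmi-bmof appeared in 4.13,
# and the driver-core fix landed in 4.18.
FIRST_AFFECTED = (4, 13)
FIRST_FIXED = (4, 18)


def parse_kernel_version(release):
    """Return (major, minor) from a release string such as '4.14.81-linuxkit'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def wmi_bmof_enabled(config_text):
    """True if the config text enables CONFIG_WMI_BMOF (assumed symbol name)."""
    for line in config_text.splitlines():
        if line.strip() in ("CONFIG_WMI_BMOF=y", "CONFIG_WMI_BMOF=m"):
            return True
    return False


def may_hit_driver_register_bug(release, config_text):
    """Heuristic only: version inside the reported window and wmi-bmof enabled."""
    version = parse_kernel_version(release)
    in_window = FIRST_AFFECTED <= version < FIRST_FIXED
    return in_window and wmi_bmof_enabled(config_text)
```

Calling `may_hit_driver_register_bug("4.14.81-linuxkit", open(".config").read())` from the kernel build directory would flag the configuration from the log above, assuming that config enables the driver.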

alindima commented 3 years ago

It seems that this kernel bug is triggered on older kernels, but only if the wmi-bmof ACPI driver is configured. Since Firecracker does not currently use ACPI, I think running the guest kernel with acpi=off should work. Moreover, since this is a kernel bug, I assume there isn't much Firecracker can do about it. If there's something I'm missing, feel free to reopen the issue.
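
A minimal Python sketch of adding acpi=off to the guest command line, for anyone who wants to try the suggestion. The helper names and the example kernel path are hypothetical. The `kernel_image_path` and `boot_args` fields are the ones Firecracker's boot-source configuration accepts. Whether acpi=off actually avoids the crash has not been confirmed in this thread.

```python
import json


def with_acpi_off(boot_args):
    """Return boot_args with exactly one ACPI setting, 'acpi=off', at the end."""
    tokens = boot_args.split()
    # Drop any existing acpi= setting so the result is unambiguous.
    tokens = [t for t in tokens if not t.startswith("acpi=")]
    tokens.append("acpi=off")
    return " ".join(tokens)


def boot_source_payload(kernel_image_path, boot_args):
    """Build a JSON body for Firecracker's boot-source configuration."""
    return json.dumps(
        {
            "kernel_image_path": kernel_image_path,
            "boot_args": with_acpi_off(boot_args),
        }
    )


if __name__ == "__main__":
    # Hypothetical kernel path; replace with the real guest kernel image.
    print(boot_source_payload("./vmlinux", "console=ttyS0 reboot=k panic=1 pci=off"))
```

The resulting body can be sent as a PUT to /boot-source on the Firecracker API socket before starting the microVM.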