IRQ handler failure results in hung enclave and broken system

Kernel version: 5.15.0-1031-aws Nitro CLI version: 1.2.2 Instance type: m6i.2xlarge

Kernel module info:

filename:       /lib/modules/5.15.0-1031-aws/kernel/drivers/virt/nitro_enclaves/nitro_enclaves.ko
license:        GPL v2
description:    Nitro Enclaves Driver
author:         Amazon.com, Inc. or its affiliates
srcversion:     55EA5FE324256DEAE204522
alias:          pci:v00001D0Fd0000E4C1sv*sd*bc*sc*i*
depends:
retpoline:      Y
intree:         Y
name:           nitro_enclaves
vermagic:       5.15.0-1031-aws SMP mod_unload modversions
sig_id:         PKCS#7
signer:         Build time autogenerated kernel key
sig_key:        49:A9:55:87:90:5B:33:41:AF:C0:A7:BE:2A:71:6C:D2:CA:34:E0:AE
sig_hashalgo:   sha512
signature:      37:40:0C:76:41:C3:A6:E4:52:3F:0F:3F:5C:BB:09:5F:22:C2:45:0C:
                8A:3E:44:0C:B2:96:BB:69:42:70:40:C5:46:FC:1F:7E:6F:23:67:B5:
                C1:C0:5F:7F:7D:25:15:F8:8A:20:FC:27:50:4D:57:47:E6:AB:F6:DE:
                71:8E:70:FE:06:2B:DD:0B:2F:5A:18:03:B7:98:35:70:D0:F3:AC:A6:
                39:16:C5:5C:26:E1:4A:13:07:46:FE:54:9D:A3:43:11:16:4B:C5:77:
                E7:99:56:9F:DF:C0:61:14:4D:6E:EB:A9:6E:18:2B:3A:29:CF:D0:AC:
                18:46:85:7E:90:2E:B0:A3:2B:F0:7F:54:C3:0D:72:10:30:31:D4:53:
                33:11:6E:A4:D2:C5:8D:4F:31:AA:5C:9E:8E:75:C7:06:0C:8C:87:A4:
                91:42:81:E3:DC:7B:E8:75:F5:F1:15:D7:17:8E:FE:C0:8B:7C:24:05:
                3F:0C:A7:D1:47:8F:E7:85:65:DB:B9:7F:E4:C7:11:3F:A3:AE:A9:ED:
                F4:DF:C6:16:E6:13:12:B4:40:37:97:40:3E:87:9E:1F:5E:DC:49:7E:
                05:8C:57:87:2C:74:25:50:35:A0:75:C4:BB:1E:6B:B2:F9:33:05:0C:
                6C:05:FC:F5:55:5D:17:B2:05:B1:18:93:D7:C5:E1:39:4B:B5:B4:E8:
                24:BC:B2:18:ED:37:8D:68:91:9B:0D:91:E3:77:8F:7C:92:30:77:CA:
                E0:52:C4:CD:1F:09:08:9E:1A:B8:F5:04:B6:2F:94:23:44:50:D2:F7:
                04:07:D9:84:D7:95:88:9B:27:E8:B5:6B:ED:A1:2E:77:25:A4:83:88:
                EA:D1:4F:6C:FB:94:7C:24:4E:DD:99:62:36:60:1D:A1:7E:36:E7:BF:
                1A:32:4A:02:48:21:1A:72:5F:96:A2:25:6E:F0:74:B1:9E:92:B0:0B:
                DB:7A:BE:95:29:C2:7A:7F:91:5F:74:03:7C:6A:D7:9B:85:DB:FB:62:
                29:9B:B5:A3:9A:5D:91:CE:24:D7:AD:04:F5:69:0B:97:2B:EB:C8:94:
                CB:EB:18:AC:19:0B:43:53:3C:25:56:F8:EF:65:36:29:F3:82:35:D6:
                34:FB:41:1C:F3:21:11:97:AA:41:DE:25:74:84:61:3A:BF:33:EC:8F:
                C4:57:7F:8E:62:BE:BE:8C:47:F1:D1:25:38:B7:86:C3:AE:BA:8C:AB:
                94:D1:5E:B2:78:BC:38:E6:26:D6:E8:13:B0:01:4A:73:58:5D:D1:87:
                51:D3:BD:D0:34:2D:99:98:85:07:EA:CE:00:DF:C0:AD:7B:13:5B:81:
                0B:30:D8:89:E5:DD:FD:F6:AE:D2:3A:D8
parm:           ne_cpus:<cpu-list> - CPU pool used for Nitro Enclaves```

Allocator config:

---
# Enclave configuration file.
#
# How much memory to allocate for enclaves (in MiB).
memory_mib: 22108
#
# How many CPUs to reserve for enclaves.
cpu_count: 6
#
# Alternatively, the exact CPUs to be reserved for the enclave can be explicitely
# configured by using `cpu_pool` (like below), instead of `cpu_count`.
# Note: cpu_count and cpu_pool conflict with each other. Only use exactly one of them.
# Example of reserving CPUs 2, 3, and 6 through 9:
# cpu_pool: 2,3,6-9

Enclave config:

{
  "EnclaveName": "...",
  "EnclaveID": "...",
  "ProcessID": 3189409,
  "EnclaveCID": 13,
  "NumberOfCPUs": 2,
  "CPUIDs": [
    1,
    5
  ],
  "MemoryMiB": 5120
}

What is happening? Previously, I found that enclaves with debugging enabled and console attachment that had large amounts of stdout/stderr would eventually hang, and that attempts to restart them would result in the following error:

Start allocating memory...
Started enclave with enclave-cid: 13, memory: 5120 MiB, cpu-ids: [1, 5]
[ E36 ] Enclave boot failure. Such error appears when attempting to receive the `ready` signal from a freshly booted enclave. It arises in several contexts, for instance, when the enclave is booted from an invalid EIF file and the enclave process immediately exits, failing to submit the `ready` signal. In this case, the error backtrace provides detailed information on what specifically failed during the enclave boot process.

From the log file:

  Action: Run Enclave
  Subactions:
    Failed to execute command `Run`
    Failed to trigger enclave run
    Failed to run enclave
    Failed to create enclave
    Waiting on enclave to boot failed with error VsockReceivingError
  Root error file: src/enclave_proc/resource_manager.rs
  Root error line: 605
  Version: 1.2.2

The kernel meanwhile would report errors like this:

Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.443240] irq 41: nobody cared (try booting with the "irqpoll" option)
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446594] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-1031-aws #35-Ubuntu
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446598] Hardware name: Amazon EC2 m6i.2xlarge/, BIOS 1.0 10/16/2017
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446599] Call Trace:
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446602]  <IRQ>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446604]  show_stack+0x52/0x5c
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446610]  dump_stack_lvl+0x4a/0x63
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446613]  dump_stack+0x10/0x16
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446615]  __report_bad_irq+0x3a/0xb3
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446618]  note_interrupt.cold+0xb/0x60
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446620]  handle_irq_event+0xa8/0xb0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446623]  handle_edge_irq+0x8c/0x230
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446625]  __common_interrupt+0x4f/0xe0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446628]  common_interrupt+0x89/0xa0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446632]  </IRQ>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446632]  <TASK>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446633]  asm_common_interrupt+0x27/0x40
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446637] RIP: 0010:cpu_idle_poll.isra.0+0x33/0xd0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446639] Code: 53 65 8b 15 47 9e 85 7d 66 90 e8 08 cd 39 ff fb 66 0f 1f 44 00 00 65 48 8b 1c 25 c0 fb 01 00 48 8b 03 a8 08 74 0b eb 1c f3 90 <48> 8b 03 a8 0
8 75 13 8b 05 00 c3 47 01 85 c0 75 ed e8 d7 68 3c ff
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446641] RSP: 0018:ffffffff83803e68 EFLAGS: 00000202
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446644] RAX: 0000000000000001 RBX: ffffffff8381b440 RCX: 0000000000000001
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446645] RDX: 000000005f5a1209 RSI: 0000000000000000 RDI: 000000005f5a120a
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446646] RBP: ffffffff83803e70 R08: 00048be10300d8c4 R09: 00000000000003ed
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446648] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8381b440
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446649] R13: 0000000000000027 R14: 0000000000000000 R15: 0000000000000000
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446651]  ? cpu_idle_poll.isra.0+0x18/0xd0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446653]  do_idle+0x45/0xf0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446656]  cpu_startup_entry+0x20/0x30
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446658]  rest_init+0xd3/0x100
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446659]  ? acpi_enable_subsystem+0x20b/0x217
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446664]  arch_call_rest_init+0xe/0x23
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446669]  start_kernel+0x4a9/0x4ca
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446671]  x86_64_start_reservations+0x24/0x2a
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446673]  x86_64_start_kernel+0xfb/0x106
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446675]  secondary_startup_64_no_verify+0xc2/0xcb
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446680]  </TASK>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446680] handlers:
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.447978] [<0000000043f4e91b>] vring_interrupt
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.450356] Disabling IRQ #41

The only (apparent) way to allow the system to run any enclaves at this point was a complete reboot - the EIF file is unchanged, so there's nothing wrong with that. I suspected that the issue was caused by either a high rate or high volume of stdout/stderr, so my solution was to disable debugging and create a log shipping pipeline that used a separate vsock connection to get meaningful data out of the enclave. This appears to have been working fine until recently, when we ran into the exact same problem after a few weeks of uptime. It's worth noting that we still attach to the console as a means of blocking until the enclave exits, however there is no output (as expected). It does appear that at some point the enclave crashed (RC=4) after some time of running OK, after which it was no longer able to start due to the ready signal error. The enclaved application appears to have been in a hung state starting from about 02:21+08:00 (when the kernel IRQ error occurred) with the enclave itself crashing at about 17:47+08:00.

aws / aws-nitro-enclaves-cli

IRQ handler failure results in hung enclave and broken system #464