gmarkey opened this issue 1 year ago
Known issue on our side. We are busy with a fix for it. It is reproducible especially when one of the vsock directions is overloaded while the other is relatively unused.
@gmarkey Is it working on your side? The fix for it was deployed quite some time ago.
Kernel version: 5.15.0-1031-aws
Nitro CLI version: 1.2.2
Instance type: m6i.2xlarge
Kernel module info:
Allocator config:
Enclave config:
What is happening? Previously, I found that enclaves running with debugging enabled and a console attached, which produced large amounts of stdout/stderr, would eventually hang, and that attempts to restart them would fail with the following error:
From the log file:
Meanwhile, the kernel would report errors like this:
The only apparent way to get the system to run any enclaves again at this point was a complete reboot; the EIF file is unchanged, so there is nothing wrong with it. I suspected the issue was caused by either a high rate or a high volume of stdout/stderr, so my workaround was to disable debugging and create a log-shipping pipeline that uses a separate vsock connection to get meaningful data out of the enclave (sketched below).

This appeared to be working fine until recently, when we ran into the exact same problem after a few weeks of uptime. It's worth noting that we still attach to the console as a means of blocking until the enclave exits, although there is no output (as expected).

It appears that at some point the enclave crashed (RC=4) after running fine for a while, after which it could no longer start due to the `ready` signal error. The enclaved application seems to have been hung from about 02:21+08:00 (when the kernel IRQ error occurred), with the enclave itself crashing at about 17:47+08:00.
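For context, below is a minimal sketch of the kind of log-shipping pipeline described above, with the enclave forwarding log lines over a dedicated vsock connection to a listener on the parent instance. The port number (5005) and the newline-delimited framing are arbitrary choices for illustration, not taken from our actual pipeline; CID 3 is the address at which a Nitro enclave reaches its parent instance.

```python
import socket

LOG_PORT = 5005   # arbitrary example port; both sides must agree on it
PARENT_CID = 3    # the parent instance's CID as seen from inside a Nitro enclave

def ship_logs(lines):
    """Enclave side: forward an iterable of log lines to the parent."""
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as sock:
        sock.connect((PARENT_CID, LOG_PORT))
        for line in lines:
            sock.sendall(line.encode("utf-8") + b"\n")

def receive_logs():
    """Parent side: accept one enclave connection and consume its log lines."""
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as srv:
        srv.bind((socket.VMADDR_CID_ANY, LOG_PORT))
        srv.listen(1)
        conn, (peer_cid, _) = srv.accept()
        with conn, conn.makefile("r", encoding="utf-8", errors="replace") as f:
            for line in f:
                # A real pipeline would forward this to a log store;
                # printing stands in for that here.
                print(f"[enclave cid={peer_cid}] {line}", end="")
```

The point of this design is that log traffic never touches the debug console path at all; the console is only attached (producing no output, since debugging is off) as a crude way to block until the enclave exits.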