Open jsturtevant opened 7 months ago
AFAICS the failing test point is always test_delete_after_create. Unlike the other test points, this one never calls start() nor wait().
I can imagine two options:
1. Something happens in start that this test point doesn't hit. But then I would expect this test point to always fail.
2. There is a race between delete and the cgroups path being created. As we usually wait for the child process to finish, this is not a problem, but in this test point we might be calling delete too quickly? This seems the most likely to me.
I don't have the bandwidth to validate any of this at the moment, but I can take a look next week.
The failure is not specific to wasmer. I can reproduce with any runtime.
I'm running
while RUST_LOG=trace cargo test --quiet --package containerd-shim-wasmtime -- --nocapture --test-threads=1 test_delete_after_create; do echo .; done
That sporadically fails to remove the cgroup folder (that's why it's in a loop).
It seems that youki's libcgroups is already trying to work around this issue (see https://github.com/containers/youki/pull/63 and https://github.com/containers/youki/pull/333), where youki tries to delete the folder a few times with a small delay between attempts.
The underlying cause is that trying to remove the folder results in Err(Os { code: 16, kind: ResourceBusy, message: "Device or resource busy" }).
Locally, I usually see that the first attempt fails and the second succeeds, but sometimes all 4 attempts fail.
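For illustration, the retry-on-busy behaviour described above can be sketched with the standard library alone. This is not youki's actual code: the helper name remove_dir_with_retry and the attempts and delay parameters are hypothetical, and the only detail taken from the thread is that the failure surfaces as OS error 16 (EBUSY on Linux).

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;
use std::time::Duration;

/// Linux errno for "Device or resource busy" (the `code: 16` reported above).
const EBUSY: i32 = 16;

/// Hypothetical helper: try to remove `path`, retrying only while the kernel
/// reports EBUSY. `attempts` and `delay` are illustrative knobs, not values
/// taken from youki.
pub fn remove_dir_with_retry(path: &Path, attempts: u32, delay: Duration) -> io::Result<()> {
    let mut last_err = None;
    for attempt in 0..attempts {
        match fs::remove_dir(path) {
            Ok(()) => return Ok(()),
            // Busy: remember the error and sleep before the next attempt.
            Err(e) if e.raw_os_error() == Some(EBUSY) => {
                last_err = Some(e);
                if attempt + 1 < attempts {
                    thread::sleep(delay);
                }
            }
            // Any other error (missing path, permissions, ...) is not retried.
            Err(e) => return Err(e),
        }
    }
    // Every attempt hit EBUSY, or `attempts` was zero.
    Err(last_err.unwrap_or_else(|| io::Error::new(io::ErrorKind::Other, "no attempts made")))
}
```

With a helper shaped like this, the sporadic failure corresponds to the case where every attempt returns EBUSY, so the last busy error is what the caller sees.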
When we delete the container, we send SIGKILL to the container init process. IIUC, it takes some time from when we send SIGKILL to the process until the kernel allows us to delete the cgroup. My best guess is that we could increase the number of attempts that youki makes to remove the folder. But that doesn't explain why we only see the failure in this test and not others. I should try to reproduce this with youki alone.
ping @utam0k
Other test cases in this file take 10 seconds. It may be related... I'm not sure... https://github.com/containerd/runwasi/blob/71f8df9cc576ea9564b3ea692432e80b454da7e5/crates/containerd-shim-wasmer/src/tests.rs#L12
It seems runc attempts 100 retries. We need to implement the same logic... 😭
https://github.com/opencontainers/runc/blob/02120488a4c0fc487d1ed2867e901eeed7ce8ecf/libcontainer/state_linux.go#L49-L51
https://github.com/opencontainers/runc/blob/02120488a4c0fc487d1ed2867e901eeed7ce8ecf/delete.go#L16-L25
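A bounded wait in that spirit might look like the sketch below. Only the 100-attempt bound comes from the comment above. The helper name wait_until_gone, the is_alive closure, and the delay are assumptions made for illustration, and nothing here is youki's or runc's actual API.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical helper: poll `is_alive` up to `attempts` times, sleeping
/// `delay` after each poll that still reports the process as alive.
/// Returns true as soon as the process is reported gone, false if the
/// attempts run out first.
pub fn wait_until_gone<F>(mut is_alive: F, attempts: u32, delay: Duration) -> bool
where
    F: FnMut() -> bool,
{
    for _ in 0..attempts {
        if !is_alive() {
            return true;
        }
        thread::sleep(delay);
    }
    false
}
```

A caller would send SIGKILL first, then call something like wait_until_gone(probe, 100, Duration::from_millis(100)) and remove the cgroup only once it returns true. Here probe stands for whatever liveness check the runtime already has, and the 100 ms delay is an assumed value.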
That 100 retries is wild.
Originally posted by @jsturtevant in https://github.com/containerd/runwasi/issues/420#issuecomment-1850500767