Flaky test `test_delete_after_create` - failed to remove cgroup #422 - Githubissues

containerd / runwasi

Facilitates running Wasm / WASI workloads managed by containerd

Apache License 2.0

1.02k stars 84 forks source link

Flaky test `test_delete_after_create` - failed to remove cgroup #422 #423

Open jsturtevant opened 7 months ago

jsturtevant commented 7 months ago

          same failure as on #422

2023-12-11T16:24:37.6263250Z [1m[32m    Finished[0m test [unoptimized] target(s) in 28.73s
2023-12-11T16:24:38.0688247Z ././target/build/x86_64-unknown-linux-musl//x86_64-unknown-linux-musl/debug/deps/containerd_shim_wasmer-4cf9e40c15afbde7 --nocapture --test-threads=1
2023-12-11T16:24:38.1117914Z 
2023-12-11T16:24:38.1118414Z running 7 tests
2023-12-11T16:24:38.6147511Z test wasmer_tests::test_custom_entrypoint ... ok
2023-12-11T16:24:38.7404433Z [2023-12-11T16:24:38Z ERROR libcontainer::container::container_delete] failed to remove cgroup due to: V2(WrappedIo(Other { err: Custom { kind: TimedOut, error: "could not delete" }, path: "/sys/fs/cgroup/test" })) cgroup_path="test"
2023-12-11T16:24:38.7408303Z Error: io error: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7409256Z 
2023-12-11T16:24:38.7409508Z Caused by:
2023-12-11T16:24:38.7409998Z     0: io error: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7411140Z     1: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7412084Z test wasmer_tests::test_delete_after_create ... FAILED
2023-12-11T16:24:39.6997094Z test wasmer_tests::test_exit_code ... ok
2023-12-11T16:24:40.4872602Z test wasmer_tests::test_has_default_devices ... ok
2023-12-11T16:24:40.9817523Z test wasmer_tests::test_hello_world ... ok
2023-12-11T16:24:41.9711765Z test wasmer_tests::test_seccomp ... ok
2023-12-11T16:24:42.3465552Z [2023-12-11T16:24:42Z ERROR wasmer_wasix::state::env] wasi[1]::wasi initialize error (Missing export memory)
2023-12-11T16:24:42.8445822Z test wasmer_tests::test_unreachable ... ok
2023-12-11T16:24:42.8446223Z 
2023-12-11T16:24:42.8446548Z failures:
2023-12-11T16:24:42.8446734Z 
2023-12-11T16:24:42.8446917Z failures:
2023-12-11T16:24:42.8447238Z     wasmer_tests::test_delete_after_create
2023-12-11T16:24:42.8447557Z 
2023-12-11T16:24:42.8448359Z test result: FAILED. 6 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 4.73s
2023-12-11T16:24:42.8449129Z 
2023-12-11T16:24:42.8470890Z make: *** [Makefile:126: test-wasmer] Error 123
2023-12-11T16:24:42.8487753Z ##[error]Process completed with exit code 2.

Originally posted by @jsturtevant in https://github.com/containerd/runwasi/issues/420#issuecomment-1850500767

Mossaka commented 6 months ago

Another instance:

https://github.com/containerd/runwasi/actions/runs/7332652774/job/19967109853?pr=428

Mossaka commented 6 months ago

https://github.com/containerd/runwasi/actions/runs/7450055138/job/20268090826?pr=444

jprendes commented 5 months ago

AFAICS the failing test point is always test_delete_after_create. Unlike the other test points, this one never calls start() nor wait(). I can imagine two options:

Some setup is happening in start that this tespoint doesn't hit. But then I would expect this test point to always fail.
There's some race condition between delete and the cgroups path being created. As we usually wait for the child process to finish, this is not a proble, but in this test point we might be calling delete too quickly? This seems the most likely to me.

I don't have the bandwidth to validate any of this at the moment, but I can take a look next week.

jprendes commented 5 months ago

The failure is not specific to wasmer. I can reproduce with any runtime.

I'm running

while RUST_LOG=trace cargo test --quiet --package containerd-shim-wasmtime -- --nocapture --test-threads=1 test_delete_after_create; do echo .; done

That sporadically fails to remove the cgroup folder (that's why it's in a loop).

It seems that youki's libcgroups is already trying to work around this issue (see https://github.com/containers/youki/pull/63 and https://github.com/containers/youki/pull/333), where youki tries to delete the folder a few times with a small delay between attempts.

The underlying cause is that trying to remove the folder results in Err(Os { code: 16, kind: ResourceBusy, message: "Device or resource busy" }). Locally, I usually see that the first attempt fails, and the second succeeds, but some times all 4 attempts fail.

jprendes commented 5 months ago

When we delete the container, we send SIGKILL to the container init process. IIUC, it takes some time from when we send SIGKILL to the process, until the kernel allows us to delete the cgroup. My best guess is that we could increase the number of attempts that youki makes to remove the folder. But that doesn't explain why we only see the failure in this test and not others. I should try reproduce this with youki alone.

ping @utam0k

utam0k commented 5 months ago

Other test cases in this file take 10 seconds. It may be related... I'm not sure... https://github.com/containerd/runwasi/blob/71f8df9cc576ea9564b3ea692432e80b454da7e5/crates/containerd-shim-wasmer/src/tests.rs#L12

It seems runc attempts 100 retries. We need to implement the same logic... 😭 https://github.com/opencontainers/runc/blob/02120488a4c0fc487d1ed2867e901eeed7ce8ecf/libcontainer/state_linux.go#L49-L51 https://github.com/opencontainers/runc/blob/02120488a4c0fc487d1ed2867e901eeed7ce8ecf/delete.go#L16-L25

Mossaka commented 5 months ago

That 100 retries is wild.