containers / youki

A container runtime written in Rust
https://containers.github.io/youki/
Apache License 2.0
5.99k stars · 332 forks

bug: unreliable behavior in stress tests `failed to delete container` #2826

Open hegerdes opened 1 week ago

hegerdes commented 1 week ago

I'm currently evaluating OCI implementations, and youki is one of them. In the scenario I created, each OCI runtime gets called via containerd and just creates (and removes) a short-lived container.

I found that youki occasionally failed in these tests: out of 1000 runs, 36 failed. The error was:

time="2024-06-22T13:33:12Z" level=fatal msg="failed to create shim task: OCI runtime create failed: youki did not terminate successfully: exit status 1: unknown"
Command exited with non-zero status 1
time="2024-06-22T13:33:23Z" level=error msg="failed to delete task: failed to delete task: youki did not terminate successfully: exit status 1: error in executing command: failed to delete container 88883548369aa91be047a519708e2b3952803bf0525b35b41f4520722e2b7fd1\n\nCaused by:\n    0: systemd error: dbus error: dbus function call error: expected to get a reply for method call, didn't get any\n    1: dbus error: dbus function call error: expected to get a reply for method call, didn't get any\n    2: dbus function call error: expected to get a reply for method call, didn't get any\nError: failed to delete container 88883548369aa91be047a519708e2b3952803bf0525b35b41f4520722e2b7fd1\n\nCaused by:\n    0: systemd error: dbus error: dbus function call error: expected to get a reply for method call, didn't get any\n    1: dbus error: dbus function call error: expected to get a reply for method call, didn't get any\n    2: dbus function call error: expected to get a reply for method call, didn't get any\n: unknown"

The command run was `nerdctl run --network=none --rm --runtime youki busybox true`

System Info:

Version           0.3.3
Commit            c03edea
Kernel-Release    6.1.0-21-amd64
Kernel-Version    #1 SMP PREEMPT_DYNAMIC Debian 6.1.90-1 (2024-05-03)
Architecture      x86_64
Operating System  Debian GNU/Linux 12 (bookworm)
Cores             8
Total Memory      31335
Cgroup setup      unified
Cgroup mounts
Namespaces        enabled
  mount           enabled
  uts             enabled
  ipc             enabled
  user            enabled
  pid             enabled
  network         enabled
  cgroup          enabled
Capabilities
CAP_BPF           available
CAP_PERFMON       available
CAP_CHECKPOINT_RESTORE available

Did you encounter something similar in testing, or do you know what caused this?

Besides that, I think there could be a memory leak (even though it is written in Rust). The test for youki was started at the yellow dot and ended at 15:38:40. The system has about 50 MB more memory usage than before the test. It is not large and I could be mistaken, but it's something I noticed in comparison with other OCI implementations.

[image: memory usage graph]

YJDoc2 commented 1 week ago

Hey @hegerdes , thanks for reporting this!

The issue with the dbus reply is a known one (https://github.com/containers/youki/issues/2559), but unfortunately it is so intermittent that we are not able to reliably reproduce it in order to debug it. Would you be able to help with this, as you seem to have a way to trigger it repeatedly? Can you let us know what your setup for the tests is, so that we can try to reproduce it?

The suspected reason behind this failure is an error in our dbus client implementation: it does not correctly receive the reply from dbus, and thus errors out.

As for the memory leak, we don't have any specific checks for that; we mostly rely on Rust to manage memory correctly, and we use unsafe carefully. Given that youki runs as a process and the OS should deallocate all of its memory once the process terminates, it may also be an issue with the testing setup; maybe the setup leaks some memory when the tests fail? Not sure.

hegerdes commented 1 week ago

Would you be able to help with this, as you seem to have a way to trigger it repeatedly? Can you let us know what your setup for the tests is, so that we can try to reproduce it?

Sure, I can try to help. The problem is that I can trigger it often enough for it to be a problem in an actual workload, but too rarely for it to be easy to debug.

For testing I use an Ansible playbook to set up a Debian VM with a bunch of OCI runtimes and containerd. I then use a cri-bench.sh script to run the workload X times via nerdctl.

I just ran it again with some debug flags, but there was not much more to see. Here are some more logs:

# Nerdctl logs
time="2024-06-23T12:42:06Z" level=debug msg="verifying process skipped"
time="2024-06-23T12:42:06Z" level=debug msg="generated log driver: binary:///usr/local/bin/nerdctl?_NERDCTL_INTERNAL_LOGGING=%2Fvar%2Flib%2Fnerdctl%2F1935db59"
time="2024-06-23T12:42:06Z" level=debug msg="remote introspection plugin filters" filters="[type==io.containerd.snapshotter.v1, id==overlayfs]"
time="2024-06-23T12:42:06Z" level=error msg="failed to delete task: failed to delete task: youki did not terminate successfully: exit status 1: error in executing command: failed to delete container 2552391f27c5197e7c4c59bbcbd1d9bf6c33453e078aa8b1c23201cf8b461ae3\n\nCaused by:\n    0: systemd error: dbus error: dbus function call error: expected to get a reply for method call, didn't get any\n    1: dbus error: dbus function call error: expected to get a reply for method call, didn't get any\n    2: dbus function call error: expected to get a reply for method call, didn't get any\nError: failed to delete container 2552391f27c5197e7c4c59bbcbd1d9bf6c33453e078aa8b1c23201cf8b461ae3\n\nCaused by:\n    0: systemd error: dbus error: dbus function call error: expected to get a reply for method call, didn't get any\n    1: dbus error: dbus function call error: expected to get a reply for method call, didn't get any\n    2: dbus function call error: expected to get a reply for method call, didn't get any\n: unknown"

# Containerd logs
Jun 23 12:42:15 168.119.165.86 containerd[10489]: time="2024-06-23T12:42:15.652327976Z" level=warning msg="error from *cgroupsv2.Manager.EventChan" error="failed to add inotify watch for \"/sys/fs/cgroup/system.slice/nerdctl-f2616157bd55f09a747a8e8a0c15669bc6ab67daabc4775be85adf07e6d68fe9.scope/memory.events\": no such file or directory"
Jun 23 12:42:15 168.119.165.86 containerd[10489]: time="2024-06-23T12:42:15.713743022Z" level=info msg="shim disconnected" id=f2616157bd55f09a747a8e8a0c15669bc6ab67daabc4775be85adf07e6d68fe9 namespace=default
Jun 23 12:42:15 168.119.165.86 containerd[10489]: time="2024-06-23T12:42:15.713790135Z" level=warning msg="cleaning up after shim disconnected" id=f2616157bd55f09a747a8e8a0c15669bc6ab67daabc4775be85adf07e6d68fe9 namespace=default
Jun 23 12:42:15 168.119.165.86 containerd[10489]: time="2024-06-23T12:42:15.713802829Z" level=info msg="cleaning up dead shim" namespace=default

# Another containerd error
Jun 23 13:01:21 debian-4gb-nbg1-2 containerd[878]: time="2024-06-23T13:01:21.211758273Z" level=warning msg="cleanup warnings time=\"2024-06-23T13:01:21Z\" level=debug msg=\"starting signal loop\" namespace=default pid=5979 runtime=io.containerd.runc.v2\ntime=\"2024-06-23T13:01:21Z\" level=warning msg=\"failed to read init pid file\" error=\"open /run/containerd/io.containerd.runtime.v2.task/default/b4d018e85ccc815304f278bb8cea0b00abf07ebb9ace95ec13c70fea9d673497/init.pid: no such file or directory\" runtime=io.containerd.runc.v2\n" namespace=default
Jun 23 13:01:21 debian-4gb-nbg1-2 containerd[878]: time="2024-06-23T13:01:21.212888661Z" level=error msg="copy shim log" error="read /proc/self/fd/13: file already closed" namespace=default

It seems like containerd/youki is trying to reference a file for the created container that does not exist. Might this be due to a race condition?

Let me know how I can set some more debug flags and I will try again next week.

utam0k commented 6 days ago

May I ask you to try the same test without dbus? That is, I would like you to try it with the cgroupfs cgroup driver.
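For reference, one way this could be done with nerdctl is its global `--cgroup-manager` flag (a sketch under the assumption that the rest of the original test setup stays unchanged):

```shell
# Re-run the same stress-test command, but ask nerdctl/containerd to use
# the cgroupfs driver instead of the systemd one, so youki's dbus client
# is never exercised. Accepted values are "cgroupfs", "systemd", "none".
nerdctl --cgroup-manager=cgroupfs run --network=none --rm --runtime youki busybox true
```

If the failures disappear under cgroupfs, that would further isolate the problem to the dbus/systemd code path.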

YJDoc2 commented 2 days ago

The problem is that I can trigger it often enough for it to be a problem in an actual workload, but too rarely for it to be easy to debug.

Yes, exactly!

It seems like containerd/youki is trying to reference a file for the created container that does not exist. Might this be due to a race condition?

The nerdctl logs show that it is the same dbus error. I feel that the latter logs are a knock-on effect: because the container failed to start, the deletion and other commands also error out.

Let me know how I can set some more debug flags and I will try again next week.

My thoughts are this: the issue is in the dbus client in youki, which for some reason flakes occasionally. To figure out the issue, we need to find where youki's dbus call is erroring, and what the dbus logs are at that time. For this, we need to trace dbus logs on the host, either the k8s node or the underlying VM where the failure occurs. So if you have access to the machine, we can create and delete containers in a loop while tracing the logs. I have made a script (added below) which does this using podman and dbus-monitor. I was able to get the logs in one case, but unfortunately I hadn't added the extra logs in the youki code to see which dbus function call failed. Since then I have been trying to run the script in a Vagrant VM to get the error, but have not yet been successful. If possible, you can try this script on the machine.

#!/bin/bash

# set -x

ITER_COUNT=500
PODMAN_LOG_FILE="./podman_logs.txt"
DBUS_LOG_FILE="./dbus_log.txt"

for i in $(seq 1 "$ITER_COUNT"); do

  echo "ITER $i"

  ERRORED=0
  rm -f "$DBUS_LOG_FILE" "$PODMAN_LOG_FILE"

  # Clean up any leftover container from a previous iteration
  podman rm -f test 2> /dev/null

  # Capture dbus traffic for this iteration only
  dbus-monitor --session > "$DBUS_LOG_FILE" 2>&1 &
  DBUS_MONITOR_PID=$!

  podman create --name test --runtime "$PWD/youki" alpine >> "$PODMAN_LOG_FILE" 2>&1
  ERRORED=$((ERRORED + $?))

  podman start test >> "$PODMAN_LOG_FILE" 2>&1
  ERRORED=$((ERRORED + $?))

  podman stop test >> "$PODMAN_LOG_FILE" 2>&1
  ERRORED=$((ERRORED + $?))

  podman rm -f test >> "$PODMAN_LOG_FILE" 2>&1
  ERRORED=$((ERRORED + $?))

  kill -SIGKILL "$DBUS_MONITOR_PID" > /dev/null 2>&1
  wait "$DBUS_MONITOR_PID" 2> /dev/null

  if [ "$ERRORED" -ne 0 ]; then
    echo "FOUND ERROR"
    # Print any dbus-related lines; if there are some, this is our error
    if grep dbus "$PODMAN_LOG_FILE"; then
      echo "ERROR was dbus error, exiting"
      exit 1
    fi
    echo "Not dbus error:" && cat "$PODMAN_LOG_FILE"
  fi
done

echo "NO issues"

rm "$PODMAN_LOG_FILE" "$DBUS_LOG_FILE"

hegerdes commented 2 days ago

Yeah, sure, I will try both your suggestions and will report back if I find anything useful. Hopefully I will recognize it, because I'm not really familiar with the low-level systemd, dbus and cgroup stuff. Just wanted to let you know that I will not be able to tackle this before Tuesday/Wednesday.

utam0k commented 2 days ago

@YJDoc2 Thomas(@Furisto ) can help with this issue. Thanks, @Furisto. Please collaborate with him to solve it 🙏

YJDoc2 commented 1 day ago

Hey, so I did more debugging, and found that the root cause is that the current implementation of the dbus client only receives the first message that is sent on the socket, and does not wait for any more. If that message is not a method return or method error, we ignore it, and thus the call gets reported as "no message received". However, we actually get a message for a signal first, and if we wait again, we get the actual method return message. Even though I'm not 100% sure, my current theory of why this is flaky rather than deterministic is this:

The signal is sent after the hello call, as part of the hello call's response. If the signal is sent by dbus within roughly 11-15 micro (?) seconds, it is received in the same buffer as the method return of the hello call and ignored. However, if it takes more than 15-20 microseconds, the hello call's buffer only gets the method return message, and the next method call gets the signal in response. Because we only wait for the first message and not all the messages (dbus communicates on the same socket, which is kept open across calls, hence we cannot wait indefinitely), we return only the signal message from the buffer and later ignore it, producing the "no message received" error.

In a "correct" client implementation we would keep consuming messages until we get one matching the serial number we sent in the method call message, indicating that the communication is complete, but we haven't done that.
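The consume-until-serial-matches idea can be sketched as follows. This is a minimal stand-alone illustration with simplified, hypothetical types (`Message`, `wait_for_reply`), not youki's actual dbus client code:

```rust
// Sketch of "keep consuming until the reply's serial matches".
// The Message enum is a simplified stand-in for real dbus wire messages.
#[derive(Debug)]
enum Message {
    Signal, // e.g. the NameAcquired signal that arrives around Hello
    MethodReturn { reply_serial: u32 },
    MethodError { reply_serial: u32, name: String },
}

/// Consume messages from the socket (here modeled as any iterator) until
/// one is a method return/error whose reply_serial matches the serial we
/// sent. Unrelated signals are skipped instead of being mistaken for the
/// reply, which is the bug described above.
fn wait_for_reply<I>(messages: I, sent_serial: u32) -> Result<Message, String>
where
    I: IntoIterator<Item = Message>,
{
    for msg in messages {
        match msg {
            Message::Signal => continue, // ignore signals, keep waiting
            Message::MethodReturn { reply_serial } if reply_serial == sent_serial => {
                return Ok(msg);
            }
            Message::MethodError { reply_serial, .. } if reply_serial == sent_serial => {
                return Err(format!("method call failed: {:?}", msg));
            }
            _ => continue, // reply to a different serial; skip it
        }
    }
    Err("no reply received for method call".into())
}

fn main() {
    // Flaky case: a signal lands on the socket before the method return.
    // The old behavior looked only at the first message (the signal) and
    // reported "no message received"; looping until the serial matches works.
    let buffered = vec![Message::Signal, Message::MethodReturn { reply_serial: 2 }];
    assert!(wait_for_reply(buffered, 2).is_ok());

    // A matching method error is surfaced as an error, not swallowed.
    let errored = vec![Message::MethodError {
        reply_serial: 2,
        name: "org.freedesktop.DBus.Error.Failed".into(),
    }];
    assert!(wait_for_reply(errored, 2).is_err());

    println!("ok");
}
```

The real client additionally has to respect a read timeout, since the socket stays open across calls and an unmatched serial would otherwise block forever.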

I also ran youki under strace, and that reduced the frequency of these errors from 6-7 in 500 iterations to 1-2 in 500 iterations. As strace slows down execution, there is a higher chance of the signal being received in the same buffer as the hello call's reply, which supports the above explanation.
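For anyone wanting to reproduce the strace run: one way to trace a runtime that is invoked by podman/containerd is a small wrapper script passed as `--runtime`. This is my own sketch, not necessarily how it was done here; the youki path and log location are assumptions:

```shell
#!/bin/sh
# Hypothetical wrapper: point podman's --runtime at this file instead of
# the youki binary, so every invocation is recorded by strace.
# -f follows child processes, -tt adds microsecond timestamps, and -o
# writes one log per wrapper PID so concurrent runs don't clobber each other.
exec strace -f -tt -o "/tmp/youki-strace.$$" /usr/local/bin/youki "$@"
```

Keep in mind the observation above: the added overhead itself changes the timing, so the error may become rarer while tracing.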

I tried running it with my branch in PR https://github.com/containers/youki/pull/2838 , and it did not error once in 1500 iterations.

@hegerdes, can I request that, instead of running the script, you run the original nerdctl test, but with youki built from my PR branch? And if there is no error in 1000 iterations, maybe run up to 5000-10000 iterations, to be sure that we have actually solved it and not just reduced the frequency? In case you have issues building youki locally, I can try to give you a binary built from the PR branch.

@Furisto , if possible, can you double check my logic and the PR? Thanks!