containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

podman exec into a "-it" container: container create failed (no logs from conmon): EOF #10927

Open edsantiago opened 3 years ago

edsantiago commented 3 years ago

Common thread seems to be:

Running: podman [options] run -dti --name test1 quay.io/libpod/fedora-minimal:latest sleep +Inf
time="2021-06-16T19:33:53-05:00" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
99e3b419a97aa408a4d0d3072bbd00579d5edd7c97790aa06d61f233cfdc1b4c
Running: podman [options] exec -ti test1 true
Running: podman [options] exec -ti test1 true       ! sometimes it fails on the first, sometimes on the third
Error: container create failed (no logs from conmon): EOF

Podman exec [It] podman exec terminal doesn't hang

And also just now in a still-live PR (my flake-xref does not handle live PRs): int podman ubuntu-2104 root host

Note: the March and April logs above have been garbage-collected, so I can't confirm that the error is the same one. I'm leaving them in the report deliberately, in case it helps to have a timestamp for the start of this flake (i.e. it might not be new in June).

Edit: this is podman, not podman-remote, so it's unlikely to be the same as #7360

edsantiago commented 1 year ago

A really funny thing happened. I disabled the 3-flake-retry thing in int tests, and all of a sudden this flake exploded. In fact I can barely even get CI to pass on my PR.

edsantiago commented 1 year ago

These are just the failures from yesterday.

edsantiago commented 1 year ago

Just a few of the flakes from yesterday, in only one PR:

When I say "just a few", it's because I'm close to giving up on flake triage. This is by far the number one flake, and it's kind of draining to keep wasting time reporting it.

@containers/podman-maintainers can someone please, please fix this?

edsantiago commented 1 year ago

@containers/podman-maintainers pleeeeeeeze

edsantiago commented 1 year ago

Tagging with bugweek because this is by far our number one flake. It never seems to triple-flake; all it does is fail once and then succeed on retry. But it is still the most critical issue blocking us from even thinking about #17967.

Luap99 commented 1 year ago

Didn't we use to store the journal in the cirrus task? I cannot see it there. Maybe the journal tells us why conmon is exiting.

edsantiago commented 1 year ago

Uh, we still do? From any colorized log, go to the top, click on the Task link (not Build), scroll to the bottom, and see the Run journal accordion. Note that the accordion is incomplete; you have to click on the box-arrow at top right to get to the full log.

Luap99 commented 1 year ago

Thanks, I was looking at the Artifacts at the top of the page.

Luap99 commented 1 year ago

@edsantiago Did you ever see this outside of podman exec? I am looking at the code path where the error is returned, and it could also affect podman run, but I don't see that in the flake log, so it must be something that only exec triggers.

edsantiago commented 1 year ago

Interesting. No, I've only seen it in exec. And I'm seeing it multiple times per day, too, and have developed a strong feel for the failure pattern. I would notice if it were anything different than usual.

Luap99 commented 1 year ago

Does this still reproduce on a 1minutetip VM? What we should do is run the exec command with --log-level=debug, which tells conmon to log to the journal. In that case, if conmon really errors out, we should see an error message there.

edsantiago commented 1 year ago

Yep, just reproduced on 1mt f37:

#  while :;do podman --log-level=debug exec -it test true || break;done
...took a while...
DEBU[0000] running conmon: /usr/bin/conmon               args="[--api-version 1 -c 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd -u 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c -p /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exec_pid -n test --exit-dir /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exit --full-attach -s -l none --log-level debug --syslog -t -i -e --exec-attach --exec-process-spec /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exec-process-2394717713 --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg  --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg boltdb --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --exec --exit-command-arg 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c --exit-command-arg 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd]"
INFO[0000] Running conmon under slice machine.slice and unitName libpod-conmon-40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd.scope 
DEBU[0000] Sending resize events to exec session 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c 
DEBU[0000] Attaching to container 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd exec session 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c 
DEBU[0000] Received: 0                                  
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
DEBU[0000] Shutting down engines

I think, but am not sure, that this might be the relevant journal log:

... conmon[13913]: conmon 40f7406aeb2b58854ae0 <error>: Runtime did not set up terminal
... /usr/bin/podman[13900]: time="2023-04-27T09:35:06-04:00" level=error msg="Running container cleanup: container 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd has no exec session with ID 4a64c746e7dc1a13e63d8292d271a2e6cc1745a64c5cd23e62c61d3e3c3b2559: no such exec session"

Luap99 commented 1 year ago

Ok great, I can see where to go from there. I can see in the conmon code that the error message causes an early exit without writing a status back to us, which would explain the podman error.
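
(For anyone wondering about the exact wording of the podman error: as far as I understand it, podman waits for conmon to write a status message back over a pipe, and when conmon exits first there is nothing to read -- just EOF / zero bytes, hence "(no logs from conmon): EOF" and the empty conmon bytes "". A generic sketch of that shape, not the actual podman/conmon code:)

/*
 * Generic illustration (not podman/conmon source): if the child exits
 * before writing anything to the status pipe, the parent's read() returns
 * 0 bytes -- EOF -- and there is nothing to parse, which is the shape of
 * the "(no logs from conmon)" error above.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int p[2];
    char buf[256];
    ssize_t n;

    if (pipe(p) != 0)
        return 1;

    if (fork() == 0) {          /* plays the role of conmon */
        close(p[0]);
        _exit(1);               /* bails out without writing any status */
    }

    close(p[1]);                /* parent keeps only the read end */
    n = read(p[0], buf, sizeof(buf));
    printf("read %zd bytes\n", n);   /* prints 0: EOF, nothing to parse */
    wait(NULL);
    return 0;
}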

edsantiago commented 1 year ago

No wait, there's a lot more that might be relevant: journal-exec-ti.log

edsantiago commented 1 year ago

The resize entries make me wonder if this is related to #10710

Luap99 commented 1 year ago

So here is what I see: this patch can be used to reproduce this 100% of the time. Of course the patch is invalid, but it shows where the problem seems to be.

diff --git a/src/conmon.c b/src/conmon.c
index 71f4d49..0fe7a3e 100644
--- a/src/conmon.c
+++ b/src/conmon.c
@@ -338,7 +338,7 @@ int main(int argc, char *argv[])
                g_unix_fd_add(seccomp_socket_fd, G_IO_IN, seccomp_accept_cb, csname);

        if (csname != NULL) {
-               g_unix_fd_add(console_socket_fd, G_IO_IN, terminal_accept_cb, csname);
+               //g_unix_fd_add(console_socket_fd, G_IO_IN, terminal_accept_cb, csname);
                /* Process any SIGCHLD we may have missed before the signal handler was in place.  */
                if (!opt_exec || !opt_terminal || container_status < 0) {
                        GHashTable *exit_status_cache = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);

I have a hard time understanding how these g_unix_fd_add calls actually work; I am just going to assume that this is async code. The callback terminal_accept_cb() sets the mainfd_stdout var to the actual fd, and the check here (which is the one failing) errors out. So, in my understanding, there is a simple race between the callback running and the code below continuing on to the check against the mainfd_stdout fd.

So how can we make sure the callback is run before that check?

In any case, this is how I understand the situation. I could be totally wrong, but the fact is that

if (opt_terminal && mainfd_stdout == -1)
    nexit("Runtime did not set up terminal");

is causing the error that we see in the journal.
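
For reference, here is a minimal standalone GLib program (just an illustration of the mechanism, not conmon code) showing what I mean: g_unix_fd_add() only registers a watch, and the callback is dispatched on a later main-loop iteration, even if the fd was already readable at registration time. Anything that checks the variable before the loop has dispatched the callback still sees -1.

/*
 * Minimal standalone example of g_unix_fd_add() dispatch -- illustration
 * only, not conmon code.
 * Build: cc fdadd.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <glib-unix.h>
#include <unistd.h>

static int mainfd_stdout = -1;   /* stands in for conmon's global */

static gboolean terminal_accept_cb(gint fd, GIOCondition cond, gpointer loop)
{
    mainfd_stdout = fd;          /* the real callback picks up the console fd here */
    g_main_loop_quit(loop);
    return G_SOURCE_REMOVE;
}

int main(void)
{
    int p[2];
    GMainLoop *loop = g_main_loop_new(NULL, FALSE);

    pipe(p);
    write(p[1], "x", 1);         /* fd is readable *before* we register the watch */
    g_unix_fd_add(p[0], G_IO_IN, terminal_accept_cb, loop);

    /* Not dispatched yet: a check made here would still see -1. */
    g_print("before loop: mainfd_stdout = %d\n", mainfd_stdout);

    g_main_loop_run(loop);       /* callback fires on this iteration, even though
                                    the fd was ready before registration */
    g_print("after loop:  mainfd_stdout = %d\n", mainfd_stdout);

    g_main_loop_unref(loop);
    return 0;
}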

giuseppe commented 1 year ago

great analysis!

I've looked into that code in conmon and I've found some issues that could be related: https://github.com/containers/conmon/pull/411

g_unix_fd_add should report when the FD is ready, even if it is already ready when we register the callback. I've done some tests by adding an artificial sleep() before g_unix_fd_add, and it seems to be handled correctly.

The OCI runtime blocks until the receiver (conmon in this case) receives the terminal fd, so I'd guess there is no race. What I've observed is that a failure in the terminal_accept_cb is not handled and could lead to the same failure.
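
Roughly, the failure mode I mean looks like this (a simplified sketch, not the real terminal_accept_cb): if the callback runs into an error and just returns, nothing records the failure, mainfd_stdout stays -1, and the later check can only report "Runtime did not set up terminal", matching the journal message above.

/*
 * Simplified sketch of an unhandled callback failure -- not the real
 * terminal_accept_cb.  The callback gives up on an unexpected payload
 * without recording the error, so the variable stays -1 and the later
 * check has nothing better to say than "Runtime did not set up terminal".
 * Build: cc cberr.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <glib-unix.h>
#include <unistd.h>

static int mainfd_stdout = -1;

static gboolean terminal_accept_cb(gint fd, GIOCondition cond, gpointer loop)
{
    char c;

    /* Stand-in for the accept()/SCM_RIGHTS handshake going wrong. */
    if (read(fd, &c, 1) != 1 || c != 'T') {
        g_main_loop_quit(loop);
        return G_SOURCE_REMOVE;  /* error silently swallowed */
    }
    mainfd_stdout = fd;
    g_main_loop_quit(loop);
    return G_SOURCE_REMOVE;
}

int main(void)
{
    int p[2];
    GMainLoop *loop = g_main_loop_new(NULL, FALSE);

    pipe(p);
    write(p[1], "x", 1);         /* "wrong" payload triggers the error path */
    g_unix_fd_add(p[0], G_IO_IN, terminal_accept_cb, loop);
    g_main_loop_run(loop);

    if (mainfd_stdout == -1)     /* same shape as the check quoted above */
        g_printerr("Runtime did not set up terminal\n");

    g_main_loop_unref(loop);
    return mainfd_stdout == -1;
}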

I've opened a PR for conmon, not sure if it has any impact for the current issue though: https://github.com/containers/conmon/pull/411

Luap99 commented 1 year ago

Thanks @giuseppe. Given we have a somewhat solid reproducer on 1mt VMs, I compiled your patch and am currently testing it there. Previously it failed within ~15 mins for me; now it has already been running for 15 mins. If it doesn't fail in the next hour, I would be willing to call this fixed.

Luap99 commented 1 year ago

Update: I was unable to reproduce with either the patched or unpatched version today. So I cannot say if this fixed anything. I guess I need @edsantiago's special hands to prep the VM in order to reproduce.

edsantiago commented 1 year ago

Hi. Sorry for neglecting this.

Problem reproduces on f38, with podman @ 03578810759b35e9f15e4452a29d8667f1fcced1 and conmon @ d5564f8c3693d2e54b45555fd8f3b1596b0e6d77 (which includes PR 411). Also with conmon-2.1.7-2.fc38.x86_64 but that's expected.

# bin/podman run -d --name test quay.io/libpod/testimage:20221018 top
# while :;do bin/podman --log-level=debug exec -it test true || break;done
....takes a while, but eventually stops with the usual error

With @Luap99's patch above (commenting out the g_unix_fd_add call) it reproduces 100% (or at least five out of five tries).

Each of my different-conmon attempts involved first podman rm -f -a -t0 to stop the running container.

edsantiago commented 1 year ago

By changing podman exec ... true to ... date, the debug logs change as follows:

  INFO[0000] Running conmon under slice ...
  DEBU[0000] Sending resize events to exec session
  DEBU[0000] Attaching to container
  DEBU[0000] Received: 0
- Error: container create failed (no logs from conmon
+ DEBU[0000] Received: 352094
+ DEBU[0000] Successfully started exec session ... in container ....
+ Thu May 25 19:09:42 UTC 2023
+ [conmon:d]: exec with attach is waiting for start message from parent
+    [indented because of \r]   [conmon:d]: exec with attach got start message from parent
+     [indented more]      DEBU[0000] Container ... exec session ... completed with exit code 0
... lots more ...

This suggests to me that the error happens when setting up fds, not when closing them. (This may have already been obvious to you all.)

edsantiago commented 12 months ago

Quick reminder that this is still happening, even if we don't see it in normal CI (because of flake retries). Last 60 days:

Seen in: podman/remote fedora-37/fedora-38/fedora-39/rawhide root/rootless container/host boltdb/sqlite

edsantiago commented 11 months ago

ARGH! I added exec -t tests in system tests -- which do not have FlakeRetry -- and, blam.

This is a horrible flake. It is going to start hitting us once a day or more. Is there any chance at all that someone can fix it? Pretty please??? The only other choice is to remove the exec -t system test.

edsantiago commented 4 months ago

Here's a surprise: same thing, but on run, not exec (f39 root container):

# podman [options] run -dt quay.io/libpod/alpine:latest
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

Luap99 commented 4 months ago

> Here's a surprise: same thing, but on run, not exec (f39 root container):
>
> # podman [options] run -dt quay.io/libpod/alpine:latest
> Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

I don't think run/exec matter much; also, this error can be triggered in many ways. It is just a symptom of conmon exiting or crashing without sending us the proper data we want, AFAICT:

$ podman --conmon /usr/bin/false  run --rm quay.io/libpod/testimage:20240123 true
Error: conmon failed: exit status 1
$ podman --conmon /usr/bin/true  run --rm quay.io/libpod/testimage:20240123 true
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

edsantiago commented 3 weeks ago

Observation, possibly meaningless: this is triggering often in parallel bats tests. That's unusual: this has almost always been an e2e flake.

Seen in: int(4)/sys(4), podman(7)/remote(1), debian-13(4)/fedora-39(3)/fedora-40(1), root(7)/rootless(1), host(7)/container(1), sqlite(6)/boltdb(2)

Luap99 commented 3 weeks ago

I would assume it correlates with the system load: given you now run in parallel, scheduling delays between the individual threads/processes will most likely be higher, thus hitting more race conditions.

Looking in the journal for the container name I found this:

Aug 27 15:37:07 cirrus-task-6657029832441856 podman[135965]: 2024-08-27 15:37:07.90776318 -0500 CDT m=+0.096462576 container exec f079d2c2b5726194b21cfbf446052e3ec8cece8f28f0723652bc196d4c0759c5 (image=quay.io/libpod/testimage:20240123, name=c--explicit_RUN_term-t148-3n5hvd07, created_at=2024-01-24T18:38:28Z, created_by=test/system/build-testimage, io.buildah.version=1.34.1-dev)
Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135999]: conmon f079d2c2b5726194b21c <nwarn>: Failed to open cgroups file: /proc/136012/cgroup
Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135999]: conmon f079d2c2b5726194b21c <nwarn>: Failed to get cgroup path. Container may have exited
Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135999]: conmon f079d2c2b5726194b21c <nwarn>: Failed to open cgroups file:
Aug 27 15:37:07 cirrus-task-6657029832441856 podman[135965]: 2024-08-27 15:37:07.93562878 -0500 CDT m=+0.124328176 container exec_died f079d2c2b5726194b21cfbf446052e3ec8cece8f28f0723652bc196d4c0759c5 (image=quay.io/libpod/testimage:20240123, name=c--explicit_RUN_term-t148-3n5hvd07, created_at=2024-01-24T18:38:28Z, created_by=test/system/build-testimage, io.buildah.version=1.34.1-dev)

and also this:

Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135892]: conmon f079d2c2b5726194b21c <error>: Runtime did not set up terminal

I wonder if this is related: https://github.com/containers/crun/issues/1524