edsantiago opened 3 years ago
A really funny thing happened. I disabled the 3-flake-retry thing in int tests, and all of a sudden this flake exploded. In fact I can barely even get CI to pass on my PR.
These are just a few of the flakes from yesterday, in only one PR:
When I say "just a few", it's because I'm close to giving up on flake triage. This is by far the number one flake, and it's kind of draining to keep wasting time reporting it.
@containers/podman-maintainers can someone please, please fix this?
@containers/podman-maintainers pleeeeeeeze
Tagging with bugweek
because this is by far our number one flake. It never seems to triple-flake; all it does is fail once and then succeed on retry, but it is the most critical issue blocking us from even thinking about #17967.
Didn't we used to store the journal in the Cirrus task? I cannot see it there. Maybe the journal tells us why conmon is exiting.
Uh, we still do? From any colorized log, go to the top, click on the Task link (not Build), scroll to the bottom, and see the Run journal accordion. Note that the accordion is incomplete; you have to click on the box-arrow at top right to get to the full log.
Thanks, I was looking at the Artifacts at the top of the page.
@edsantiago Did you ever see this outside of podman exec? I am looking at the code path where the error is returned, and it could also affect podman run, but I don't see that in the flake log, so it must be something that only exec triggers.
Interesting. No, I've only seen it in exec. And I'm seeing it multiple times per day, too, and have developed a strong feel for the failure pattern. I would notice if it were anything different than usual.
Does this still reproduce on a 1minutetip VM? What we should do is run the exec command with --log-level=debug; that tells conmon to log to the journal. In that case, if conmon really errors out, we should see an error message there.
Yep, just reproduced on 1mt f37:
# while :;do podman --log-level=debug exec -it test true || break;done
...took a while...
DEBU[0000] running conmon: /usr/bin/conmon args="[--api-version 1 -c 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd -u 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c -p /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exec_pid -n test --exit-dir /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exit --full-attach -s -l none --log-level debug --syslog -t -i -e --exec-attach --exec-process-spec /var/lib/containers/storage/overlay-containers/40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd/userdata/4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c/exec-process-2394717713 --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg boltdb --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on 
--exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --exec --exit-command-arg 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c --exit-command-arg 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd]"
INFO[0000] Running conmon under slice machine.slice and unitName libpod-conmon-40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd.scope
DEBU[0000] Sending resize events to exec session 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c
DEBU[0000] Attaching to container 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd exec session 4b3f600e6af3b86a9565b77ef3ddb00c91d32873e857cbab60ddb6fe1017717c
DEBU[0000] Received: 0
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
DEBU[0000] Shutting down engines
I think, but am not sure, that this might be the relevant journal log:
... conmon[13913]: conmon 40f7406aeb2b58854ae0 <error>: Runtime did not set up terminal
... /usr/bin/podman[13900]: time="2023-04-27T09:35:06-04:00" level=error msg="Running container cleanup: container 40f7406aeb2b58854ae0ae853c42a96ddae584bd28d620ef7a9fb081b2733ffd has no exec session with ID 4a64c746e7dc1a13e63d8292d271a2e6cc1745a64c5cd23e62c61d3e3c3b2559: no such exec session"
Ok great, I can see where to go from there. I can see in the conmon code that this error message causes an early exit without writing a status back to us, which would explain the podman error.
No wait, there's a lot more that might be relevant: journal-exec-ti.log
The resize entries make me wonder if this is related to #10710
So here is what I see: this patch can be used to reproduce this 100% of the time. Of course the patch is invalid, but it shows where the problem seems to be.
diff --git a/src/conmon.c b/src/conmon.c
index 71f4d49..0fe7a3e 100644
--- a/src/conmon.c
+++ b/src/conmon.c
@@ -338,7 +338,7 @@ int main(int argc, char *argv[])
g_unix_fd_add(seccomp_socket_fd, G_IO_IN, seccomp_accept_cb, csname);
if (csname != NULL) {
- g_unix_fd_add(console_socket_fd, G_IO_IN, terminal_accept_cb, csname);
+ //g_unix_fd_add(console_socket_fd, G_IO_IN, terminal_accept_cb, csname);
/* Process any SIGCHLD we may have missed before the signal handler was in place. */
if (!opt_exec || !opt_terminal || container_status < 0) {
GHashTable *exit_status_cache = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
I have a hard time understanding how these g_unix_fd_add calls actually work; I am just going to assume that this is async code. The callback terminal_accept_cb() sets the mainfd_stdout var to the actual fd, and the check here (which is the one failing) errors out.
So in my understanding there is a simple race between the callback and the code continuing below to make the check against the mainfd_stdout fd.
So how can we make sure the callback is run before that check?
In any case, this is how I understand the situation. I could be totally wrong, but the fact is that
if (opt_terminal && mainfd_stdout == -1)
nexit("Runtime did not set up terminal");
is causing the error that we see in the journal.
great analysis!
I've looked into that code in conmon and I've found some issues that could be related: https://github.com/containers/conmon/pull/411
g_unix_fd_add should report when the FD is ready, even if it is already ready when we register the callback. I've done some tests by adding an artificial sleep() before g_unix_fd_add and it seems to be handled correctly.
The OCI runtime blocks until the receiver (conmon in this case) receives the terminal fd, so I'd guess there is no race. What I've observed is that a failure in terminal_accept_cb is not handled and could lead to the same failure.
I've opened a PR for conmon, not sure if it has any impact for the current issue though: https://github.com/containers/conmon/pull/411
Thanks @giuseppe. Given we have a somewhat solid reproducer on 1mt VMs, I compiled your patch and am currently testing it there. Previously it failed within ~15 minutes for me; now it has already been running for 15 minutes. If it doesn't fail in the next hour, I would be willing to call this fixed.
Update: I was unable to reproduce with either the patched or the unpatched version today, so I cannot say whether this fixed anything. I guess I need @edsantiago's special hands to prep the VM in order to reproduce.
Hi. Sorry for neglecting this.
Problem reproduces on f38, with podman @ 03578810759b35e9f15e4452a29d8667f1fcced1 and conmon @ d5564f8c3693d2e54b45555fd8f3b1596b0e6d77 (which includes PR 411). Also with conmon-2.1.7-2.fc38.x86_64 but that's expected.
# bin/podman run -d --name test quay.io/libpod/testimage:20221018 top
# while :;do bin/podman --log-level=debug exec -it test true || break;done
....takes a while, but eventually stops with the usual error
With @Luap99's patch above (commenting out g_unix_fd_add) it reproduces 100% (or at least five out of five tries).
Each of my different-conmon attempts started with podman rm -f -a -t0 to stop the running container.
By changing podman exec ... true to ... date, the debug logs change as follows:
INFO[0000] Running conmon under slice ...
DEBU[0000] Sending resize events to exec session
DEBU[0000] Attaching to container
DEBU[0000] Received: 0
- Error: container create failed (no logs from conmon
+ DEBU[0000] Received: 352094
+ DEBU[0000] Successfully started exec session ... in container ....
+ Thu May 25 19:09:42 UTC 2023
+ [conmon:d]: exec with attach is waiting for start message from parent
+ [indented because of \r] [conmon:d]: exec with attach got start message from parent
+ [indented more] DEBU[0000] Container ... exec session ... completed with exit code 0
... lots more ...
This suggests to me that the error happens when setting up the fds, not when closing them. (This may have already been obvious to you all.)
Quick reminder that this is still happening, even if we don't see it in normal CI (because of flake retries). Last 60 days:
Seen in: podman/remote fedora-37/fedora-38/fedora-39/rawhide root/rootless container/host boltdb/sqlite
ARGH! I added exec -t tests in system tests -- which do not have FlakeRetry -- and, blam.
This is a horrible flake. It is going to start hitting us once a day or more. Is there any chance at all that someone can fix it? Pretty please??? The only other choice is to remove the exec -t system test.
Here's a surprise: same thing, but on run, not exec. f39 root container:
# podman [options] run -dt quay.io/libpod/alpine:latest
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
I don't think run vs. exec matters much; this error can be triggered in many ways. AFAICT it is just a symptom of conmon exiting or crashing without sending us the proper data we expect:
$ podman --conmon /usr/bin/false run --rm quay.io/libpod/testimage:20240123 true
Error: conmon failed: exit status 1
$ podman --conmon /usr/bin/true run --rm quay.io/libpod/testimage:20240123 true
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
Observation, possibly meaningless: this is triggering often in parallel bats tests. That's unusual: this has almost always been an e2e flake.
| Test | Client | Distro | Privileges | Environment | Database |
|---|---|---|---|---|---|
| int(4) | podman(7) | debian-13(4) | root(7) | host(7) | sqlite(6) |
| sys(4) | remote(1) | fedora-39(3) | rootless(1) | container(1) | boltdb(2) |
| | | fedora-40(1) | | | |
I would assume it correlates with system load: given the tests run in parallel now, scheduling delays between the individual threads/processes will most likely be higher, thus exposing more race conditions.
Looking in the journal for the container name I found this:
Aug 27 15:37:07 cirrus-task-6657029832441856 podman[135965]: 2024-08-27 15:37:07.90776318 -0500 CDT m=+0.096462576 container exec f079d2c2b5726194b21cfbf446052e3ec8cece8f28f0723652bc196d4c0759c5 (image=quay.io/libpod/testimage:20240123, name=c--explicit_RUN_term-t148-3n5hvd07, created_at=2024-01-24T18:38:28Z, created_by=test/system/build-testimage, io.buildah.version=1.34.1-dev)
Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135999]: conmon f079d2c2b5726194b21c <nwarn>: Failed to open cgroups file: /proc/136012/cgroup
Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135999]: conmon f079d2c2b5726194b21c <nwarn>: Failed to get cgroup path. Container may have exited
Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135999]: conmon f079d2c2b5726194b21c <nwarn>: Failed to open cgroups file:
Aug 27 15:37:07 cirrus-task-6657029832441856 podman[135965]: 2024-08-27 15:37:07.93562878 -0500 CDT m=+0.124328176 container exec_died f079d2c2b5726194b21cfbf446052e3ec8cece8f28f0723652bc196d4c0759c5 (image=quay.io/libpod/testimage:20240123, name=c--explicit_RUN_term-t148-3n5hvd07, created_at=2024-01-24T18:38:28Z, created_by=test/system/build-testimage, io.buildah.version=1.34.1-dev)
and also this:
Aug 27 15:37:07 cirrus-task-6657029832441856 conmon[135892]: conmon f079d2c2b5726194b21c <error>: Runtime did not set up terminal
I wonder if this is related: https://github.com/containers/crun/issues/1524
Common thread seems to be:
Podman exec [It] podman exec terminal doesn't hang
And also just now in a still-live PR (my flake-xref does not handle live PRs): int podman ubuntu-2104 root host
Note: the March and April logs above have been garbage-collected, so I can't confirm that the error is the same one. I'm leaving them in the report deliberately, in case it helps to have a timestamp for the start of this flake (i.e. it might not be new in June).
Edit: this is podman, not podman-remote, so it's unlikely to be the same as #7360