Open ttys3 opened 2 years ago
The core issue here, perhaps, is that we trust Conmon (and thus the OCI runtime) to run quickly, such that they can be run in a critical section without blocking the lock for long. This does not seem to be a sane assumption under all circumstances - see this issue, but also the previous way we did sdnotify (which blocked at this point until the container was actually fully up).
We can potentially address this by tightening the container start timeout, which right now is absurdly long (I want to say 10 minutes?), such that we just declare a container as "failed" if it takes longer than, say, 30 seconds; this is still an absurdly long time, but it should be within most timeouts, so things should theoretically proceed. We probably also want to kill Conmon in these cases, so the container doesn't continue to start in the background after we declare it has failed.
This also seems like something we could more easily address with the conmon-rs effort - if we're restructuring Conmon, handling these things on the Conmon side and writing an API that can't block us for multiple minutes when we try to start a container seems like a good addition.
I think we had the same issue for container stop, right? We fixed that by adding the stopping state so we could unlock while we wait for the oci runtime. It should be possible to do something like this for start.
That gets a bit more complicated... What if the user wants to stop a starting container? We'd need some way to kill the starting OCI runtime and all associated processes before it's actually fully spun up.
A friendly reminder that this issue had no activity for 30 days.
In the meantime, do you have any suggestion to workaround the issue, either with a systemd directive to orchestrate stuff a bit better, or anything else? It's kind of handicapping not being able to restart a node when using Nomad :/
Also, this issue is hard to debug because there are not many logs.
Thank you very much!
@Arno500 here's the real reason why this happened: https://github.com/containers/podman/issues/13081#issuecomment-1053601112
A simple solution is to patch conmon to set O_NONBLOCK when it opens the log.
Would it be worth making a PR from your fork to nomad-plugin-podman in this case?
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
When a conmon process gets stuck, `podman ps` and other commands like `podman info` also get stuck, as does the REST API. The hang does not resolve until the stuck conmon process is killed.

Steps to reproduce the issue:
Save the two scripts `podman-create-container-via-api.sh` and `podman-start-container-via-api.sh`, then make them executable:

chmod a+rx podman-create-container-via-api.sh podman-start-container-via-api.sh

Ensure your system has `jq` (https://github.com/stedolan/jq) installed; otherwise you will have to modify the scripts to avoid `jq`. Create the container, then run the start script.
Describe the results you received:
2022-02-03 16:58:15 |=================== begin POST /v1.0.0/libpod/containers/0f44d6fb63699f5de959f0f5e8fa0588219ed88bbd6d004028ab88ad14024239/start ===================>>>
curl: (28) Operation timed out after 60001 milliseconds with 0 bytes received

Starting the container failed (the timeout is 60s). There is now one conmon process stuck forever:

root 132608 0.0 0.0 8060 3584 ? S 16:58 0:00 /usr/bin/conmon --api-version 1 -c 0f44d6fb63699f5de959f0f5e8fa0588219ed88bbd6d004028ab88ad14024239 -u 0f44d6fb63699f5de959f0f5e8fa0588219ed88bbd6d004028ab88ad14024239 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/0f44d6fb63699f5de959f0f5e8fa0588219ed88bbd6d004028ab88ad14024239/userdata -p /run/containers/storage/overlay-containers/0f44d6fb63699f5de959f0f5e8fa0588219ed88bbd6d004028ab88ad14024239/userdata/pidfile -n redis-demo --exit-dir /run/libpod/exits --full-attach -s -l k8s-file:/var/lib/containers/redis-demo.fifo --log-level debug --syslog --conmon-pidfile /run/containers/storage/overlay-containers/0f44d6fb63699f5de959f0f5e8fa0588219ed88bbd6d004028ab88ad14024239/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 0f44d6fb63699f5de959f0f5e8fa0588219ed88bbd6d004028ab88ad14024239
Version: 3.4.4
API Version: 3.4.4
Go Version: go1.17.4
Git Commit: f6526ada1025c2e3f88745ba83b8b461ca659933
Built: Fri Dec 10 02:30:40 2021
OS/Arch: linux/amd64
host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
Name : podman
Version : 3.4.4-1
Description : Tool and library for running OCI-based containers in pods
Architecture : x86_64
URL : https://github.com/containers/podman
Licenses : Apache
Groups : None
Provides : None
Depends On : cni-plugins conmon containers-common crun fuse-overlayfs iptables libdevmapper.so=1.02-64 libgpgme.so=11-64 libseccomp.so=2-64 slirp4netns
Optional Deps : apparmor: for AppArmor support
                btrfs-progs: support btrfs backend devices [installed]
                catatonit: --init flag support
                podman-docker: for Docker-compatible CLI
Required By : cockpit-podman podman-compose
Optional For : None
Conflicts With : None
Replaces : None
Installed Size : 72.79 MiB
Packager : David Runge dvzrv@archlinux.org
Build Date : Fri 10 Dec 2021 02:30:40 AM CST
Install Date : Thu 03 Feb 2022 02:45:48 AM CST
Install Reason : Explicitly installed
Install Script : No
Validated By : Signature