I'm fairly certain this is because runc/crun themselves basically freeze midway through container creation until they are notified via SD_NOTIFY. Podman is waiting for final confirmation from the runtime that the container has been successfully created, and it is holding the container lock for the container in question while doing so. Changing that may be possible (it would unblock some things like podman ps), but the container is not considered running until the runtime finishes up, so we won't be able to podman exec into the container until the notify issue has sorted itself out.
First, fix #6689 (advertise MAINPID via sd-notify); this eliminates the need for a PIDFile.
In addition, I think there should be a new option to podman run, something like:
--sdnotify conmon-only|container
With conmon-only, we send the MAINPID and clear the NOTIFY_SOCKET so the OCI runtime doesn't pass it into the container. We could also send READY=1 when the OCI runtime finishes, marking the service as ready.
With container, we send the MAINPID, and leave the NOTIFY_SOCKET so the OCI runtime passes it into the container for initialization, and let the container advertise further metadata.
This removes the need for hardcoded CID and PID files in the command line, and the PIDFile directive, as the pid is advertised directly through sd-notify.
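For context, the sd-notify protocol itself is just datagrams of KEY=VALUE lines written to the socket named in NOTIFY_SOCKET, so a conmon-only mode boils down to something like the sketch below. This is only an illustration in Python (variable names are mine; conmon would do this in C):

```python
import os
import socket

def sd_notify(message: str) -> None:
    """Send one sd-notify datagram to the socket systemd passed to us."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return  # not running under a Type=notify unit
    # A leading '@' means an abstract socket; translate it for AF_UNIX.
    if addr.startswith("@"):
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)

# What --sdnotify=conmon-only would amount to: advertise conmon's PID as the
# main PID right away, then READY=1 once the OCI runtime has finished.
conmon_pid = os.getpid()  # stand-in for the real conmon PID
sd_notify(f"MAINPID={conmon_pid}")
sd_notify("READY=1")
```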
I'm fairly certain this is because runc/crun themselves basically freeze midway through container creation until they are notified via SD_NOTIFY. Podman is waiting for final confirmation from the runtime that the container has been successfully created, and it is holding the container lock for the container in question while doing so. Changing that may be possible (it would unblock some things like podman ps), but the container is not considered running until the runtime finishes up, so we won't be able to podman exec into the container until the notify issue has sorted itself out.
I'm less concerned with operations on the container in question (i.e. exec into the actual container) than I am with things completely unrelated, like podman info. Could always use nsenter to poke around the container and troubleshoot.
@vrothberg @giuseppe see --sdnotify proposal above
I did MAINPID only since I thought that was least intrusive. With the above, container would be the default, as that's what it does now.
We could add "none" as well, to tell podman not to do anything with the NOTIFY_SOCKET. I just didn't see how systemd would get the correct PID from the container itself... Looks like runc attempts to send a MAINPID, and if it overrides us there's nothing we can do about it anyway... but in my testing, whatever runc is sending doesn't work.
Thanks for the ping, @goochjj!
@giuseppe knows these details of the runtimes and I guess he will know what to do :)
I'm fairly certain this is because runc/crun themselves basically freeze midway through container creation until they are notified via SD_NOTIFY. Podman is waiting for final confirmation from the runtime that the container has been successfully created, and it is holding the container lock for the container in question while doing so. Changing that may be possible (it would unblock some things like podman ps), but the container is not considered running until the runtime finishes up, so we won't be able to podman exec into the container until the notify issue has sorted itself out.
Yes, I confirm that. Both crun and runc stop on "$RUNTIME start" until the container notifies them.
Why does the container take so long to be ready? Does it actually send any READY=1 notification?
Also, any other info sent by the container is ignored, e.g. MAINPID. The only notification accepted by the runtimes is READY=1.
No, I'm purposely delaying it to test behavior. I don't think we can count on the container init being quick.
For instance, if I were to use this for my tomcat based web apps, it may take 30s-2min to initialize, and then I'd run health checks... I wouldn't want all of podman to be locked while that's happening.
Also, any other info sent by the container is ignored, e.g. MAINPID. The only notification accepted by the runtimes is READY=1.
@giuseppe Am I reading this wrong? It looks like at line 166 runc itself is sending the MAINPID. https://github.com/opencontainers/runc/blob/9748b48742fad065666bcded00f2db5ded7e44c1/notify_socket.go#L140-L170
If runc does send it, it's not making it to systemd, or systemd is throwing it out - at least in my testing.
@giuseppe Am I reading this wrong? It looks like at line 166 runc itself is sending the MAINPID. https://github.com/opencontainers/runc/blob/9748b48742fad065666bcded00f2db5ded7e44c1/notify_socket.go#L140-L170
Yes, correct: runc is sending the MAINPID. The MAINPID sent by runc is the PID of the container.
For instance, if I were to use this for my tomcat based web apps, it may take 30s-2min to initialize, and then I'd run health checks... I wouldn't want all of podman to be locked while that's happening.
The assumption is that the container is not ready until it notifies that it is. We should probably fix the healthchecks to raise an error immediately if the container is not running.
@giuseppe Am I reading this wrong? It looks like at line 166 runc itself is sending the MAINPID. https://github.com/opencontainers/runc/blob/9748b48742fad065666bcded00f2db5ded7e44c1/notify_socket.go#L140-L170
Yes, correct: runc is sending the MAINPID. The MAINPID sent by runc is the PID of the container.
Except it doesn't seem to be....
runc --version
runc version 1.0.0-rc10+dev
commit: 2a0466958d9af23af2ad12bd79d06ed0af4091e2
spec: 1.0.2-dev
One screen: socat unix-recvfrom:/run/nottest,fork -
Another screen:
env NOTIFY_SOCKET=/run/nottest podman --log-level debug run --rm -it alpine sh
Another screen: for i in /run/crun/*/notify/notify /run/runc/*/notify/notify.sock; do env NOTIFY_SOCKET=$i systemd-notify --ready; done
All that socat shows is "READY=1"
I don't know what the tinker.C part means, maybe it's a spec option or a command line option?
It appears that systemd DOES accept MAINPID multiple times. Each time MAINPID is sent it switches, throwing a cgroups warning if need be. (runc start is apparently forked off, not surprisingly)
I'm fine sending MAINPID at conmon launch, and again after runc start succeeds to reset it if the OCI runtime played tricks on us.
For instance, if I were to use this for my tomcat based web apps, it may take 30s-2min to initialize, and then I'd run health checks... I wouldn't want all of podman to be locked while that's happening.
The assumption is that the container is not ready until it notifies that it is. We should probably fix the healthchecks to raise an error immediately if the container is not running.
I 100% agree with that. I also don't care if podman exec -it container_in_question blocks (except that ctrl-c doesn't work - that should probably be fixed). I DO have an issue with podman info blocking, and podman ps. Wasn't the point of this architecture vs. docker that containers were more independent and less susceptible to this sort of coupling between containers?
We take and hold a lock on the container in question for all major operations. This provides strong safety against races (two podman start commands on the same container launched at the same time will not be an issue). podman ps also grabs the lock (to get the current status of the container), and it looks like info does the same. Normally this isn't a problem, because every one of the locked operations takes less than a second, and then the lock is released and goes to another process. Unfortunately, the decision by the OCI runtime to not exit until SD_NOTIFY has succeeded means that podman start can hang for a very long time in that case. I'm still of the opinion that this is an OCI runtime issue, but it doesn't seem like one that runc has any intention of addressing.
@mheon If you don't like how runc handles it, then it could be taken out of the OCI runtime's responsibility.
1) Have podman mask out NOTIFY_SOCKET from the OCI runtime, but pass it to conmon
2) Have conmon create a notify socket, intercept READY and pass it on to the host's notify socket, like runc/crun do now.
3) Have podman inject the bind mount to the socket conmon is proxying, and set NOTIFY_SOCKET properly in the container's env (not the runtime's env)
I imagine that would do the trick... But it seems hacky to implement behavior in podman because the OCI runtime doesn't do what we want.
It depends on what you think should happen - when you do a podman start, and the container isn't READY, has it started? I'd say yes, because sd-notify is specifically meant to talk to systemd, not OCI runtimes. The container started.
If you want to verify it's READY, you should use systemctl status to verify it's ready. Or, monitor the unix datagram yourself. IMHO the runtime and podman shouldn't block waiting for READY=1.
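A rough sketch of what the proxying in steps 1-3 above could look like, assuming a helper (conmon in this proposal) binds a datagram socket that podman bind-mounts into the container and relays whatever arrives to the host's NOTIFY_SOCKET. The socket path is made up, and the real implementation would be C inside conmon; this is only to illustrate the mechanics:

```python
import os
import socket

# Hypothetical path podman would bind-mount into the container and advertise
# to it as NOTIFY_SOCKET; not an actual podman/conmon path.
CONTAINER_SOCKET = "/run/notify-proxy/notify"

def proxy_notify():
    host_socket = os.environ["NOTIFY_SOCKET"]  # systemd's socket, kept away from the OCI runtime
    if host_socket.startswith("@"):
        host_socket = "\0" + host_socket[1:]   # abstract socket namespace

    os.makedirs(os.path.dirname(CONTAINER_SOCKET), exist_ok=True)
    recv = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    recv.bind(CONTAINER_SOCKET)
    send = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)

    while True:
        # Relay the container's notifications (READY=1, STATUS=..., etc.)
        # to systemd without blocking podman or the OCI runtime.
        data = recv.recv(4096)
        send.sendto(data, host_socket)

if __name__ == "__main__":
    proxy_notify()
```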
That is quite tempting, for more than one reason - we've had occasional problems with the implementation in runc having bugs and upstream not merging fixes for said bugs, so entirely removing this from the runtime lets us do this in a more sensible, nonblocking way and circumvent buggy runtimes.
I am a bit concerned about the complexity this would bring to conmon, and the difficulty of modifying the environment of the running container (going to be difficult doing that in a portable fashion).
That is quite tempting, for more than one reason - we've had occasional problems with the implementation in runc having bugs and upstream not merging fixes for said bugs, so entirely removing this from the runtime lets us do this in a more sensible, nonblocking way and circumvent buggy runtimes. I am a bit concerned about the complexity this would bring to conmon, and the difficulty of modifying the environment of the running container (going to be difficult doing that in a portable fashion).
I imagine it would end up in specgen somewhere - injecting into the container's ENV and Mounts. This is very similar to what I was playing with w.r.t. syslog proxying.
https://github.com/containers/conmon/pull/177/files
It was much longer than I thought it should be, mainly because it's C, it's glib, and it's nonblocking... And everything in conmon so far was based on stream sockets. If conmon gained the ability to do dgram sockets, it could do specific proxying for NOTIFY_SOCKET and generic proxying for things like /dev/log.
Ultimately I replaced it with an OCI hook shell script that was about 20 lines long, which seemed less complex.
I have less understanding of how best to inject mounts and env in libpod - perhaps in specgen somewhere... It would just be another bind mount that libpod handles (/etc/hosts, /etc/resolv.conf, /run/.containerenv, /run/notify) and it would bind to the exact same directory - userdata. It could even always be /run/notify in the container.
Huh. I was under the impression that the OCI runtime automatically hijacked SD_NOTIFY when it found it - if that's not the case, and we can do it ourselves, that's not bad at all. My main fear was that we'd have to trick the runtime into thinking there was no socket.
Well, let's test. We all know what happens when runc and crun receive NOTIFY_SOCKET in their own environment. If runc and crun also intercepted the spec's environment... then this wouldn't happen:
podman run --rm -it -e NOTIFY_SOCKET=/tmp/notify alpine sh
/ # printenv NOTIFY_SOCKET
/tmp/notify
/ # ls -l /tmp
total 0
Which means I can do things myself with my own bind mount....
socat unix-recvfrom:/tmp/notify - &
# podman run --rm -it -e NOTIFY_SOCKET=/tmp/notify -v /tmp/notify:/tmp/notify fedora sh
sh-5.0# systemd-notify --ready
READY=1sh-5.0#
That's runc...
Crun:
# podman --runtime /opt/podman/bin/crun run --rm -it -e NOTIFY_SOCKET=/tmp/notify alpine sh
/ # ls -l /tmp
total 0
/ # printenv NOTIFY_SOCKET
/tmp/notify
/ #
# podman --runtime /opt/podman/bin/crun run --rm -it -e NOTIFY_SOCKET=/tmp/notify -v /tmp/notify:/tmp/notify fedora sh
sh-5.0# systemd-notify --ready
READY=1sh-5.0# exit
Also, runc and crun only ever pass READY=1, but I don't see any reason why we couldn't also pass STATUS, RELOADING, STOPPING, ERRNO, or WATCHDOG items... I assume they're not passed now just because they aren't frequently implemented. I think we'd just need to decide which variables conmon owns (MAINPID being the obvious one; conmon could block MAINPID from the container and only ever send its own). Does it make sense for conmon to set up the watchdog for itself? etc.
I can work up a PoC and we can go from there - probably the first task would be to just get the 3 modes I did in the PR working without the OCI runtime doing any proxying, and only dealing with MAINPID and READY=1. Future expansion could build off of that. Doing it that way should solve the podman locking "problem" introduced by the OCI runtime.
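One way to make the "which variables does conmon own" question concrete is to filter the container's notification before relaying it: drop MAINPID (conmon always reports its own) and pass through a whitelist of keys. A sketch; the forwarded key set here is my assumption, not anything decided in this thread:

```python
# Keys a container could reasonably be allowed to pass straight through to
# systemd; this particular set is an assumption for illustration.
FORWARDED_KEYS = {"READY", "STATUS", "RELOADING", "STOPPING", "ERRNO", "WATCHDOG"}

def filter_notification(message: str) -> str:
    """Keep only whitelisted keys; drop MAINPID, which conmon owns and sends itself."""
    kept = []
    for line in message.splitlines():
        key, _, _ = line.partition("=")
        if key == "MAINPID":
            continue  # never trust a container-supplied MAINPID
        if key in FORWARDED_KEYS:
            kept.append(line)
    return "\n".join(kept)

# Example: the container tries to claim a MAINPID; only READY/STATUS survive.
print(filter_notification("READY=1\nSTATUS=Started\nMAINPID=1234"))
# -> READY=1
#    STATUS=Started
```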
@mheon Given https://github.com/containers/conmon/pull/182, the only changes to libpod are https://github.com/goochjj/libpod/pull/1 to handle SD-NOTIFY ourselves
(assuming the other sd-notify work as a baseline)
A friendly reminder that this issue had no activity for 30 days.
@goochjj @mheon @giuseppe Is this still an issue?
Yes, it is still an issue. The OCI runtime blocks when using sd-notify (container), and podman freezes because the starting container holds the lock.
I've provided a solution: containers/conmon#182 and goochjj#1.
@goochjj @mheon @giuseppe Is this still an issue?
I don't think it is an issue. Podman and the OCI runtime wait for the container to be ready. If the container doesn't support SD_NOTIFY, then the user must not set it.
And if the OCI runtime waits for 3-5 minutes for the container to start, and podman is completely unusable during that time, even for OTHER containers, you don't consider that an issue? podman ps and podman info and podman exec should all just freeze in place?
What's the design? What are the guiding principles?
In my mind:
1) One container should never influence operations on another container.
2) One container's startup should never freeze podman. Especially for non-container-specific operations like podman info
If these principles are incorrect, please let me know. If these principles are correct, then the current SD_NOTIFY support violates both. And I HAVE provided a solution. So the only questions are 1) Is my solution the right direction? 2) Are you going to merge them?
Making podman info query container status was a mistake in retrospect - it makes it too vulnerable to these sorts of lock contention issues, and we need it to function for debugging. We could go in and add a timeout on retrieving status, and return without it if it takes too long - but that's really a separate issue.
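To make that timeout idea concrete, something along these lines could work: run the status query in a worker and fall back to an "unknown" status if the container's lock is held too long. The names (container.query_status, the 2-second deadline) are hypothetical, and libpod would do this in Go rather than Python:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

STATUS_TIMEOUT = 2.0  # seconds: give up rather than let podman info hang

def status_with_timeout(container, timeout=STATUS_TIMEOUT):
    """Return the container's status, or 'unknown' if the query blocks too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(container.query_status)  # hypothetical blocking call
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            return "unknown"
    finally:
        # Don't wait for a stuck worker; let it finish (or not) in the background.
        pool.shutdown(wait=False)
```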
When we originally wrote Podman, one of the core assumptions was that the OCI runtime would exit quickly. As such, every core operation on containers that involves it holds the container lock for the full duration - which is much safer (it ensures that podman start and podman stop on the same container can never race). Unfortunately, this assumption blows up the moment sd-notify comes into play, as the OCI runtime decides that hanging for several minutes without returning is a quite acceptable thing, and Podman is stuck holding locks and hoping the container actually starts successfully.
I would really rather solve this outside of the OCI runtime (the Conmon patches from @goochjj seem like a good fit) and preserve our existing semantics, than trying to patch around an OCI runtime that doesn't tell me if a container started successfully for several minutes.
@giuseppe Do you have any objection to us supporting a separate, Conmon-based path? The OCI runtime approach is probably fine for other users but I don't think it's a good fit for Podman's architecture.
If we don't care about waiting for the OCI runtime's start to finish, couldn't we just run the "start" in the background?
I'd say we should not keep any lock during start. What kind of issues could that introduce? Could a new state "starting" solve them?
I'm really reluctant to do that - what if the OCI runtime fails to start (less likely on runc/crun admittedly, but I could easily see Kata or similar blowing up at start even if create succeeded earlier)? We'd stick the container into an awkward "starting" state where it could potentially be removed from under us while the OCI runtime is trying to start it, and hope that start succeeds in a halfway-reasonable time, while podman start is stuck and giving no output on the command line.
I think we've agreed to move forward with the conmon-based approach, with the option to use a flag to retain the existing behavior.
Which means we need to finally get that Conmon patch merged - sorry for the delay there.
Any progress on this @goochjj ?
It's all in #7607 - which was pending conmon's changes making it into the test environments.
Any idea if the new conmon is there now?
Still waiting for this to get merged.
A friendly reminder that this issue had no activity for 30 days.
Still waiting on this one.
A friendly reminder that this issue had no activity for 30 days.
Since Conmon is merged, I am going to assume this is fixed. Reopen if I am mistaken.
Are you sure this is fixed with conmon only? Don't we need #8508 too? I'm asking because I still see this behaviour on F34 with
I'm trying to get an sd_notify-aware service in a podman container to play nicely as a Type=notify systemd unit, preferably with watchdog and status text. I would love this kind of "perfect" integration.
Do you have an easy reproducer? We need an example of a program that uses sd-notify that we could try.
I just threw my unpolished experiment into a repo: https://github.com/flyingflo/podman_sdnotify
It also works with a bash script calling systemd-notify, but I used python later because the PID stays the same.
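For anyone who wants a reproducer without cloning the repo, the core of it is just a process that waits a while before writing READY=1 to NOTIFY_SOCKET. A minimal sketch using plain sockets (the linked repo uses the python sdnotify package, but the effect is the same):

```python
import os
import socket
import time

def notify(message: str) -> None:
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        raise SystemExit("NOTIFY_SOCKET is not set; run under --sdnotify / Type=notify")
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # abstract socket namespace
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)

# Simulate a slow startup; podman and the OCI runtime should not be blocked
# for the whole duration.
time.sleep(10)
notify("READY=1\nSTATUS=Started")

while True:
    time.sleep(5)
    notify("STATUS=Still running")
```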
I am not sure how this is supposed to work, but the first example I tried got connection refused with the container type; then I changed to the conmon type and I get:
× sdnotify_py.service - Podman container-sdnotify_py_ctr.service
Loaded: loaded (/usr/lib/systemd/system/sdnotify_py.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2021-07-16 08:57:12 EDT; 28s ago
Docs: man:podman-generate-systemd(1)
Process: 186726 ExecStartPre=/bin/rm -f /run/container-sdnotify_py_ctr.pid /run/container-sdnotify_py_ctr.ctr-id (code=exited, status=0/SUCCESS)
Process: 186897 ExecStart=/usr/bin/podman run --conmon-pidfile /run/container-sdnotify_py_ctr.pid --cidfile /run/container-sdnotify_py_ctr.ctr-id --cgroups=split --sdnotify=conmon -d --replace -a stderr -a stdout --log-driver=journal>
Process: 187055 ExecStopPost=/usr/bin/podman rm --ignore -f --cidfile /run/container-sdnotify_py_ctr.ctr-id (code=exited, status=219/CGROUP)
Main PID: 186897 (code=exited, status=1/FAILURE)
CPU: 566ms
Jul 16 08:57:12 localhost.localdomain conmon[186897]: File "/app/sdnotify_py.py", line 15, in <module>
Jul 16 08:57:12 localhost.localdomain conmon[186897]: n = sdnotify.SystemdNotifier(debug=True)
Jul 16 08:57:12 localhost.localdomain conmon[186897]: File "/usr/local/lib/python3.9/site-packages/sdnotify/__init__.py", line 37, in __init__
Jul 16 08:57:12 localhost.localdomain conmon[186897]: if addr[0] == '@':
Jul 16 08:57:12 localhost.localdomain conmon[186897]: TypeError: 'NoneType' object is not subscriptable
Jul 16 08:57:12 localhost.localdomain podman[186926]: 2021-07-16 08:57:12.67782836 -0400 EDT m=+0.058939985 container died ceee2ce16ccfd607d76ad7eecfcb487ed3cb84a990dbe5f2a04232b49ddf535c (image=localhost/sdnotify_py:latest, name=sdnotif>
Jul 16 08:57:12 localhost.localdomain podman[186926]: 2021-07-16 08:57:12.771301398 -0400 EDT m=+0.152413006 container cleanup ceee2ce16ccfd607d76ad7eecfcb487ed3cb84a990dbe5f2a04232b49ddf535c (image=localhost/sdnotify_py:latest, name=sdn>
Jul 16 08:57:12 localhost.localdomain systemd[1]: sdnotify_py.service: Main process exited, code=exited, status=1/FAILURE
Jul 16 08:57:12 localhost.localdomain systemd[1]: sdnotify_py.service: Control process exited, code=exited, status=219/CGROUP
Jul 16 08:57:12 localhost.localdomain systemd[1]: sdnotify_py.service: Failed with result 'exit-code'.
@haircommander Any idea what is going on here? It looks like conmon is not listening on the notify socket?
gonna be honest, I do not understand the sd notify code haha
@flyingflo are you able to test using that PR? The phrasing "blocked by oci runtime" in the PR implies there is a missing link that we may need...
Actually @flyingflo your example is working for me, although your python script is failing after the first sdnotify.
# Inform systemd that we've finished our startup sequence...
n = sdnotify.SystemdNotifier(debug=True)
n.notify("READY=1\nSTATUS=Started")
This looks successful, then systemd or conmon drops the connection
count = 1
while True:
    print("Running... {}".format(count))
    try:
        n.notify("STATUS=Count is {}".format(count))
This notify fails, because systemd or conmon has dropped the connection.
Does this test work differently if you don't run it in a container? Is conmon dropping the connection when it should not?
With my test, I can clearly observe the behaviour described in this issue. The crun process sticks around (observed with systemctl status) until the python script sends READY after 10s. During this time podman is blocked completely (e.g. podman ps), which is what this issue is about, if I understood correctly.
The second observation is that all writes to the socket fail after the first write with READY=1, after which crun terminates.
I believe this is because what I see is crun proxying the socket, not conmon.
The intention of @goochjj in #8508 and https://github.com/containers/conmon/pull/182 is that conmon proxies the socket and that the OCI runtime is circumvented.
I'm not sure if that is correct, because I'm looking into this for the first time, but I think we need #8508 before https://github.com/containers/conmon/pull/182 takes effect with podman, because it prepares the environments for conmon and the OCI runtime appropriately.
Wait - #8508 isn't merged yet?
@vrothberg Didn't we just swap generate systemd to use notify by default? If so, and #8508 is not merged, I think we need an immediate revert of that. The behavior of sdnotify when the OCI runtime is used is not sane.
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
Creating a service using Type=notify freezes everything until the container is ready.
Setting a service as Type=notify sets the NOTIFY_SOCKET, which gets passed through podman properly. Runc and crun then proxy that NOTIFY_SOCKET through to the container, indicating that the container will signal when it is ready. The whole idea is that starting a container does not mean the container is ready; this initialization could take seconds, tens of seconds, or minutes. And it shouldn't matter how long it takes.
The problem is while it's in "starting" status, podman is frozen completely.
podman ps doesn't return, podman exec doesn't work, even podman info won't return. One partially initialized container shouldn't freeze everything, and the lack of exec makes it hard to diagnose what's going on inside the container to resolve the sd-notify issue. podman stop and podman kill appear to work, but the container is still stuck. In addition, the MAINPID isn't set right - but we'll come back to that.
Steps to reproduce the issue:
[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
SyslogIdentifier=%N
ExecStartPre=-/usr/bin/podman stop %N
ExecStartPre=-/usr/bin/podman rm %N
LogExtraFields=CONTAINER_NAME=%N
ExecStart=/usr/bin/podman --log-level=debug run \
    -d --log-driver=journald \
    --init \
    --cgroups no-conmon \
    --net=host \
    --name %N \
    alpine sleep infinity
ExecStop=/usr/bin/podman stop -t 20 %N
Type=notify
NotifyAccess=all
Restart=on-failure
Restart=always
RestartSec=30s
TimeoutStartSec=20
TimeoutStopSec=25
KillMode=none
Type=forking
PIDFile=/run/podman-pid-%n
Delegate=yes
Slice=machine.slice

[Install]
WantedBy=multi-user.target default.target
notifytest[8595]: time="2020-06-19T13:22:38Z" level=debug msg="Starting container e6043f58bcd610d1e448739f2120447f2880c9b498c65fc3c181e1f453a48ef7 with command
Main PID: 12003 (podman)
   Tasks: 22 (limit: 4915)
  Memory: 27.9M
  CGroup: /machine.slice/notifytest.service
          ├─12003 /usr/share/gocode/src/github.com/containers/libpod/bin/podman --log-level=debug run -d --log-driver=journald --init --cgroups no-conmon --net=host --name notifytest alpine sleep infinity
          ├─12065 /usr/libexec/podman/conmon --api-version 1 -c a3e79ea772bdcca69020eca158f059718ff0f4b34dd1c8f8af5e77c6840e60f0 -u a3e79ea772bdcca69020eca158f059718ff0f4b34dd1c8f8af5e77c6840e60f0 -r /usr/bin/runc -b /va>
          └─12084 /usr/bin/runc start a3e79ea772bdcca69020eca158f059718ff0f4b34dd1c8f8af5e77c6840e60f0
Version:      2.0.0-dev
API Version:  1
Go Version:   go1.13.3
Git Commit:   b27df834c18b08bb68172fa5bd5fd12a5cd54633
Built:        Thu Jun 18 12:19:01 2020
OS/Arch:      linux/amd64
host:
  arch: amd64
  buildahVersion: 1.15.0
  cgroupVersion: v1
  conmon:
    package: Unknown
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.18-dev, commit: 954b05a7908c0aeeff007ebd19ff662e20e5f46f'
  cpus: 4
  distribution:
    distribution: photon
    version: "3.0"
  eventLogger: file
  hostname: photon-machine
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.19.115-6.ph3-esx
  linkmode: dynamic
  memFree: 5536968704
  memTotal: 8359960576
  ociRuntime:
    name: runc
    package: runc-1.0.0.rc9-2.ph3.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.0.0-rc10+dev
      commit: 2a0466958d9af23af2ad12bd79d06ed0af4091e2
      spec: 1.0.2-dev
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  rootless: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 0
  swapTotal: 0
  uptime: 25h 9m 27.25s (Approximately 1.04 days)
registries:
  search:
Package info (e.g. output of rpm -q podman or apt list podman): Compiled from source.
Additional environment details (AWS, VirtualBox, physical, etc.):