Can you try the commands from the systemd unit (most notably the ExecStartPre and ExecStart ones) manually, with --log-level=debug added, and provide the logs? That should help us identify what is going on.
@vrothberg PTAL
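For reference, this is roughly what that looks like for the user service (a sketch; the paths assume UID 1000 and the unit name used later in this thread). Note that --log-level is a global podman option, so it needs to come before the image name; anything after the image is passed to the container's command instead:

/usr/bin/rm -f /run/user/1000/redhat_ex.service-pid /run/user/1000/redhat_ex.service-cid
/usr/bin/podman --log-level=debug run --conmon-pidfile /run/user/1000/redhat_ex.service-pid --cidfile /run/user/1000/redhat_ex.service-cid -d alpine:latest top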
Thanks for the fast reply. The ExecStartPre and ExecStart commands on their own work fine in both cases.
[fedora@fedora-31-cloud-podman user]$ /usr/bin/rm -f //run/user/1000/redhat_ex.service-pid //run/user/1000/redhat_ex.service-cid
[fedora@fedora-31-cloud-podman user]$ /usr/bin/podman run --conmon-pidfile //run/user/1000/redhat_ex.service-pid --cidfile //run/user/1000/redhat_ex.service-cid -d alpine:latest top --log-level=debug
6ee53e0507479f608f9c6fe4a6200b558a94dd90866db5240ef081454a89dd78
[fedora@fedora-31-cloud-podman user]$
Interestingly, the same .service file is running fine as systemd system service.
[fedora@fedora-31-cloud-podman systemd]$ sudo cp /home/fedora/.config/systemd/user/redhat_ex.service /etc/systemd/system/
[fedora@fedora-31-cloud-podman systemd]$ sudo systemctl daemon-reload
[fedora@fedora-31-cloud-podman systemd]$ sudo systemctl start redhat_ex.service
[fedora@fedora-31-cloud-podman systemd]$ sudo systemctl status redhat_ex.service
● redhat_ex.service - Podman in Systemd
Loaded: loaded (/etc/systemd/system/redhat_ex.service; disabled; vendor preset: disabled)
Active: active (running) since Fri 2020-01-10 16:12:58 UTC; 8s ago
Process: 1329 ExecStartPre=/usr/bin/rm -f //run/redhat_ex.service-pid //run/redhat_ex.service-cid (code=exited, status=0/SUCCESS)
Process: 1330 ExecStart=/usr/bin/podman run --conmon-pidfile //run/redhat_ex.service-pid --cidfile //run/redhat_ex.service-cid -d alpine:latest top (code=exited, status=0/SUCCESS)
Main PID: 1462 (conmon)
Tasks: 0 (limit: 4678)
Memory: 33.0M
CPU: 892ms
CGroup: /system.slice/redhat_ex.service
‣ 1462 /usr/bin/conmon --api-version 1 -s -c 3b7eb5fd0547f8ac67b8106a3ac51cd2c4a9be8ced6bafd9850408c47399bce6 -u 3b7eb5fd0547f8ac67b8106a3ac51cd2c4a9be8ced6bafd9850408c47399bce6 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/3b7eb5fd0547f8ac67b8106a3ac51cd2c4a9be8ced6bafd9850408c47399b>
Jan 10 16:12:58 fedora-31-cloud-podman.novalocal systemd[1]: redhat_ex.service: Child 1330 belongs to redhat_ex.service.
Jan 10 16:12:58 fedora-31-cloud-podman.novalocal systemd[1]: redhat_ex.service: Control process exited, code=exited, status=0/SUCCESS
Jan 10 16:12:58 fedora-31-cloud-podman.novalocal systemd[1]: redhat_ex.service: Got final SIGCHLD for state start.
Jan 10 16:12:58 fedora-31-cloud-podman.novalocal systemd[1]: redhat_ex.service: New main PID 1462 does not belong to service, but we'll accept it since PID file is owned by root.
Jan 10 16:12:58 fedora-31-cloud-podman.novalocal systemd[1]: redhat_ex.service: Main PID loaded: 1462
Jan 10 16:12:58 fedora-31-cloud-podman.novalocal systemd[1]: redhat_ex.service: Changed start -> running
Jan 10 16:12:58 fedora-31-cloud-podman.novalocal systemd[1]: redhat_ex.service: Job 604 redhat_ex.service/start finished, result=done
Jan 10 16:12:58 fedora-31-cloud-podman.novalocal systemd[1]: Started Podman in Systemd.
Jan 10 16:12:58 fedora-31-cloud-podman.novalocal systemd[1]: redhat_ex.service: Failed to send unit change signal for redhat_ex.service: Connection reset by peer
Jan 10 16:12:58 fedora-31-cloud-podman.novalocal systemd[1]: redhat_ex.service: Control group is empty.
@choeffer, can you share the service file or provide one as a reproducer?
This is the one I am using:
[Unit]
Description=Podman in Systemd
[Service]
Restart=no
ExecStartPre=/usr/bin/rm -f /%t/%n-pid /%t/%n-cid
ExecStart=/usr/bin/podman run --conmon-pidfile /%t/%n-pid --cidfile /%t/%n-cid -d alpine:latest top
ExecStop=/usr/bin/sh -c "/usr/bin/podman rm -f `cat /%t/%n-cid`"
KillMode=none
Type=forking
PIDFile=/%t/%n-pid
[Install]
WantedBy=multi-user.target
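As a usage sketch (assuming the file is saved as redhat_ex.service; for a user service, %t expands to /run/user/<uid> and %n to the unit name redhat_ex.service):

mkdir -p ~/.config/systemd/user
cp redhat_ex.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user start redhat_ex.service
systemctl --user status redhat_ex.service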
I can't reproduce. It works on my machine. I am on cgroups v1 though.
It is related to cgroups v2. I can reproduce the issue locally, see below:
systemd[1426]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
systemd[1426]: alpine.service: Found left-over process 5227 (fuse-overlayfs) in control group while starting unit. Ignoring.
systemd[1426]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
systemd[1426]: alpine.service: Found left-over process 5230 (slirp4netns) in control group while starting unit. Ignoring.
systemd[1426]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
... AND ...
systemd[1426]: alpine.service: New main PID 5233 does not belong to service, and PID file is not owned by root. Refusing.
It works when setting --cgroups=disabled in podman run. @giuseppe, do you know what's going on?
Can confirm, with --cgroups=disabled it also works with podman 1.7.0 again.
[fedora@fedora-31-cloud-podman user]$ cat redhat_ex.service
[Unit]
Description=Podman in Systemd
[Service]
Restart=no
ExecStartPre=/usr/bin/rm -f /%t/%n-pid /%t/%n-cid
ExecStart=/usr/bin/podman run --cgroups=disabled --conmon-pidfile /%t/%n-pid --cidfile /%t/%n-cid -d alpine:latest top
ExecStop=/usr/bin/sh -c "/usr/bin/podman rm -f `cat /%t/%n-cid`"
KillMode=none
Type=forking
PIDFile=/%t/%n-pid
[Install]
WantedBy=multi-user.target
[fedora@fedora-31-cloud-podman user]$ systemctl --user status redhat_ex.service
● redhat_ex.service - Podman in Systemd
Loaded: loaded (/home/fedora/.config/systemd/user/redhat_ex.service; disabled; vendor preset: enabled)
Active: active (running) since Fri 2020-01-10 16:46:00 UTC; 4min 21s ago
Process: 1573 ExecStartPre=/usr/bin/rm -f //run/user/1000/redhat_ex.service-pid //run/user/1000/redhat_ex.service-cid (code=exited, status=0/SUCCESS)
Process: 1574 ExecStart=/usr/bin/podman run --cgroups=disabled --conmon-pidfile //run/user/1000/redhat_ex.service-pid --cidfile //run/user/1000/redhat_ex.service-cid -d alpine:latest top (code=exited, status=0/SUCCESS)
Main PID: 1592 (conmon)
Tasks: 8 (limit: 4678)
Memory: 77.4M
CPU: 406ms
CGroup: /user.slice/user-1000.slice/user@1000.service/redhat_ex.service
├─ 720 /usr/bin/podman
├─1263 /usr/bin/fuse-overlayfs -o lowerdir=/home/fedora/.local/share/containers/storage/overlay/l/RELFBA7TL6P5UYS2YRPZYUHHQV,upperdir=/home/fedora/.local/share/containers/storage/overlay/dad442dadfc9880c972793590dbec2420c50e1192bc8c038ebf23ec6fad17e13/diff,workdir=/home/fedora/.local/share/containers/stor>
├─1266 /usr/bin/slirp4netns --disable-host-loopback --mtu 65520 -c -e 3 -r 4 --netns-type=path /run/user/1000/netns/cni-0f5ef442-e2e6-af7d-723f-6d950eea572a tap0
├─1587 /usr/bin/fuse-overlayfs -o lowerdir=/home/fedora/.local/share/containers/storage/overlay/l/RELFBA7TL6P5UYS2YRPZYUHHQV,upperdir=/home/fedora/.local/share/containers/storage/overlay/43130682adb50ba6d984dd14e06c656ac9617f27adff49b17562e96ced1a4d59/diff,workdir=/home/fedora/.local/share/containers/stor>
├─1588 /usr/bin/slirp4netns --disable-host-loopback --mtu 65520 -c -e 3 -r 4 --netns-type=path /run/user/1000/netns/cni-f6f876b7-76c3-ea21-a122-ed6ab14bd9d2 tap0
├─1592 /usr/bin/conmon --api-version 1 -c 31f38e9bac3798edf6f9c6c4500d04940d48ab1736c78606086c4eb49eb0ccc2 -u 31f38e9bac3798edf6f9c6c4500d04940d48ab1736c78606086c4eb49eb0ccc2 -r /usr/bin/crun -b /home/fedora/.local/share/containers/storage/overlay-containers/31f38e9bac3798edf6f9c6c4500d04940d48ab1736c7860>
└─1598 top
Jan 10 16:46:00 fedora-31-cloud-podman.novalocal systemd[658]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jan 10 16:46:00 fedora-31-cloud-podman.novalocal systemd[658]: redhat_ex.service: Found left-over process 1263 (fuse-overlayfs) in control group while starting unit. Ignoring.
Jan 10 16:46:00 fedora-31-cloud-podman.novalocal systemd[658]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jan 10 16:46:00 fedora-31-cloud-podman.novalocal systemd[658]: redhat_ex.service: Found left-over process 1266 (slirp4netns) in control group while starting unit. Ignoring.
Jan 10 16:46:00 fedora-31-cloud-podman.novalocal systemd[658]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jan 10 16:46:00 fedora-31-cloud-podman.novalocal podman[1574]: 2020-01-10 16:46:00.450713326 +0000 UTC m=+0.071489446 container create 31f38e9bac3798edf6f9c6c4500d04940d48ab1736c78606086c4eb49eb0ccc2 (image=docker.io/library/alpine:latest, name=agitated_villani)
Jan 10 16:46:00 fedora-31-cloud-podman.novalocal podman[1574]: 2020-01-10 16:46:00.710006256 +0000 UTC m=+0.330782378 container init 31f38e9bac3798edf6f9c6c4500d04940d48ab1736c78606086c4eb49eb0ccc2 (image=docker.io/library/alpine:latest, name=agitated_villani)
Jan 10 16:46:00 fedora-31-cloud-podman.novalocal podman[1574]: 2020-01-10 16:46:00.715430062 +0000 UTC m=+0.336206192 container start 31f38e9bac3798edf6f9c6c4500d04940d48ab1736c78606086c4eb49eb0ccc2 (image=docker.io/library/alpine:latest, name=agitated_villani)
Jan 10 16:46:00 fedora-31-cloud-podman.novalocal podman[1574]: 31f38e9bac3798edf6f9c6c4500d04940d48ab1736c78606086c4eb49eb0ccc2
Jan 10 16:46:00 fedora-31-cloud-podman.novalocal systemd[658]: Started Podman in Systemd.
@choeffer thanks for confirming! We'll have a look at the issue (I'm currently bisecting).
I opened https://github.com/containers/libpod/pull/4835 to revert the change introducing the regression.
@giuseppe, how shall we tackle this? Move slirp and overlayfs into other cgroups?
> @giuseppe, how shall we tackle this? Move slirp and overlayfs into other cgroups?
I think the correct solution would be to not create a new cgroup at all when running directly from systemd; it is pointless to create a new one and then tell systemd to monitor a process from another cgroup.
The issue is that I don't know of any way to detect whether we are running from a systemd service, so we would have to invent our own way to signal it. It could be a simple environment variable, or, if we want to make it more generic, we could add something like --cgroups=enabled-no-conmon or --cgroups-conmon=disabled.
What do you think?
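To make that concrete, the unit above might then look something like this (purely hypothetical; neither flag exists at this point, this is just the shape of the proposal):

ExecStart=/usr/bin/podman run --cgroups=enabled-no-conmon --conmon-pidfile /%t/%n-pid --cidfile /%t/%n-cid -d alpine:latest top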
I consider this a regression, as it works with pre-1.7 versions. Another F31 user reached out on IRC today, and I would not want to introduce this in RHEL. So I'd love to figure out a way to detect whether we're running in a systemd service.
More recent versions of systemd set some environment variables (see https://serverfault.com/a/927481), but I think that might not work in all circumstances. A service might run a script which runs some containers, where Podman is not meant to be the main PID. I think we could look at the parent PID of Podman and check whether that's systemd. Would that work?
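As a rough shell sketch of that idea (not Podman code, just an illustration of checking whether the direct parent process is systemd):

if [ "$(cat /proc/$PPID/comm 2>/dev/null)" = "systemd" ]; then
    echo "parent is systemd"
else
    echo "parent is not systemd"
fi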
What happens if the user does something that requires creating a cgroup, though? Sets resource limits on the Podman command line, uses pid=host, etc?
Can't we look at the current cgroup and figure out if it is a service cgroup?
> Can't we look at the current cgroup and figure out if it is a service cgroup?
In some cases that might be okay, e.g., when Podman/conmon is not the main PID of the service. Checking the parent PID is, afaics, more reliable.
> What happens if the user does something that requires creating a cgroup, though? Sets resource limits on the Podman command line, uses pid=host, etc?
It seems redundant to what systemd provides, but we must make sure not to regress. Tricky!
> What happens if the user does something that requires creating a cgroup, though? Sets resource limits on the Podman command line, uses pid=host, etc?
That should not matter for the conmon cgroup; we don't set any limits there. We will still create a cgroup for the container payload.
That is why I was suggesting another mode that disables cgroups only for conmon, as opposed to what we do now with disabled, where the container itself does not get a new cgroup either.
> In some cases that might be okay, e.g., when Podman/conmon is not the main PID of the service. Checking the parent PID is, afaics, more reliable.
Yes, agreed. For example, on Fedora 31 my console session has this cgroup:
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/user@1000.service/dbus\x2d:1.2\x2dcom.gexperts.Tilix.slice/dbus-:1.2-com.gexperts.Tilix@0.service
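For comparison, a sketch of the cgroup-based check (cgroup v2 only; as the Tilix example above shows, an ordinary session cgroup can also end in .service, which is why this is not reliable on its own):

case "$(cut -d: -f3- /proc/self/cgroup)" in
  *.service) echo "current cgroup looks like a service cgroup" ;;
  *) echo "current cgroup does not look like a service cgroup" ;;
esac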
Ahhh.
Could we just disable creating a distinct cgroup for conmon in rootless entirely? I can't think of any advantages to having it in a distinct cgroup.
> Could we just disable creating a distinct cgroup for conmon in rootless entirely? I can't think of any advantages to having it in a distinct cgroup.
Unfortunately we need that, as the user creating the container must own the current cgroup; otherwise it won't be possible to move the container process into the correct cgroup (you must control both the source and destination cgroups). There are cases where that is not true, since the current cgroup is owned by root. We were not creating it on cgroup v1, as we could not use cgroups there anyway; this changed for cgroup v2, where we started to use cgroups for rootless as well.
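To illustrate the ownership point, on cgroup v2 one can check who owns the current cgroup directory, which determines whether a rootless user can create children and move processes between them (a sketch; the path handling is simplified):

ls -ld "/sys/fs/cgroup$(cut -d: -f3- /proc/self/cgroup)"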
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
Followed https://www.redhat.com/sysadmin/podman-shareable-systemd-services to create a rootless container systemd user service. On Fedora 31 with podman 1.6.2 it works fine; on Fedora 31 with podman 1.7.0 it fails.
Steps to reproduce the issue:
1. Create a systemd user service, as described in the link above.
2. Start the service.
3. Podman 1.6.2 works, 1.7.0 fails.
Describe the results you received:
on 1.7.0 I get -> failed (Result: timeout)
Describe the results you expected:
on 1.6.2 I get -> active (running)
Additional information you deem important (e.g. issue happens only occasionally):
Output of podman version:

Output of podman info --debug:

Package info (e.g. output of rpm -q podman or apt list podman):

Additional environment details (AWS, VirtualBox, physical, etc.):

user config: