containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Containers started using socket-activated APIv2 die from systemd activation timeout #7294

Closed danopia closed 4 years ago

danopia commented 4 years ago

/kind bug

Description

When creating and starting containers specifically via the /run/podman/podman.sock endpoint, the containers are stopped when podman.service gets shut down due to socket activation timeout. This manifests as containers stopping 10s after the last APIv2 request is made. The shutdown is clean; podman ps -a shows exit code 0 for them.

Opening an APIv2 events stream, or issuing APIv2 requests every ~5 seconds, keeps the containers alive longer; but once those requests stop, the containers still die shortly afterwards.

Steps to reproduce the issue:

Minimal case assuming you have podman's systemd units installed already:

  1. Check you have the socket unit ready: systemctl enable --now podman.socket

  2. Create a sleep container:

    1. curl --unix-socket /run/podman/podman.sock -XPOST http://d/v1.0.0/libpod/images/pull'?reference=k8s.gcr.io/pause:3.2'
    2. curl --unix-socket /run/podman/podman.sock -XPOST http://d/v1.0.0/libpod/containers/create --header 'content-type: application/json' --data '{"image":"k8s.gcr.io/pause:3.2","name":"bg-test"}'
  3. Start watching events: podman events &

  4. Start the container: curl --unix-socket /run/podman/podman.sock -XPOST http://d/v1.0.0/libpod/containers/bg-test/start
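The curl calls above can also be driven from Python's standard library, which may be handy when scripting against the socket. This is an illustrative sketch (the `UnixHTTPConnection` helper is my own, not part of podman or its client libraries); the live calls are left commented out since they need an active `podman.socket`:

```python
import http.client
import json
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client.HTTPConnection over a Unix socket.

    The hostname ("d", as in the curl examples above) is only used
    for the Host: header; the daemon ignores it.
    """
    def __init__(self, socket_path: str):
        super().__init__("d")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)

def libpod_post(conn, url, body=None):
    """POST to a libpod endpoint, optionally with a JSON body."""
    headers = {"content-type": "application/json"} if body is not None else {}
    conn.request("POST", url,
                 body=json.dumps(body) if body is not None else None,
                 headers=headers)
    resp = conn.getresponse()
    return resp.status, resp.read()

# The repro steps, assuming podman.socket is active:
# conn = UnixHTTPConnection("/run/podman/podman.sock")
# libpod_post(conn, "/v1.0.0/libpod/images/pull?reference=k8s.gcr.io/pause:3.2")
# libpod_post(conn, "/v1.0.0/libpod/containers/create",
#             {"image": "k8s.gcr.io/pause:3.2", "name": "bg-test"})
# libpod_post(conn, "/v1.0.0/libpod/containers/bg-test/start")
```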

Describe the results you received:

2020-08-11 14:52:21.101145183 +0000 UTC container init 1d1018fa7c028dd9341df4fd26d3b9d04a719ca4b4913da872f9c6d408a831e5 (image=k8s.gcr.io/pause:3.2, name=bg-test)
2020-08-11 14:52:21.114121047 +0000 UTC container start 1d1018fa7c028dd9341df4fd26d3b9d04a719ca4b4913da872f9c6d408a831e5 (image=k8s.gcr.io/pause:3.2, name=bg-test)
[ ... 10 seconds pass ... ]
2020-08-11 14:52:31.215222381 +0000 UTC container died 1d1018fa7c028dd9341df4fd26d3b9d04a719ca4b4913da872f9c6d408a831e5 (image=k8s.gcr.io/pause:3.2, name=bg-test)
2020-08-11 14:52:31.247591086 +0000 UTC container cleanup 1d1018fa7c028dd9341df4fd26d3b9d04a719ca4b4913da872f9c6d408a831e5 (image=k8s.gcr.io/pause:3.2, name=bg-test)
# podman ps -a
CONTAINER ID  IMAGE                 COMMAND  CREATED         STATUS                    PORTS   NAMES
1d1018fa7c02  k8s.gcr.io/pause:3.2           31 seconds ago  Exited (0) 18 seconds ago         bg-test

Describe the results you expected:

2020-08-11 14:52:21.101145183 +0000 UTC container init 1d1018fa7c028dd9341df4fd26d3b9d04a719ca4b4913da872f9c6d408a831e5 (image=k8s.gcr.io/pause:3.2, name=bg-test)
2020-08-11 14:52:21.114121047 +0000 UTC container start 1d1018fa7c028dd9341df4fd26d3b9d04a719ca4b4913da872f9c6d408a831e5 (image=k8s.gcr.io/pause:3.2, name=bg-test)
[ ... nothing further ]
# podman ps -a
CONTAINER ID  IMAGE                 COMMAND  CREATED        STATUS            PORTS   NAMES
1d1018fa7c02  k8s.gcr.io/pause:3.2           2 minutes ago  Up 2 minutes ago          bg-test

Alternative repro within docker:

I also constructed a self-contained repro within docker. It isn't too minimal, but it should show the problem on any machine with Docker available (I'm not sure how relevant the Linux distro is). You can copy the whole block, and it should show the container dying during the final sleep call.

# bring up a systemd container
docker run --name test -d --privileged --rm --tmpfs /tmp --tmpfs /run --tmpfs /run/lock -v /sys/fs/cgroup:/sys/fs/cgroup:ro jrei/systemd-debian:sid
# install podman (need to ask for runc explicitly because apt apparently prefers crun due to alphabetic order, while podman prefers runc)
docker exec test sh -euxc "apt update; apt install -y podman runc curl | grep podman; systemctl start podman.socket"
# create a sleep container
docker exec test curl --unix-socket /run/podman/podman.sock 'http://d/v1.0.0/libpod/images/pull?reference=k8s.gcr.io/pause:3.2' -XPOST
docker exec test curl --unix-socket /run/podman/podman.sock http://d/v1.0.0/libpod/containers/create -XPOST --header 'content-type: application/json' --data '{"image":"k8s.gcr.io/pause:3.2","name":"bg-test","netns":{"nsmode":"host"}}'
# watch event stream & start the container 
docker exec test podman events &
docker exec test curl --unix-socket /run/podman/podman.sock http://d/v1.0.0/libpod/containers/bg-test/start -XPOST
# ... observe the container stop after 10s
sleep 15s
# cleanup our repro
docker rm -f test

I also tried to repro on the fedora:32 image, but fedora's podman wouldn't run within docker; it said something about overlay-on-overlay. I didn't debug further.

Additional information you deem important (e.g. issue happens only occasionally):

Pod infra containers don't seem affected, when the /pods/.../start API is used.

I wasn't able to repro when podman system service is run manually from an interactive session. I've only seen this happen specifically when the podman daemon's lifecycle is managed by systemd.

Output of podman version:

Version:      2.0.4
API Version:  1
Go Version:   go1.14
Built:        Wed Dec 31 16:00:00 1969
OS/Arch:      linux/amd64
Output of `podman info --debug`:

```
host:
  arch: amd64
  buildahVersion: 1.15.0
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.18, commit: '
  cpus: 4
  distribution:
    distribution: debian
    version: "10"
  eventLogger: file
  hostname: redapu
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.19.0-9-amd64
  linkmode: dynamic
  memFree: 871215104
  memTotal: 4121145344
  ociRuntime:
    name: runc
    package: 'runc: /usr/sbin/runc'
    path: /usr/sbin/runc
    version: |-
      runc version 1.0.0~rc6+dfsg1
      commit: 1.0.0~rc6+dfsg1-3
      spec: 1.0.1
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  rootless: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 0
  swapTotal: 0
  uptime: 714h 51m 22.19s (Approximately 29.75 days)
registries:
  search:
  - docker.io
  - registry.fedoraproject.org
  - registry.access.redhat.com
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 2
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 6
  runRoot: /var/run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 1
  Built: 0
  BuiltTime: Wed Dec 31 16:00:00 1969
  GitCommit: ""
  GoVersion: go1.14
  OsArch: linux/amd64
  Version: 2.0.4
```

Package info (e.g. output of rpm -q podman or apt list podman):

podman/unknown,now 2.0.4~1 amd64 [installed]

Additional environment details (AWS, VirtualBox, physical, etc.):

mheon commented 4 years ago

@baude @jwhonce @vrothberg PTAL

I don't think we've seen this elsewhere in our testing, which is making me think this is Debian-specific. Does this happen as rootless, root, or both?

Suspicion: systemd is blowing away all containers when it throttles down due to KillMode on the unit file. I'd expect that moving the containers out of the unit's cgroup would fix this for root containers, but I could be wrong.

jwhonce commented 4 years ago

@mheon I have not seen this on fedora or rhel using systemd.

danopia commented 4 years ago

I'd believe that this is Debian-specific. I haven't had a chance to try rootless yet; I'll see if I can get a repro.

I just tried adding KillMode=process to the podman units but it didn't seem to change anything. Not sure if I applied it right. I'm not too handy with systemd.

Separately, I noticed that systemd is not too happy about the podman shutdown in general. Here are several systemctl status podman.service snapshots taken while running the docker script above:

While the unix socket is active and in use:

```
* podman.service - Podman API Service
     Loaded: loaded (/lib/systemd/system/podman.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2020-08-11 16:56:44 UTC; 8s ago
TriggeredBy: * podman.socket
       Docs: man:podman-system-service(1)
   Main PID: 3859 (podman)
      Tasks: 11 (limit: 8924)
     Memory: 10.9M
     CGroup: /kubepods/burstable/podb8a75f5d-8b75-4e4a-9604-e127f52c0ae5/31baac9f33dab30776c93283f418471af3b5ba3d8ddc1c6d14bd8654b657497f/kubepods/burstable/podb8a75f5d-8b75-4e4a-9604-e127f52c0ae5/31baac9f33dab30776c93283f418471af3b5ba3d8ddc1c6d14bd8654b657497f/docker/fa6977c78428cb8fd7c9a610f13730c143d651987c0a0e6ea95f314b23e2ebc4/system.slice/podman.service
             |-3859 /usr/bin/podman system service
             `-3886 /usr/libexec/podman/conmon --api-version 1 -c acb61eb4a3d7b3f509a88bfcee7f3bea85d6f7f6d3b78c86a327f1cfce295049 -u acb61eb4a3d7b3f509a88bfcee7f3bea85d6f7f6d3b78c86a327f1cfce295049 -r /usr/bin/runc -b /var/lib/containers/storage/vfs-containers/acb61eb4a3d7b3f509a88bfcee7f3bea85d6f7f6d3b78c86a327f1cfce295049/userdata -p /var/run/containers/storage/vfs-containers/acb61eb4a3d7b3f509a88bfcee7f3bea85d6f7f6d3b78c86a327f1cfce295049/userdata/pidfile -n bg-test --exit-dir /var/run/libpod/exits --socket-dir-path /var/run/libpod/socket -s -l k8s-file:/var/lib/containers/storage/vfs-containers/acb61eb4a3d7b3f509a88bfcee7f3bea85d6f7f6d3b78c86a327f1cfce295049/userdata/ctr.log --log-level error --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/var/run/containers/storage/vfs-containers/acb61eb4a3d7b3f509a88bfcee7f3bea85d6f7f6d3b78c86a327f1cfce295049/userdata/oci-log --conmon-pidfile /var/run/containers/storage/vfs-containers/acb61eb4a3d7b3f509a88bfcee7f3bea85d6f7f6d3b78c86a327f1cfce295049/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg error --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg container --exit-command-arg cleanup --exit-command-arg acb61eb4a3d7b3f509a88bfcee7f3bea85d6f7f6d3b78c86a327f1cfce295049

Aug 11 16:56:44 fa6977c78428 systemd[1]: Started Podman API Service.
```
Just after the timeout, once the containers have already died; the unit stays 'deactivating' for some time:

```
* podman.service - Podman API Service
     Loaded: loaded (/lib/systemd/system/podman.service; enabled; vendor preset: enabled)
     Active: deactivating (stop-sigterm) since Tue 2020-08-11 16:56:54 UTC; 15s ago
TriggeredBy: * podman.socket
       Docs: man:podman-system-service(1)
    Process: 3859 ExecStart=/usr/bin/podman system service (code=exited, status=0/SUCCESS)
   Main PID: 3859 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 8924)
     Memory: 1.1M
     CGroup: /kubepods/burstable/podb8a75f5d-8b75-4e4a-9604-e127f52c0ae5/31baac9f33dab30776c93283f418471af3b5ba3d8ddc1c6d14bd8654b657497f/kubepods/burstable/podb8a75f5d-8b75-4e4a-9604-e127f52c0ae5/31baac9f33dab30776c93283f418471af3b5ba3d8ddc1c6d14bd8654b657497f/docker/fa6977c78428cb8fd7c9a610f13730c143d651987c0a0e6ea95f314b23e2ebc4/system.slice/podman.service

Aug 11 16:56:44 fa6977c78428 systemd[1]: Started Podman API Service.
```
Minutes later, the unit is failed:

```
* podman.service - Podman API Service
     Loaded: loaded (/lib/systemd/system/podman.service; enabled; vendor preset: enabled)
     Active: failed (Result: timeout) since Tue 2020-08-11 16:58:25 UTC; 1min 19s ago
TriggeredBy: * podman.socket
       Docs: man:podman-system-service(1)
    Process: 3859 ExecStart=/usr/bin/podman system service (code=exited, status=0/SUCCESS)
   Main PID: 3859 (code=exited, status=0/SUCCESS)

Aug 11 16:56:44 fa6977c78428 systemd[1]: Started Podman API Service.
Aug 11 16:58:25 fa6977c78428 systemd[1]: podman.service: State 'stop-sigterm' timed out. Killing.
Aug 11 16:58:25 fa6977c78428 systemd[1]: podman.service: Failed with result 'timeout'.
```

On any further request the unit recovers back to active and services the request. Not sure how relevant that is... in fact I see podman system service not obeying ^C when run interactively, so this is probably not directly relevant to this bug report.

So, possibly some missing config in the Debian packaging? I've repro'd using packages from both kubic and debian-testing.

mheon commented 4 years ago

Can you try KillMode=none instead?

danopia commented 4 years ago

Ah crap, I patched the user unit instead of the system unit.

Adding KillMode=process to /lib/systemd/system/podman.service is a successful workaround in my docker repro.
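For reference, the same workaround can be applied without editing the packaged unit file by using a systemd drop-in (a sketch; the drop-in path follows the standard systemd override convention):

```ini
# /etc/systemd/system/podman.service.d/override.conf
[Service]
KillMode=process
```

Followed by `systemctl daemon-reload` so systemd picks up the override.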

Seems like nothing unintentional leaked?

# docker exec test ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.2  0.1  24016 10700 ?        Ss   17:30   0:00 /lib/systemd/systemd --system --deserialize 14
message+     129  0.0  0.0   6932  4252 ?        Ss   17:30   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root         131  0.1  0.0  16580  6700 ?        Ss   17:30   0:00 /lib/systemd/systemd-logind
root         321  0.1  0.1  54380 11608 ?        Ss   17:30   0:00 /lib/systemd/systemd-journald
root        4144  0.0  0.0  81156  1960 ?        Ssl  17:33   0:00 /usr/libexec/podman/conmon --api-version 1 -c 1dca36ce88bad40ff34ee36042be6ccd9185389142fe3e040f76459a0a2c3331 -u 1dca36ce88bad40ff34ee36042be6ccd9185389142fe3e040f76459a0a2c3331 -r /usr/bin/runc -b /var/lib/containers/storage/vfs-containers/1dca36ce88bad40ff34ee36042be6ccd9185389142fe3e040f76459a0a2c3331/userdata -p /var/run/containers/storage/vfs-containers/1dca36ce88bad40ff34ee36042be6ccd9185389142fe3e040f76459a0a2c3331/userdata/pidfile -n bg-test --exit-dir /var/run/libpod/exits --socket-dir-path /var/run/libpod/socket -s -l k8s-file:/var/lib/containers/storage/vfs-containers/1dca36ce88bad40ff34ee36042be6ccd9185389142fe3e040f76459a0a2c3331/userdata/ctr.log --log-level error --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/var/run/containers/storage/vfs-containers/1dca36ce88bad40ff34ee36042be6ccd9185389142fe3e040f76459a0a2c3331/userdata/oci-log --conmon-pidfile /var/run/containers/storage/vfs-containers/1dca36ce88bad40ff34ee36042be6ccd9185389142fe3e040f76459a0a2c3331/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg error --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 1dca36ce88bad40ff34ee36042be6ccd9185389142fe3e040f76459a0a2c3331
root        4156  0.2  0.0    968     4 ?        Ss   17:33   0:00 /pause
root        4276  0.0  0.0   6624  2892 ?        Rs   17:35   0:00 ps aux

systemd noted what leaked, too:

Aug 11 17:33:31 8b373e600677 systemd[1]: podman.service: Unit process 4144 (conmon) remains running after unit stopped.

Do we have an explanation for why this would be Debian-specific? I do not see KillMode in podman.service.

mheon commented 4 years ago

This was a quite recent change - @vrothberg PTAL, I think your KillMode change may have broken things.

vrothberg commented 4 years ago

We changed to using the default killmode (i.e., cgroups) because of https://github.com/containers/podman/issues/7021 (Cc @martinpitt).

FWIW, I can reproduce on Fedora as well. The problem I think we're having is that conmons and containers are in a sub-cgroup and are hence killed when the service stops.

@mheon @baude, I guess that's something we need to fix before 2.0.5.

I suspect that https://github.com/containers/podman/issues/7021 happened in the user service because of Podman re-execing to join the user namespace. We don't really have a mechanism yet for Podman to either write its PID or send it via sd-notify.

What I think we should do, now that we have nice sd-notify support, is to check in podman system service whether we're running in a systemd service and then send READY and the PID.

@giuseppe WDYT?
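For concreteness, "send READY and PID" at the protocol level is just a datagram written to the socket systemd passes in $NOTIFY_SOCKET. A minimal sketch of the idea (not podman's actual code, which is Go and would use a library binding for this):

```python
import os
import socket

def sd_notify(state: str) -> bool:
    """Send an sd_notify(3)-style state string (e.g. "READY=1") to the
    datagram socket systemd passes via $NOTIFY_SOCKET.
    Returns False when not running under a systemd service."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):  # abstract-namespace socket
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("ascii"), addr)
    return True

# Under Type=notify, the service would call, once its listeners are up:
# sd_notify(f"READY=1\nMAINPID={os.getpid()}")
```

With this in place the unit could use `Type=notify`, so systemd tracks the main process explicitly instead of guessing.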

giuseppe commented 4 years ago

If we specify KillMode=process then conmon will be targeted, and it becomes a race: conmon may be KILLed if it doesn't terminate before the container does. With KillMode=cgroup, instead, all the processes inside the container get the signal (do we even honor the signal configuration in this case?).

IMO we should use KillMode=none and let podman manage how to kill the container.

mheon commented 4 years ago

Implementing sd-notify for podman system service would seem to make sense, given it's expected to run socket-activated under systemd and (as rootless) to fork to join the userns.

rhatdan commented 4 years ago

Lets attempt the sd-notify.

edsantiago commented 4 years ago

The killing of pause is one symptom of this bug; here is another reproducer I found with much worse symptoms:

# (podman-remote run -d --name foo busybox sleep 60;sleep 11;echo here we go;podman-remote ps -a)|ts
Aug 13 13:16:33 01da3fa8a9e8e7ce0ac470f8ff5856676d4a6906837eb87b8b55d8118b48b342
Aug 13 13:16:44 here we go
Aug 13 13:17:33 CONTAINER ID  IMAGE                             COMMAND   CREATED             STATUS                             PORTS   NAMES
Aug 13 13:17:33 01da3fa8a9e8  docker.io/library/busybox:latest  sleep 60  About a minute ago  Exited (0) Less than a second ago          foo

Translation: the podman-remote ps -a issued right after "here we go" (13:16:44) hung for ~49 seconds, not returning until 13:17:33, by which point the container had exited.

As best I can tell, #7312 solves this behavior, but I'm still looking into it further. I am reporting this because others may experience this different symptom and search for "podman remote ps hang".