containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0
23.42k stars 2.38k forks source link

Rootless API closes /events endpoint after minute and half #8751

Closed marusak closed 3 years ago

marusak commented 3 years ago

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description When I try to listen on /events endpoint as non-root user it closes after minute and half. This started to happen since podman-2.2.1 release in Fedora 32/33.

Steps to reproduce the issue:

  1. $ systemctl --user start podman.socket
  2. 
    $ time  curl --unix-socket /run/user/1000/podman/podman.sock http://d/v1.24/events
    curl: (18) transfer closed with outstanding read data remaining

real 1m35.054s user 0m0.013s sys 0m0.017s


**Describe the results you received:**
The API closes.

**Describe the results you expected:**
The API will continue to listen.

**Additional information you deem important (e.g. issue happens only occasionally):**

$ systemctl --user start podman.socket

$ systemctl --user status podman ● podman.service - Podman API Service Loaded: loaded (/usr/lib/systemd/user/podman.service; static; vendor preset: disabled) Active: inactive (dead) TriggeredBy: ● podman.socket Docs: man:podman-system-service(1)

$ time curl --unix-socket /run/user/1000/podman/podman.sock http://d/v1.24/events curl: (18) transfer closed with outstanding read data remaining

real 1m35.054s user 0m0.013s sys 0m0.017s

$ systemctl --user status podman ● podman.service - Podman API Service Loaded: loaded (/usr/lib/systemd/user/podman.service; static; vendor preset: disabled) Active: failed (Result: timeout) since Wed 2020-12-16 11:59:48 EST; 3s ago TriggeredBy: ● podman.socket Docs: man:podman-system-service(1) Process: 17474 ExecStart=/usr/bin/podman $LOGGING system service (code=exited, status=1/FAILURE) Main PID: 17474 (code=exited, status=1/FAILURE) CPU: 131ms

Dec 16 11:58:13 m1.cockpit.lan systemd[17434]: podman.service: Got notification message from PID 17519, but reception only permitted for main PID 17474 Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="APIHandler(d146316c-7340-4d8a-9d52-8185cbe7792f) -- GET /v1.24/events BEGIN" Dec 16 11:59:43 m1.cockpit.lan systemd[17434]: podman.service: start operation timed out. Terminating. Dec 16 11:59:43 m1.cockpit.lan podman[17474]: time="2020-12-16T11:59:43-05:00" level=info msg="Received shutdown signal terminated, terminating!" Dec 16 11:59:43 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:43-05:00" level=info msg="Received shutdown signal terminated, terminating!" Dec 16 11:59:43 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:43-05:00" level=info msg="Invoking shutdown handler server" Dec 16 11:59:48 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:48-05:00" level=info msg="Invoking shutdown handler libpod" Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: podman.service: Main process exited, code=exited, status=1/FAILURE Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: podman.service: Failed with result 'timeout'. Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: Failed to start Podman API Service.


A bit more logs:

$ journalctl -r -- Logs begin at Fri 2020-12-04 21:12:04 EST, end at Wed 2020-12-16 11:59:48 EST. -- Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: Failed to start Podman API Service. Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: podman.service: Failed with result 'timeout'. Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: podman.service: Main process exited, code=exited, status=1/FAILURE Dec 16 11:59:48 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:48-05:00" level=info msg="Invoking shutdown handler libpod" Dec 16 11:59:43 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:43-05:00" level=info msg="Invoking shutdown handler server" Dec 16 11:59:43 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:43-05:00" level=info msg="Received shutdown signal terminated, terminating!" Dec 16 11:59:43 m1.cockpit.lan podman[17474]: time="2020-12-16T11:59:43-05:00" level=info msg="Received shutdown signal terminated, terminating!" Dec 16 11:59:43 m1.cockpit.lan systemd[17434]: podman.service: start operation timed out. Terminating. Dec 16 11:59:41 m1.cockpit.lan systemd[17434]: Finished Mark boot as successful. Dec 16 11:59:41 m1.cockpit.lan systemd[17434]: grub-boot-success.service: Succeeded. Dec 16 11:59:41 m1.cockpit.lan systemd[17434]: Starting Mark boot as successful... Dec 16 11:59:29 m1.cockpit.lan kernel: audit: type=1131 audit(1608137968.994:331): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=libvirtd comm="syste> Dec 16 11:59:28 m1.cockpit.lan audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=libvirtd comm="systemd" exe="/usr/lib/systemd/sys> Dec 16 11:59:28 m1.cockpit.lan systemd[1]: libvirtd.service: Succeeded. Dec 16 11:59:21 m1.cockpit.lan audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="systemd" exe="/usr/lib/systemd/s> Dec 16 11:59:21 m1.cockpit.lan kernel: audit: type=1131 audit(1608137961.238:330): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="sys> Dec 16 11:59:21 m1.cockpit.lan kernel: audit: type=1130 audit(1608137961.238:329): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="sys> Dec 16 11:59:21 m1.cockpit.lan audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="systemd" exe="/usr/lib/systemd/> Dec 16 11:59:21 m1.cockpit.lan systemd[1]: Finished Check PMIE instances are running. Dec 16 11:59:21 m1.cockpit.lan systemd[1]: pmie_check.service: Succeeded. Dec 16 11:59:21 m1.cockpit.lan kernel: audit: type=1131 audit(1608137961.230:328): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmlogger_check comm=> Dec 16 11:59:21 m1.cockpit.lan audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmlogger_check comm="systemd" exe="/usr/lib/syste> Dec 16 11:59:21 m1.cockpit.lan audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmlogger_check comm="systemd" exe="/usr/lib/syst> Dec 16 11:59:21 m1.cockpit.lan kernel: audit: type=1130 audit(1608137961.230:327): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmlogger_check comm=> Dec 16 11:59:21 m1.cockpit.lan systemd[1]: Finished Check pmlogger instances are running. Dec 16 11:59:21 m1.cockpit.lan systemd[1]: pmlogger_check.service: Succeeded. Dec 16 11:59:20 m1.cockpit.lan systemd[1]: Starting Check pmlogger instances are running... Dec 16 11:59:20 m1.cockpit.lan systemd[1]: Starting Check PMIE instances are running... Dec 16 11:58:39 m1.cockpit.lan chronyd[607]: Selected source 147.251.48.140 Dec 16 11:58:13 m1.cockpit.lan kernel: audit: type=1131 audit(1608137893.708:326): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="sys> Dec 16 11:58:13 m1.cockpit.lan audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="systemd" exe="/usr/lib/systemd/s> Dec 16 11:58:13 m1.cockpit.lan audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="systemd" exe="/usr/lib/systemd/> Dec 16 11:58:13 m1.cockpit.lan kernel: audit: type=1130 audit(1608137893.708:325): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="sys> Dec 16 11:58:13 m1.cockpit.lan systemd[1]: Finished Check PMIE instances are running. Dec 16 11:58:13 m1.cockpit.lan systemd[1]: pmie_check.service: Succeeded. Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="APIHandler(d146316c-7340-4d8a-9d52-8185cbe7792f) -- GET /v1.24/events BEGIN" Dec 16 11:58:13 m1.cockpit.lan systemd[17434]: podman.service: Got notification message from PID 17519, but reception only permitted for main PID 17474 Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="API server listening on \"/run/user/1000/podman/podman.sock\"" Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="using API endpoint: ''" Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="using systemd socket activation to determine API endpoint" Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="Setting parallel job count to 4" Dec 16 11:58:13 m1.cockpit.lan podman[17519]: 2020-12-16 11:58:13.636386071 -0500 EST m=+0.049761667 system refresh Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=warning msg="Error initializing configured OCI runtime kata: no valid executable found for OCI runtime k> Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="/usr/bin/podman filtering at log level info" Dec 16 11:58:13 m1.cockpit.lan podman[17474]: time="2020-12-16T11:58:13-05:00" level=warning msg="Error initializing configured OCI runtime kata: no valid executable found for OCI runtime k> Dec 16 11:58:13 m1.cockpit.lan podman[17474]: time="2020-12-16T11:58:13-05:00" level=info msg="/usr/bin/podman filtering at log level info" Dec 16 11:58:13 m1.cockpit.lan systemd[1]: Starting Check PMIE instances are running... Dec 16 11:58:13 m1.cockpit.lan systemd[17434]: Starting Podman API Service... Dec 16 11:57:58 m1.cockpit.lan audit: BPF prog-id=43 op=UNLOAD Dec 16 11:57:58 m1.cockpit.lan kernel: audit: type=1334 audit(1608137878.525:324): prog-id=43 op=UNLOAD Dec 16 11:57:58 m1.cockpit.lan kernel: audit: type=1334 audit(1608137878.525:323): prog-id=44 op=UNLOAD Dec 16 11:57:58 m1.cockpit.lan audit: BPF prog-id=44 op=UNLOAD Dec 16 11:57:58 m1.cockpit.lan audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/sy> Dec 16 11:57:58 m1.cockpit.lan kernel: audit: type=1131 audit(1608137878.504:322): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed co> Dec 16 11:57:58 m1.cockpit.lan kernel: kauditd_printk_skb: 21 callbacks suppressed Dec 16 11:57:58 m1.cockpit.lan systemd[1]: systemd-hostnamed.service: Succeeded. Dec 16 11:57:55 m1.cockpit.lan systemd[17434]: Listening on Podman API Socket.


Interestingly, when I boot VM, first time events time out after ~1.5 minute, when I try it again it never times out, any other subsequent call times out in ~1.5minute.

**Output of `podman version`:**

Version: 2.2.1 API Version: 2.1.0 Go Version: go1.15.5 Built: Tue Dec 8 09:37:50 2020 OS/Arch: linux/amd64


**Output of `podman info --debug`:**

host: arch: amd64 buildahVersion: 1.18.0 cgroupManager: systemd cgroupVersion: v2 conmon: package: conmon-2.0.21-3.fc33.x86_64 path: /usr/bin/conmon version: 'conmon version 2.0.21, commit: 0f53fb68333bdead5fe4dc5175703e22cf9882ab' cpus: 1 distribution: distribution: fedora version: "32" eventLogger: journald hostname: m1.cockpit.lan idMappings: gidmap: null uidmap: null kernel: 5.9.13-200.fc33.x86_64 linkmode: dynamic memFree: 437579776 memTotal: 1150504960 ociRuntime: name: crun package: crun-0.16-1.fc33.x86_64 path: /usr/bin/crun version: |- crun version 0.16 commit: eb0145e5ad4d8207e84a327248af76663d4e50dd spec: 1.0.0 +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL os: linux remoteSocket: exists: true path: /run/podman/podman.sock rootless: false slirp4netns: executable: "" package: "" version: "" swapFree: 1291042816 swapTotal: 1291841536 uptime: 27m 24.22s registries: search:

Package info (e.g. output of rpm -q podman or apt list podman):

podman-2.2.1-1.fc33.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes (latest being what is in fedora repos, haven't tested current upstream HEAD)

mheon commented 3 years ago

Dec 16 11:59:43 m1.cockpit.lan podman[17474]: time="2020-12-16T11:59:43-05:00" level=info msg="Received shutdown signal terminated, terminating!"

Something is sending Podman SIGTERM. Maybe systemd?

mheon commented 3 years ago

@baude @jwhonce PTAL

mheon commented 3 years ago

Dec 16 11:59:43 m1.cockpit.lan systemd[17434]: podman.service: start operation timed out. Terminating.

This strongly suggests that Systemd is of the opinion that podman system service did not successfully start, and is sending SIGTERM to terminate what it sees as a process stuck trying to start.

marusak commented 3 years ago

Might be, it is the whole time in activating (start) state:

systemctl --user status podman
● podman.service - Podman API Service
     Loaded: loaded (/usr/lib/systemd/user/podman.service; static; vendor preset: disabled)
     Active: activating (start) since Wed 2020-12-16 12:47:20 EST; 1min 27s ago
TriggeredBy: ● podman.socket
       Docs: man:podman-system-service(1)
   Main PID: 19416 (podman)
      Tasks: 16 (limit: 1271)
     Memory: 99.7M
        CPU: 101ms
     CGroup: /user.slice/user-1000.slice/user@1000.service/podman.service
             ├─19416 /usr/bin/podman --log-level=info system service
             ├─19422 /usr/bin/podman --log-level=info system service
             └─19427 /usr/bin/podman

The system unit does not time out after this short time and it is indeed active:

systemctl status podman
● podman.service - Podman API Service
     Loaded: loaded (/usr/lib/systemd/system/podman.service; static; vendor preset: disabled)
     Active: active (running) since Wed 2020-12-16 12:51:28 EST; 11s ago
TriggeredBy: ● podman.socket
       Docs: man:podman-system-service(1)
   Main PID: 19501 (podman)
      Tasks: 8 (limit: 1271)
     Memory: 37.0M
        CPU: 99ms
     CGroup: /system.slice/podman.service
             └─19501 /usr/bin/podman --log-level=info system service
marusak commented 3 years ago

Worth mentioning this is nothing new, I was saying this is happening a while ago, see in https://github.com/containers/podman/issues/5599 4. Notice that podman.service is activating (start). ((1) Never running though - this is one interesting observation) It might be that new systemd now kills units in this state after 1.5 minute. That would explain why I cannot reproduce it on my machine directly (as I haven't restarted it in a while)

rhatdan commented 3 years ago

Type=notify The socket is set for notify. Is podman ever notifying systemd that it is ready for connections?

rhatdan commented 3 years ago

These services should be changed to type=simple, I believe. They do not send the notify reply to systemd, which must kill them after 1.5 minutes.

zhangguanzhang commented 3 years ago

close,could reopen if happened again /close

openshift-ci-robot commented 3 years ago

@zhangguanzhang: Closing this issue.

In response to [this](https://github.com/containers/podman/issues/8751#issuecomment-747233340): >close,could reopen if happened again >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
martinpitt commented 3 years ago

This bug is still in Fedora 33, upstream we have an ugly hack for it, and downstream it still breaks tests. Any chance that this could be fixed in Fedora 33? Thanks!

baude commented 3 years ago

it should come out with podman v3 which is in rc3 right now