containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

User Podman Services (podman.service/podman.socket) fail within 24 hrs #10593

Closed: iUnknwn closed this issue 2 years ago

iUnknwn commented 3 years ago

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

User podman services (podman.socket and podman.service) fail within 24 hours of a system reboot. While user podman containers continue to run, systemctl shows both units as failed.

Output from podman.service journal:

Jun 07 22:50:27 local.lan systemd[1234]: Failed to start Podman API Service.
Jun 07 22:50:27 local.lan systemd[1234]: podman.service: Failed to allocate exec_fd pipe: Too many open files
Jun 07 22:50:27 local.lan systemd[1234]: podman.service: Failed to run 'start' task: Too many open files
Jun 07 22:50:27 local.lan systemd[1234]: podman.service: Failed with result 'resources'.

Output from podman.socket journal:

Jun 07 22:50:35 local.lan systemd[1234]: Listening on Podman API Socket.
Jun 07 22:50:36 local.lan systemd[1234]: podman.socket: Trigger limit hit, refusing further activation.
Jun 07 22:50:36 local.lan systemd[1234]: podman.socket: Failed with result 'trigger-limit-hit'.

Both these issues look similar to previously closed issues (https://github.com/containers/podman/issues/6093 and https://github.com/containers/podman/issues/5150) but (unless I'm reading them wrong) fixes for those issues should have been merged a while ago.

Steps to reproduce the issue:

  1. Generate a rootless container (I started 'docker.io/thelounge/thelounge:latest') and create a corresponding user systemd unit (a command sketch follows the list).

  2. Allow to run for 24 hours.

  3. Run systemctl --user status - the system will show as degraded. If systemctl --user list-units --failed is run, both podman.socket and podman.service show as failed.
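
For reference, the commands involved look roughly like this (a sketch assuming the unit was created with podman generate systemd; exact flags and names may differ):

podman run -d --name thelounge docker.io/thelounge/thelounge:latest
podman generate systemd --new --files --name thelounge
mkdir -p ~/.config/systemd/user && mv container-thelounge.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now container-thelounge.service
systemctl --user enable --now podman.socket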

Describe the results you received: Podman systemd units failed.

Describe the results you expected: Podman services to continue working normally.

Additional information you deem important (e.g. issue happens only occasionally): Both appear to be online and working at system start.

Output of podman version:

podman version 3.1.0-dev

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.19.8
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon-2.0.27-1.module_el8.5.0+733+9bb5dffa.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.27, commit: dc08a6edf03cc2dadfe803eac14b896b44cc4721'
  cpus: 4
  distribution:
    distribution: '"centos"'
    version: "8"
  eventLogger: file
  hostname: local.lan
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 4.18.0-305.3.1.el8.x86_64
  linkmode: dynamic
  memFree: 13275705344
  memTotal: 16480956416
  ociRuntime:
    name: runc
    package: runc-1.0.0-70.rc92.module_el8.5.0+733+9bb5dffa.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.2-dev'
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    selinuxEnabled: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.8-1.module_el8.5.0+733+9bb5dffa.x86_64
    version: |-
      slirp4netns version 1.1.8
      commit: d361001f495417b880f20329121e3aa431a8f90f
      libslirp: 4.3.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.1
  swapFree: 3670011904
  swapTotal: 3670011904
  uptime: 30h 19m 58.74s (Approximately 1.25 days)
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /home/USERNAME/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.5.0-1.module_el8.5.0+733+9bb5dffa.x86_64
      Version: |-
        fusermount3 version: 3.2.1
        fuse-overlayfs: version 1.5
        FUSE library version 3.2.1
        using FUSE kernel interface version 7.26
  graphRoot: /home/USERNAME/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 1
  runRoot: /run/user/1000/containers
  volumePath: /home/USERNAME/.local/share/containers/storage/volumes
version:
  APIVersion: 3.1.0-dev
  Built: 1616783523
  BuiltTime: Fri Mar 26 11:32:03 2021
  GitCommit: ""
  GoVersion: go1.16.1
  OsArch: linux/amd64
  Version: 3.1.0-dev

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.1.0-0.13.module_el8.5.0+733+9bb5dffa.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? Checked the troubleshooting guide. While this is not the latest version, it looks like these issues were fixed in Podman 1.9.

Additional environment details (AWS, VirtualBox, physical, etc.): Physical system running CentOS Stream 8.

rhatdan commented 3 years ago

Are you constantly using the service or is the service sitting idle? This might be us leaking a socket/open FD. @jwhonce @baude @mheon WDYT

mheon commented 3 years ago

First observation - that Podman release is a little old, there should be something more recent tagged into the Stream 8 repos by now?

For debugging, can you do an lsof | grep podman and see if anything comes up? It certainly seems like we're leaking files but I'm not entirely certain what they would be.
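
A per-process variant that just counts descriptors might also be useful (illustrative, not a specific request):

for pid in $(pgrep -x podman); do printf '%s %s\n' "$pid" "$(ls /proc/$pid/fd 2>/dev/null | wc -l)"; done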

jwhonce commented 3 years ago

@vrothberg There are logs attached to the BZ. Have you ever seen the following from systemd?

Jun 07 03:01:11 localhost systemd[4292]: podman.service: Found left-over process 11321 (n/a) in control group while starting unit. Ignoring.
Jun 07 03:01:11 localhost systemd[4292]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

which leads to

Jun 07 03:01:11 localhost systemd[4292]: podman.service: Failed to allocate exec_fd pipe: Too many open files
Jun 07 03:01:11 localhost systemd[4292]: podman.service: Failed to run 'start' task: Too many open files
Jun 07 03:01:11 localhost systemd[4292]: podman.service: Failed with result 'resources'.
Jun 07 03:01:11 localhost systemd[4292]: Failed to start Podman API Service.
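
For reference, a rough way to check whether the systemd user instance itself is the process sitting at its descriptor limit (PID 4292 is taken from the log above; purely illustrative):

ls /proc/4292/fd | wc -l
grep 'open files' /proc/4292/limits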

TomSweeneyRedHat commented 3 years ago

Also reported in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1968210

iUnknwn commented 3 years ago

@rhatdan The only thing that I suspect might be communicating with the service is cockpit (beyond whatever default services are in a clean install), but I was only logged into the cockpit console briefly when the system rebooted. After that it was sitting idle.

@mheon After running lsof | grep podman I noticed something interesting. Not sure if it's relevant, but I've included it here.

If I run lsof | grep podman - there aren't that many files:

systemd       1                           root  108u     unix 0xffff992a44fb3f00         0t0      32316 /run/podman/podman.sock type=STREAM
podman     1315                           USERNAME  cwd       DIR              253,0        4096  671088769 /home/USERNAME
podman     1315                           USERNAME  rtd       DIR              253,0         224        128 /
podman     1315                           USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
podman     1315                           USERNAME  mem       REG              253,0     3167848   67348405 /usr/lib64/libc-2.28.so
podman     1315                           USERNAME  mem       REG              253,0       69032   67348423 /usr/lib64/librt-2.28.so
podman     1315                           USERNAME  mem       REG              253,0      132888   67348908 /usr/lib64/libseccomp.so.2.5.1
podman     1315                           USERNAME  mem       REG              253,0      145984   67348618 /usr/lib64/libgpg-error.so.0.24.2
podman     1315                           USERNAME  mem       REG              253,0       92664   67348876 /usr/lib64/libassuan.so.0.8.1
podman     1315                           USERNAME  mem       REG              253,0      334384   68216694 /usr/lib64/libgpgme.so.11.22.1
podman     1315                           USERNAME  mem       REG              253,0       28856   67348407 /usr/lib64/libdl-2.28.so
podman     1315                           USERNAME  mem       REG              253,0      320712   67348419 /usr/lib64/libpthread-2.28.so
podman     1315                           USERNAME  mem       REG              253,0      278536   67348385 /usr/lib64/ld-2.28.so
podman     1315                           USERNAME    0u      CHR                1,3         0t0       1045 /dev/null
podman     1315                           USERNAME    1u      CHR                1,3         0t0       1045 /dev/null
podman     1315                           USERNAME    2u      CHR                1,3         0t0       1045 /dev/null
exe        1386                           USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386  1390 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386  1391 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386  1392 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386  1393 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386  1395 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386  1397 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386  1399 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386  1409 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386  1410 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386  1543 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386  1660 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1386 13582 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1408                           USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1408  1413 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1408  1414 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1408  1415 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1408  1416 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1408  1417 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1408  1419 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1408  1420 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman
exe        1408  1427 exe                 USERNAME  txt       REG              253,0    47937824     854298 /usr/bin/podman

But when I ran that, I noticed there was a lot of text being written to stderr - many lines of lsof: no pwd entry for UID 100999, which I think is the UID of the rootless user inside the container.

If I ran lsof |& grep 100999, I saw a mix of things that looked like they were coming from within the container:

node       1544  1598 node              100999  mem       REG               0,50           604125601 /usr/local/bin/node (stat: No such file or directory)
node       1544  1598 node              100999  mem       REG               0,50            68136011 /lib/x86_64-linux-gnu/libresolv-2.24.so (stat: No such file or directory)
node       1544  1598 node              100999  mem       REG               0,50            68131446 /lib/x86_64-linux-gnu/libnss_dns-2.24.so (stat: No such file or directory)
node       1544  1598 node              100999  mem       REG               0,50            68131448 /lib/x86_64-linux-gnu/libnss_files-2.24.so (stat: No such file or directory)
node       1544  1598 node              100999  mem       REG               0,50           135242135 /usr/local/share/.config/yarn/global/node_modules/sqlite3/lib/binding/napi-v3-linux-x64/node_sqlite3.node (stat: No such file or directory)
node       1544  1598 node              100999  mem       REG               0,50            68131402 /lib/x86_64-linux-gnu/libc-2.24.so (stat: No such file or directory)
node       1544  1598 node              100999  mem       REG               0,50            68136009 /lib/x86_64-linux-gnu/libpthread-2.24.so (stat: No such file or directory)
node       1544  1598 node              100999  mem       REG               0,50            68131422 /lib/x86_64-linux-gnu/libgcc_s.so.1 (stat: No such file or directory)
node       1544  1598 node              100999  mem       REG               0,50            68131431 /lib/x86_64-linux-gnu/libm-2.24.so (stat: No such file or directory)
node       1544  1598 node              100999  mem       REG               0,50           202275877 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.22 (stat: No such file or directory)
node       1544  1598 node              100999  mem       REG               0,50            68131414 /lib/x86_64-linux-gnu/libdl-2.24.so (stat: No such file or directory)
node       1544  1598 node              100999  mem       REG               0,50            68126383 /lib/x86_64-linux-gnu/ld-2.24.so (stat: No such file or directory)
node       1544  1598 node              100999 NOFD                                                  /proc/1544/task/1598/fd (opendir: Permission denied)
node       1544  1599 node              100999  cwd   unknown                                        /usr/local/share/.config/yarn/global/node_modules/thelounge (stat: Permission denied)
node       1544  1599 node              100999  rtd   unknown                                        / (stat: Permission denied)
node       1544  1599 node              100999  txt   unknown                                        /usr/local/bin/node (stat: Permission denied)
node       1544  1599 node              100999  mem       REG               0,50           604125601 /usr/local/bin/node (stat: No such file or directory)
node       1544  1599 node              100999  mem       REG               0,50            68136011 /lib/x86_64-linux-gnu/libresolv-2.24.so (stat: No such file or directory)
node       1544  1599 node              100999  mem       REG               0,50            68131446 /lib/x86_64-linux-gnu/libnss_dns-2.24.so (stat: No such file or directory)
node       1544  1599 node              100999  mem       REG               0,50            68131448 /lib/x86_64-linux-gnu/libnss_files-2.24.so (stat: No such file or directory)
node       1544  1599 node              100999  mem       REG               0,50           135242135 /usr/local/share/.config/yarn/global/node_modules/sqlite3/lib/binding/napi-v3-linux-x64/node_sqlite3.node (stat: No such file or directory)
node       1544  1599 node              100999  mem       REG               0,50            68131402 /lib/x86_64-linux-gnu/libc-2.24.so (stat: No such file or directory)
node       1544  1599 node              100999  mem       REG               0,50            68136009 /lib/x86_64-linux-gnu/libpthread-2.24.so (stat: No such file or directory)

As well as blocks of 'no pwd entry for UID':

lsof: no pwd entry for UID 100999
lsof: no pwd entry for UID 100999
lsof: no pwd entry for UID 100999
lsof: no pwd entry for UID 100999

There were about 522 entries for 100999, based on wc -l.

Again, not sure if relevant, but seemed odd.

jwhonce commented 3 years ago

@iUnknwn If you add -l to the lsof invocation those messages should be silenced.
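
For reference, -l makes lsof print numeric user IDs instead of resolving them to login names, so the lookup for the mapped container UID never happens (same command as used above, just with the flag):

lsof -l | grep podman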

iUnknwn commented 3 years ago

@jwhonce Yep - that suppresses the errors - thank you. Not sure if the errors (or the number of files that are open in the container) are relevant for the service failures, but figured it was worth including.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 3 years ago

@iUnknwn is this still an issue?

iUnknwn commented 3 years ago

@rhatdan I assume so? But the latest podman in CentOS Stream 8 is still 3.1.0-dev (built Mar 26, 2021).

Do you think it will behave differently in a later podman version? I'm happy to see if a new package fixes the problem (provided I can install/revert without too much disruption).

rhatdan commented 3 years ago

I have no idea if this problem is still an issue.

iUnknwn commented 3 years ago

It's still an issue on an up-to-date CentOS Stream system. The system was restarted yesterday to apply updates, and I just ran systemctl list-units --user --failed:

(screenshot of the failed units omitted)

Same error as before (screenshot omitted).

The container running in the user account is still up, though.

nitkon commented 3 years ago

Yes, I see this issue as well when the Podman socket and service for multiple rootless Podman users run for a long time (>24 hrs).

rhatdan commented 3 years ago

@jwhonce PTAL, Looks like we have a leak.

kbi-user commented 3 years ago

@rhatdan still an issue - exact same problem here.

Additional notes:

  1. Running rootless podman.
  2. I only tested containers for approx. 2 hours. For the most part I was just building containers with podman build.
  3. Total run time of containers since the last reboot (7 hours ago) was maybe 12 minutes across 5 containers.
  4. For the most part I built container images (approx. 30 times) and deleted them.
  5. The error is the exact same as listed here before.

OS: RHEL 8.4, AMD, current

PS (edit): Just realised I have Cockpit up and running all the time. Since it was mentioned, I will see how I fare without Cockpit...

mipopescu commented 3 years ago

As this is the first time I've written on a bug, I don't know if I'm complying with the rules, but it might help.

Same problem here (though not within 24 h).

Users are running rootless containers with the --userns=keep-id flag and a local volume mounted. They use Cockpit to start/stop the containers.

OS: Red Hat Enterprise Linux release 8.4 (Ootpa)

$ podman version
Version:      3.0.2-dev
API Version:  3.0.0
Go Version:   go1.15.13
OS/Arch:      linux/amd64

$ uptime
up 56 days

$ systemctl list-units --user --failed
  UNIT            LOAD   ACTIVE SUB    DESCRIPTION
● podman.service  loaded failed failed Podman API Service
● podman.socket   loaded failed failed Podman API Socket

Logs:

systemd: podman.service: Failed to allocate exec_fd pipe: Too many open files
systemd: Failed to start Podman API Service.

$ lsof -l | grep podman | wc -l
195

There are a lot of entries like the following (including per-task entries, until they reach the 195 total):

podman 3455420 USERID cwd  unknown /proc/3455420/cwd (readlink: Permission denied)
podman 3455420 USERID rtd  unknown /proc/3455420/root (readlink: Permission denied)
podman 3455420 USERID txt  unknown /proc/3455420/exe (readlink: Permission denied)
podman 3455420 USERID NOFD         /proc/3455420/fd (opendir: Permission denied)

mheon commented 3 years ago

@rhatdan @jwhonce @baude Might want to give this one some priority?

kbi-user commented 3 years ago

Current findings/impressions:

  1. I know I had Cockpit permanently on today when building, when the error came up.
  2. Tests: using Cockpit by just logging in and looking at the overview almost doubled the open files of the user (rootless podman). It went up to 74K open files for that user alone with 50+ containers.
  3. Tests: on normal and forced container shutdown all is fine. Open files for the user return to the initial range.
  4. Observation: something looks weird when gvfsd is also up. Can't pinpoint it. Just some journal entries in similar time frames.
  5. Remark: I tried to force the bug, but to no avail so far after the last occurrence/reboot.
  6. Remark: forgot to list my version: podman version 3.0.2-dev (RHEL 8 current).

A fix would be kinda cool. Once the FD limit is reached, the whole server is basically dead without admin intervention.
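
For what it's worth, a sketch of how one might check and raise the per-user limit in the meantime (values and paths are illustrative, not a recommendation from this thread):

# current soft/hard limits for the login session
ulimit -Sn; ulimit -Hn
# raise the limit for the systemd user manager and everything it starts (illustrative value)
sudo mkdir -p /etc/systemd/system/user@.service.d
printf '[Service]\nLimitNOFILE=524288\n' | sudo tee /etc/systemd/system/user@.service.d/limits.conf
sudo systemctl daemon-reload    # applies to newly started user sessions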

mheon commented 3 years ago

Ah - I believe @jwhonce is working on FD leaks right now

kbi-user commented 3 years ago

@jwhonce and @mheon 8 hours later, no leak... That's what I did:

  1. Doubled file descriptors to 512288.
  2. Pushed Cockpit and had it on for ~2 hours: open file descriptors spiked beyond 100K for the rootless pod user.
  3. Optimized my containers, drastically reducing the number of files the containers keep open - open files are now quite consistent around 30K (sorry, work must be done).

Results: (screenshot omitted)

kbi-user commented 3 years ago

@jwhonce and @mheon I took the time and checked closely over the past 22 hours.

Final statement: no bug from my side anymore on version 3.0.2-dev (RHEL)

Reasons:

  1. I underestimated my open files per container at peak times.
  2. Starting many containers in a short time (20+) obviously escalated the issue.
  3. Podman and the containers close open files nicely over time. Sometimes it takes longer than expected, but it happens nonetheless.
  4. Reducing my open files where I was excessive improved the situation; so did doubling nofile.
  5. Correctly orchestrating my containers (staged start, automatic on-demand restart, ...) finally brought it all down to reasonable levels, as expected.

That being said: the log error message from podman is really irritating, and for rootless containers it becomes more of an issue than for others. A more comprehensive error message would be cool for any OS limit. Even cooler would be some kind of monitoring of OS limits used/max by podman and per container (podman show limits), plus an error message on stdout when podman hits an OS limit.

Example for the last part: the issue became apparent when podman took forever to show ps (podman ps -a) and when no new containers could be created. A simple error message like "Error ...: out of open file descriptors. Please check or increase nofile for the current user." would be the icing on the cake ;)
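
In the meantime a rough manual check is possible (USERNAME is a placeholder for the rootless user):

# total open files attributed to the user vs. the session's nofile limit
lsof -l -u "$(id -u USERNAME)" 2>/dev/null | wc -l
ulimit -n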

Just my five cents...

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

romanstech commented 3 years ago

I'm now on Podman version 3.2.3 - the same issue still exists.

$ whoami
podman
$ less /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)
$ podman --version
podman version 3.2.3
$ ulimit -n
512288
$ lsof | grep podman | wc -l
24943
$ lsof -l | grep podman | wc -l
369
$ less ~/.config/systemd/user/sockets.target.wants/podman.socket
[Unit]
Description=Podman API Socket
Documentation=man:podman-system-service(1)

[Socket]
ListenStream=%t/podman/podman.sock
SocketMode=0660

[Install]
WantedBy=sockets.target

After 2 days running:

$ systemctl --user status podman.service
● podman.service - Podman API Service
   Loaded: loaded (/usr/lib/systemd/user/podman.service; static; vendor preset: enabled)
   Active: failed (Result: resources) since Sat 2021-09-11 23:11:50 IDT; 8h ago
     Docs: man:podman-system-service(1)
  Process: 284292 ExecStart=/usr/bin/podman $LOGGING system service (code=exited, status=0/SUCCESS)
 Main PID: 284292 (code=exited, status=0/SUCCESS)
   CGroup: /user.slice/user-17038.slice/user@17038.service/podman.service
           ├─12114 /usr/bin/fuse-overlayfs -o ,lowerdir=/home/podman/.local/share/containers/storage/overlay/l/5B3DHXXMBMOE3AXHTVZ2WALCN4:/>
           ├─12119 /usr/bin/slirp4netns --disable-host-loopback --mtu=65520 --enable-sandbox --enable-seccomp -c -r 3 --netns-type=path /ru>
           ├─12203 containers-rootlessport
           ├─12212 containers-rootlessport-child
           ├─12223 /usr/bin/conmon --api-version 1 -c 047754c68186072730953f05bee695c902731850272d1d87522f49c4e1f0858c -u 047754c6818607273>
           └─12226 /portainer

Sep 11 23:16:44 comm-guac systemd[1212]: podman.service: Found left-over process 12226 (n/a) in control group while starting unit. Ignoring.
Sep 11 23:16:44 comm-guac systemd[1212]: This usually indicates unclean termination of a previous run, or service implementation deficienci>
Sep 11 23:16:44 comm-guac systemd[1212]: podman.service: Failed to allocate exec_fd pipe: Too many open files
Sep 11 23:16:44 comm-guac systemd[1212]: podman.service: Failed to run 'start' task: Too many open files
Sep 11 23:16:44 comm-guac systemd[1212]: podman.service: Failed with result 'resources'.
Sep 11 23:16:44 comm-guac systemd[1212]: Failed to start Podman API Service.
Sep 12 07:39:04 comm-guac systemd[1212]: podman.service: Failed to allocate exec_fd pipe: Too many open files
Sep 12 07:39:04 comm-guac systemd[1212]: podman.service: Failed to run 'start' task: Too many open files
Sep 12 07:39:04 comm-guac systemd[1212]: podman.service: Failed with result 'resources'.
Sep 12 07:39:04 comm-guac systemd[1212]: Failed to start Podman API Service.

$ systemctl --user status podman.socket
● podman.socket - Podman API Socket
   Loaded: loaded (/usr/lib/systemd/user/podman.socket; enabled; vendor preset: enabled)
   Active: active (listening) since Sun 2021-09-12 07:39:04 IDT; 6min ago
     Docs: man:podman-system-service(1)
   Listen: /run/user/17038/podman/podman.sock (Stream)
   CGroup: /user.slice/user-17038.slice/user@17038.service/podman.socket

Sep 12 07:39:04 comm-guac systemd[1212]: Listening on Podman API Socket.

There was zero activity on the containers from the user side (days off); 3 containers were just running, and only one of them depends on the socket (Portainer). Now Portainer can't connect to the socket because of the podman.service failure.
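
As a stopgap until the leak is fixed, the failed units can usually be brought back by hand (a sketch, not a fix):

systemctl --user reset-failed podman.socket podman.service
systemctl --user start podman.socket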

mheon commented 3 years ago

Please retry with 3.3 - a number of fixes from @jwhonce landed in that release, which should alleviate the issue.

romanstech commented 3 years ago

Please retry with 3.3 - a number of fixes from @jwhonce landed in that release, which should alleviate the issue.

I still don't see v3.3 in the RHEL repository, so I must wait until Red Hat puts the version in their repo. And I don't want to compile Podman from source.

FYI, I use crun as the OCI runtime. BTW, I've enabled cgroups v2 (https://www.redhat.com/en/blog/world-domination-cgroups-rhel-8-welcome-cgroups-v2):

$ ls /sys/fs/cgroup
cgroup.controllers       cgroup.subtree_control   init.scope        system.slice
cgroup.max.depth         cgroup.threads           io.pressure       user.slice
cgroup.max.descendants   cpu.pressure             io.stat
cgroup.procs             cpuset.cpus.effective    memory.pressure
cgroup.stat              cpuset.mems.effective    memory.stat

I disabled PID limiting for the container because of a RHEL 8.4 bug (https://access.redhat.com/solutions/5913671):

# vim /usr/share/containers/containers.conf
[containers]
pids_limit=0

After the reboot all my containers are up. What I observe is that podman.service is inactive almost all the time, while podman.socket is always active. podman.service becomes active and answers only when a container uses podman.socket; after the request is handled it becomes inactive again until the next API request, like this one:

$ curl -H "Content-Type: application/json" --unix-socket $XDG_RUNTIME_DIR/podman/podman.sock http://localhost/_ping

Is this normal behaviour?

I will continue observing the issue and update the case later.

mheon commented 3 years ago

Yes, that is normal. The service is shutting down when not in use to lower resource use at idle.
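
A quick way to see that behaviour, reusing the ping request from earlier in this thread:

systemctl --user is-active podman.service     # typically "inactive" while idle
curl --unix-socket $XDG_RUNTIME_DIR/podman/podman.sock http://localhost/_ping
systemctl --user is-active podman.service     # reports "active" right after the request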

romanstech commented 3 years ago

Yes, that is normal. The service is shutting down when not in use to lower resource use at idle.

Thank you for the answer.

I want to monitor the Podman REST API from an external server. As I understand it, by default the REST API is only available on localhost. Do you have a ready-to-use configuration that publishes the Podman REST API on the host's interface so it can be reached from another server? I would then configure Nginx as a reverse proxy and add authentication. Maybe you have a link to such a solution? That would be great.
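
For context, the kind of setup being asked about might look roughly like this (purely illustrative; the TCP listener itself has no authentication or TLS, so it should stay behind the proxy and firewall):

# serve the API on a local TCP port instead of (or in addition to) the unix socket
podman system service --time=0 tcp:127.0.0.1:8080
# then front it with Nginx (proxy_pass http://127.0.0.1:8080; plus auth_basic and TLS)
curl http://127.0.0.1:8080/_ping    # quick local check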

Thanks in advance.

romanstech commented 3 years ago

The same behaviour:

● podman.service - Podman API Service
   Loaded: loaded (/usr/lib/systemd/user/podman.service; static; vendor preset: enabled)
   Active: failed (thawing) (Result: resources) since Mon 2021-09-13 04:29:36 IDT; 2h 57min ago
     Docs: man:podman-system-service(1)
  Process: 785578 ExecStart=/usr/bin/podman $LOGGING system service (code=exited, status=0/SUCCESS)
 Main PID: 785578 (code=exited, status=0/SUCCESS)

Sep 13 04:30:31 comm-guac systemd[1167]: podman.service: Failed to set memory.swap.max: Too many open files
Sep 13 04:30:31 comm-guac systemd[1167]: podman.service: Failed to set pids.max: Too many open files
Sep 13 04:30:31 comm-guac systemd[1167]: podman.service: Failed to allocate exec_fd pipe: Too many open files
Sep 13 04:30:31 comm-guac systemd[1167]: podman.service: Failed to run 'start' task: Too many open files
Sep 13 04:30:31 comm-guac systemd[1167]: podman.service: Failed with result 'resources'.
Sep 13 04:30:31 comm-guac systemd[1167]: Failed to start Podman API Service.
Sep 13 04:30:31 comm-guac systemd[1167]: podman.service: Failed to allocate exec_fd pipe: Too many open files
Sep 13 04:30:31 comm-guac systemd[1167]: podman.service: Failed to run 'start' task: Too many open files
Sep 13 04:30:31 comm-guac systemd[1167]: podman.service: Failed with result 'resources'.
Sep 13 04:30:31 comm-guac systemd[1167]: Failed to start Podman API Service.

● podman.socket - Podman API Socket
   Loaded: loaded (/usr/lib/systemd/user/podman.socket; enabled; vendor preset: enabled)
   Active: failed (Result: trigger-limit-hit) since Mon 2021-09-13 04:30:31 IDT; 2h 56min ago
     Docs: man:podman-system-service(1)
   Listen: /run/user/17038/podman/podman.sock (Stream)

Sep 12 10:45:47 comm-guac systemd[1167]: Listening on Podman API Socket.
Sep 13 04:30:31 comm-guac systemd[1167]: podman.socket: Trigger limit hit, refusing further activation.
Sep 13 04:30:31 comm-guac systemd[1167]: podman.socket: Failed to kill control group /user.slice/user-17038.slice/user@17038.service/podman.socket, ignoring: Too many open files
Sep 13 04:30:31 comm-guac systemd[1167]: podman.socket: Failed to kill control group /user.slice/user-17038.slice/user@17038.service/podman.socket, ignoring: Too many open files
Sep 13 04:30:31 comm-guac systemd[1167]: podman.socket: Failed with result 'trigger-limit-hit'.

$ ulimit -n
512288
$ lsof -l | grep podman | wc -l
517
$ lsof | grep podman | wc -l
61111

It means that it took 14–16 hours to fail. Will wait for v.3.3.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 2 years ago

Since Podman 3.4 has been released, we believe this is now fixed.

romanstech commented 2 years ago

RHEL 8 finally received podman v3.3.1. I've been testing the socket for the last 2 days and it seems to work stably.

Thank you very much.

$ systemctl --user status podman.socket
● podman.socket - Podman API Socket
   Loaded: loaded (/usr/lib/systemd/user/podman.socket; enabled; vendor preset: enabled)
   Active: active (listening) since Sun 2021-11-14 09:48:58 IST; 2 days ago
     Docs: man:podman-system-service(1)
   Listen: /run/user/17038/podman/podman.sock (Stream)
    Tasks: 0 (limit: 24915)
   Memory: 0B
   CGroup: /user.slice/user-17038.slice/user@17038.service/podman.socket

Name         : podman
Version      : 3.3.1
Release      : 9.module+el8.5.0+12697+018f24d7
Architecture : x86_64
Size         : 48 M
Source       : podman-3.3.1-9.module+el8.5.0+12697+018f24d7.src.rpm
Repository   : @System
From repo    : rhel-8-for-x86_64-appstream-rpms
Summary      : Manage Pods, Containers and Container Images

gamaraan commented 2 years ago

I still experience the same behavior on RHEL 8.5 with Podman 3.3.1. It can be reproduced within hours after a server reboot.


Operating System: Red Hat Enterprise Linux 8.5 (Ootpa)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:8::baseos
            Kernel: Linux 4.18.0-348.12.2.el8_5.x86_64
      Architecture: x86-64

podman --version
podman version 3.3.1

ulimit -n
512288

curl -H "Content-Type: application/json" --unix-socket $XDG_RUNTIME_DIR/podman/podman.sock http://localhost/_ping
curl: (7) Couldn't connect to server

systemctl --user status podman.socket
● podman.socket - Podman API Socket
   Loaded: loaded (/usr/lib/systemd/user/podman.socket; enabled; vendor preset: enabled)
   Active: failed (Result: trigger-limit-hit) since Wed 2022-02-09 00:11:00 UTC; 8h ago
     Docs: man:podman-system-service(1)
   Listen: /run/user/1002/podman/podman.sock (Stream)

Feb 08 07:43:08 ... systemd[1573]: Listening on Podman API Socket.
Feb 09 00:11:00 ... systemd[1573]: podman.socket: Trigger limit hit, refusing further activation.
Feb 09 00:11:00 ... systemd[1573]: podman.socket: Failed with result 'trigger-limit-hit'.

mheon commented 2 years ago

Please open a fresh issue (or, even better, a Bugzilla)