Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Podman lock contention when attempting to restart multiple containers #11940

Closed gcs278 closed 1 year ago

gcs278 commented 3 years ago

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

With a restart policy of always or on-failure, Podman seems to really struggle, and potentially deadlock, when it is restarting multiple containers that are constantly exiting. I first noticed this problem using podman play kube, where a couple of containers were constantly dying and the restart policy was always. I then added a script with just exit 1 as the entrypoint and watched podman commands begin to hang longer.

I started 8 instances of exit 1 containers with --restart=always via podman run, and podman commands took around 60 seconds to return. After about a minute, podman seemed to deadlock: podman commands weren't returning and I couldn't stop any of the dying containers. I ran rm -f /dev/shm/libpod_lock and did a pkill podman to release the deadlock.
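
For reference, the recovery steps described above as commands (destructive: this removes Podman's shared lock segment and kills every running podman process):

# Remove the shared-memory lock segment and kill all podman processes
rm -f /dev/shm/libpod_lock
pkill podman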

This is a big problem for us, as we can't trust podman to restart containers without deadlocking. This seems related to #11589, but I thought it would be better to track it separately since it's a different situation.

Steps to reproduce the issue:

1. Start eight containers that immediately exit and restart (or use the loop sketch shown after these steps):

   podman run -d --restart=always --entrypoint="" image_name bash -c "exit 1"

   (run the same command eight times)

2. Check podman commands like podman ps. See if podman deadlocks.
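
The same containers can be started with a loop; image_name here is just a placeholder for any image that contains bash:

# Start eight containers that exit immediately and are restarted by Podman
for i in $(seq 1 8); do
    podman run -d --restart=always --entrypoint="" image_name bash -c "exit 1"
done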

Describe the results you received: Podman gets extremely sluggish and then deadlocks

Describe the results you expected: Podman wouldn't deadlock

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

podman version 3.2.3

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.21.3
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  cgroupManager: systemd
  cgroupVersion: v1
  conmon:
    package: conmon-2.0.29-1.module+el8.4.0+11822+6cc1e7d7.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.29, commit: ae467a0c8001179d4d0adf4ada381108a893d7ec'
  cpus: 10
  distribution:
    distribution: '"rhel"'
    version: "8.2"
  eventLogger: file
  hostname: rhel82
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.18.0-193.el8.x86_64
  linkmode: dynamic
  memFree: 136581120
  memTotal: 3884625920
  ociRuntime:
    name: runc
    package: runc-1.0.0-74.rc95.module+el8.4.0+11822+6cc1e7d7.x86_64
    path: /usr/bin/runc
    version: |-
      runc version spec: 1.0.2-dev
      go: go1.15.13
      libseccomp: 2.4.1
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 3990089728
  swapTotal: 4190105600
  uptime: 2159h 52m 24.42s (Approximately 89.96 days)
registries:
  registry:5000:
    Blocked: false
    Insecure: true
    Location: registry:5000
    MirrorByDigestOnly: false
    Mirrors: []
    Prefix: registry:5000
  search: ""
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 16
    paused: 0
    running: 0
    stopped: 16
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 77
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.2.3
  Built: 1627570963
  BuiltTime: Thu Jul 29 11:02:43 2021
  GitCommit: ""
  GoVersion: go1.15.7
  OsArch: linux/amd64
  Version: 3.2.3

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.2.3-0.11.module+el8.4.0+12050+ef972f71.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

mheon commented 3 years ago

This doesn't seem like a deadlock - it seems more like Podman is constantly attempting to restart containers, resulting in at least one container having its lock taken at all times, which makes ps take a long time to finish as it waits to acquire locks. After 5 minutes, I haven't been able to replicate a deadlock, though podman ps is taking upwards of a minute to successfully execute. It is absolutely blowing up the load average as well - loading 8 cores to ~80%. I think this is a rather inherent limitation of our daemonless architecture: each restart needs to launch a Podman cleanup process, which results in a massive process storm. It's why we strongly recommend using systemd-managed containers instead.

Is this a particularly slow system you're testing on? It could explain why things appear to deadlock. I'm fairly convinced there's no actual deadlock here, just a severely taxed system.
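
For reference, a minimal sketch of the systemd-managed approach (the container name web and image my-image are placeholders; newer Podman versions recommend Quadlet instead of podman generate systemd):

# Create the container once, then let systemd start, stop, and restart it
podman create --name web my-image
podman generate systemd --files --name web
mv container-web.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now container-web.service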

gcs278 commented 3 years ago

Thanks for looking into this @mheon. Yeah, it's a dedicated server with 48 cores; the deadlocking is somewhat inconsistent for me. I tried it again and couldn't get it to deadlock, but other times it deadlocks after the first couple of restart cycles on 8 containers. I would let podman commands run for 5-10 minutes before removing the lock file and killing processes.

I'm using podman play so I don't think there is an option for using systemd with podman play.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

vrothberg commented 2 years ago

FWIW, I think that podman ps is way too expensive. The lock of a single container is acquired and released ~ a dozen times just to query certain data (e.g., state, mappings, root FS, etc.). I think we need to optimize querying that data and put it into a single locked function (rather than N locked ones).

vrothberg commented 2 years ago

I'll take a stab at it.

vrothberg commented 2 years ago

"FWIW, I think that podman ps is way too expensive. The lock of a single container is acquired and released ~ a dozen times just to query certain data (e.g., state, mappings, root FS, etc.). I think we need to optimize querying that data and put it into a single locked function (rather than N locked ones)."

Scratch that ... these operations are batched.

mheon commented 2 years ago

I was seeing this earlier this week in a slightly different context (podman ps and podman rm -af), so I took a further look. Current observations support it being contention on the container locks, which is exacerbated by the number of parallel processes we run. I believe our algorithm is CPU cores * 3 + 1, which means that on my system I have 25 threads going for both podman ps and podman rm, each contending for CPU time and each aggressively trying to take locks for the containers they are operating on. In short, we aren't waiting on a single lock for a minute; we're waiting on a hundred locks for a second or two each. I don't really know if we can improve this easily.

One thought I have is to print results as they come, instead of all at once when the command is done. This isn't perfect, but it would make it a lot clearer to the user what is happening (at least, it will be obvious that the commands are not deadlocked).

mheon commented 2 years ago

Other possible thought: randomize the order in which we act on containers. podman ps and podman rm were operating on the same set of containers in the same order, with one being a lot slower than the other, so ps was run second but caught up quickly and ended up waiting on locks until rm finished. Random ordering much improves our odds of getting containers that aren't in contention.

mheon commented 2 years ago

I added a bit of randomization to the ordering, but it wasn't enough - no appreciable increase in performance. There are still too many collisions (25 parallel jobs over 200 test containers means ps and stop, for example, are each working on 1/8 of the total containers at any given time - high odds of collisions, which cause lock contention, which causes ps to slow down...).

vrothberg commented 2 years ago

@mheon, that is a great trail you're on.

Maybe we should think in terms of a work pool rather than in terms of workers per caller. Could we have a global shared semaphore to limit the number of parallel batch workers? That would limit lock contention etc. AFAIK the locks are already fair.

mheon commented 2 years ago

We do have a semaphore right now, but it's per-process, not global. Making it global is potentially interesting, if we can get a MP-safe shared-memory semaphore.

mheon commented 2 years ago

Shared semaphore looks viable. My only concern is making sure that crashes and SIGKILL don't affect us - if, say, podman stop is running and using all available jobs, and then gets a SIGKILL, we want the semaphore to be released back to its maximum value.

rhatdan commented 2 years ago

@mheon Any movement on this?

mheon commented 2 years ago

Negative. Might be worth discussing at the cabal if we have time? I don't have a solid feel on how to fix this.

tyler92 commented 2 years ago

I have investigated this issue (it reproduces in my case too). A simple program based on the shm_lock code shows the following picture:

LockID = 1 (Pod)              owner PID = 462221
LockID = 2 (infra container)  owner PID = 462221
LockID = 3 (app container)    owner PID = 462207

462207 is the process that is started when a restart occurs (podman container cleanup); 462221 is any other process, in my case podman pod rm -f -a.

And these processes are deadlocked because they are waiting on each other (a lock-ordering problem). The simplest way to reproduce it is to run the following script:

#!/bin/bash

set -o errexit

for x in {1..10000}; do
    echo "* $x *"
    podman play kube ./my-pod.yaml
    podman pod rm -f -a
    podman rm -a
done

where my-pod.yaml looks like:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: my-pod
  name: my-pod
spec:
  containers:
  - name: app
    image: debian
    imagePullPolicy: Never
    command:
    - /bin/sleep
    args:
    - 0.001
  hostNetwork: true
  restartPolicy: Always

tyler92 commented 2 years ago

So it looks like we should lock a container's pod before locking the container. Is that a good idea?
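
As a rough illustration of the ordering problem (not Podman code - just a flock(1) sketch with placeholder lock files): one process takes the container lock and then the pod lock, the other takes them in the opposite order, and each ends up waiting on the lock the other holds.

#!/bin/bash
# Hypothetical demonstration of the lock-order deadlock using flock(1).
lockdir=$(mktemp -d)

(
    flock 3                   # take the "container" lock first
    sleep 1
    flock -w 5 4 || echo "cleanup: timed out waiting for the pod lock"
) 3>"$lockdir/container.lock" 4>"$lockdir/pod.lock" &

(
    flock 4                   # take the "pod" lock first
    sleep 1
    flock -w 5 3 || echo "pod rm: timed out waiting for the container lock"
) 3>"$lockdir/container.lock" 4>"$lockdir/pod.lock" &

wait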

mheon commented 2 years ago

That is definitely a separate issue, please file a new bug for it.

tyler92 commented 2 years ago

No problem: https://github.com/containers/podman/issues/14921

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 2 years ago

@mheon Any progress on this?

mheon commented 2 years ago

Negative. I don't think we have a good solution yet.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

vrothberg commented 1 year ago

"I'm using podman play so I don't think there is an option for using systemd with podman play."

@gcs278, running kube play under systemd works now. The podman-kube@ systemd template works, but I find Quadlet to be better suited.
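
A rough sketch of both options (file paths and unit names below are placeholders):

# Option 1: the podman-kube@ systemd template shipped with Podman
cp my-pod.yaml /etc/containers/my-pod.yaml
systemctl enable --now "podman-kube@$(systemd-escape /etc/containers/my-pod.yaml).service"

# Option 2: a Quadlet .kube unit, picked up by the Quadlet systemd generator
cat > /etc/containers/systemd/my-pod.kube <<'EOF'
[Kube]
Yaml=/etc/containers/my-pod.yaml
EOF
systemctl daemon-reload
systemctl start my-pod.service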

FWIW, I had another look at the issue. I couldn't see any deadlocks, and ps performs much better than back in October '21. Podman's daemonless architecture makes it subject to lock contention, which hits pretty hard with --restart=always and failing containers.

vrothberg commented 1 year ago

@rhatdan @mheon I feel like we can close this issue at this point. One thing to consider is changing kube play to stop defaulting to --restart=always for containers. I know it's there for K8s compatibility, but I find it less appealing for Podman's use cases.

vrothberg commented 1 year ago

Cc: @Luap99 @giuseppe

rhatdan commented 1 year ago

It's funny that we just had a discussion with a BU student where restart always might come in handy. Imagine you have two or more containers in a pod, or multiple pods, that require services from each other. In Compose you can set which containers need to come up before a second container starts.

In Podman we start the containers sequentially, and if Container A requires Container B, then when Container A fails we fail without ever starting Container B. If they all started simultaneously, Container A could fail, Container B would succeed, and when Container A restarted, Container B would be running and we would get to a good state. I think the current design is that Container A keeps restarting and Container B never gets a chance. If we fix this with simultaneous starts, then restart always will make some sense.

vrothberg commented 1 year ago

I'll take a stab and close the issue. As mentioned in https://github.com/containers/podman/issues/11940#issuecomment-1594599927, things have improved considerably since the initial report in Oct '21. Feel free to drop a comment or reopen if you think otherwise.