This doesn't seem like a deadlock - it seems more like Podman is constantly attempting to restart containers, resulting in at least one container having its lock taken at all times, making ps take a long time to finish as it waits to acquire locks. After 5 minutes, I haven't been able to replicate a deadlock, though podman ps is taking upwards of a minute to successfully execute. It is absolutely blowing up the load average as well - loading 8 cores to ~80%. I think this is a rather inherent limitation of our daemonless architecture: each command needs to launch a Podman cleanup process to handle the restart, which results in a massive process storm. It's why we strongly recommend using systemd-managed containers instead.
Is this a particularly slow system you're testing on? It could explain why things appear to deadlock. I'm fairly convinced there's no actual deadlock here, just a severely taxed system.
Thanks for looking into this @mheon. Yeah, it's a dedicated server with 48 cores; the deadlocking is somewhat inconsistent for me. I tried it again and couldn't get it to deadlock, but other times it deadlocks after the first couple of restart cycles on 8 containers. I would let podman commands hang for 5-10 minutes before removing the lock file and killing processes.
I'm using podman play, so I don't think there is an option for using systemd with podman play.
A friendly reminder that this issue had no activity for 30 days.
FWIW, I think that podman ps is way too expensive. The lock of a single container is acquired and released ~ a dozen times just to query certain data (e.g., state, mappings, root FS, etc.). I think we need to optimize querying that data and put it into a single locked function (rather than N locked ones).
I'll take a stab at it.
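A rough Go sketch of that "single locked function" idea, with illustrative names rather than Podman's actual API (and, as the follow-up below notes, Podman already batches these reads):

```go
package main

import (
	"fmt"
	"sync"
)

// Container is a stand-in for a libpod container and its per-container
// lock; the real type and locking are different.
type Container struct {
	lock   sync.Mutex
	state  string
	mounts []string
	rootFS string
}

// PsData is everything a ps-style listing needs from one container.
type PsData struct {
	State  string
	Mounts []string
	RootFS string
}

// BatchedPsData takes the container lock once and copies out all the
// fields, instead of N small getters that each lock and unlock.
func (c *Container) BatchedPsData() PsData {
	c.lock.Lock()
	defer c.lock.Unlock()
	return PsData{
		State:  c.state,
		Mounts: append([]string(nil), c.mounts...),
		RootFS: c.rootFS,
	}
}

func main() {
	c := &Container{state: "running", rootFS: "/var/lib/containers/storage/..."}
	fmt.Println(c.BatchedPsData().State)
}
```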
FWIW, I think that podman ps is way too expensive. The lock of a single container is acquired and released ~ a dozen times just to query certain data (e.g., state, mappings, root FS, etc.). I think we need to optimize querying that data and put it into a single locked function (rather than N locked ones).
Scratch that ... these operations are batched.
I was seeing this earlier this week in a slightly different context (podman ps and podman rm -af), so I took a further look. Current observations support it being contention on the container locks, which is exacerbated by the number of parallel processes we run. I believe our algorithm is CPU cores * 3 + 1, which means that on my system I have 25 threads going for both podman ps and podman rm, each contending for CPU time, and each aggressively trying to take locks for the containers they are operating on. In short, we aren't waiting on a single lock for a minute, we're waiting on a hundred locks for a second or two each. I don't really know if we can improve this easily.
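For illustration, a minimal Go sketch of that kind of bounded fan-out; the cores * 3 + 1 sizing and the per-container locking come from the description above, while everything else (the channel semaphore, the 200 containers) is made up:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// Parallel job count roughly as described above: CPU cores * 3 + 1.
	maxJobs := runtime.NumCPU()*3 + 1
	sem := make(chan struct{}, maxJobs) // per-process counting semaphore

	// Illustrative per-container locks; Podman's live in shared memory
	// and are contended by every podman process on the host.
	locks := make([]sync.Mutex, 200)

	var wg sync.WaitGroup
	for i := range locks {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // take a job slot
			defer func() { <-sem }() // give it back

			locks[id].Lock() // contends with any other command touching this container
			defer locks[id].Unlock()
			// ... inspect or stop the container here ...
		}(i)
	}
	wg.Wait()
	fmt.Println("finished with", maxJobs, "parallel jobs")
}
```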
One thought I have is to print results as they come, instead of all at once when the command is done. This isn't perfect, but it would make it a lot clearer to the user what is happening (at least, it will be obvious that the commands are not deadlocked).
Other possible thought: randomize the order in which we act on containers. podman ps and podman rm were operating on the same set of containers in the same order, with one being a lot slower than the other, so ps was run second but caught up quickly and ended up waiting on locks until rm finished. Random ordering much improves our odds of getting containers that aren't in contention.
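A tiny sketch of that randomization, assuming nothing more than a slice of container IDs:

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// Illustrative container IDs; shuffling the work order lowers the odds
	// that two concurrent commands walk the same containers in lockstep.
	ids := []string{"ctr1", "ctr2", "ctr3", "ctr4", "ctr5"}
	rand.Shuffle(len(ids), func(i, j int) { ids[i], ids[j] = ids[j], ids[i] })
	fmt.Println(ids)
}
```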
I added a bit of randomization to the ordering, but it wasn't enough - no appreciable increase in performance, there are still too many collisions (25 parallel jobs over 200 test containers meant ps and stop, for example, are each working on 1/8 of the total containers at any given time - high odds of collisions, which cause lock contention, which causes ps to slow down...)
@mheon, that is a great trail you're on.
Maybe we should think in terms of a work pool rather than in terms of workers per caller. Could we have a global shared semaphore to limit the number of parallel batch workers? That would limit lock contention etc. AFAIK the locks are already fair.
We do have a semaphore right now, but it's per-process, not global. Making it global is potentially interesting, if we can get an MP-safe shared-memory semaphore.
Shared semaphore looks viable. My only concern is making sure that crashes and SIGKILL don't affect us - if, say, podman stop is running and using all available jobs, and then gets a SIGKILL, we want the semaphore to be released back to its maximum value.
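For the sake of discussion, a sketch of what a cross-process counting semaphore could look like on Linux, using a named POSIX semaphore via cgo. This is not Podman's shm_lock code; the "/podman-jobs" name and the limit of 32 are invented:

```go
package main

/*
#cgo LDFLAGS: -pthread
#include <fcntl.h>
#include <semaphore.h>
#include <stdlib.h>

// sem_open has a variadic prototype, so wrap the four-argument form for cgo.
static sem_t *job_sem_open(const char *name, unsigned int value) {
	sem_t *s = sem_open(name, O_CREAT, 0644, value);
	return s == SEM_FAILED ? NULL : s;
}
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	// A named POSIX semaphore lives in the kernel, so every process that
	// opens "/podman-jobs" shares the same pool of job slots.
	name := C.CString("/podman-jobs")
	defer C.free(unsafe.Pointer(name))

	sem := C.job_sem_open(name, 32) // 32 global slots: a made-up limit
	if sem == nil {
		panic("sem_open failed")
	}
	defer C.sem_close(sem)

	C.sem_wait(sem) // take a slot; blocks if all slots are in use system-wide
	fmt.Println("got a global job slot")
	C.sem_post(sem) // release it; a SIGKILL between wait and post would leak
	// the slot, which is exactly the cleanup concern raised above
}
```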
@mheon Any movement on this?
Negative. Might be worth discussing at the cabal if we have time? I don't have a solid feel on how to fix this.
I have investigated this issue (it reproduces in my case too). A simple program based on the shm_lock code shows the following picture:
LockID = 1 (Pod) owner PID = 462221
LockID = 2 (infra container) owner PID = 462221
LockID = 3 (app container) owner PID = 462207
462207 is the process that is started when a restart occurs - podman container cleanup
462221 is any other process, in my case podman pod rm -f -a
And these processes are deadlocked because they are waiting for each other (a lock-ordering problem). The simplest way to reproduce it is to run the following script:
#!/bin/bash
set -o errexit

for x in {1..10000}; do
    echo "* $x *"
    podman play kube ./my-pod.yaml
    podman pod rm -f -a
    podman rm -a
done
where my-pod.yaml looks like:
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: my-pod
  name: my-pod
spec:
  containers:
  - name: app
    image: debian
    imagePullPolicy: Never
    command:
    - /bin/sleep
    args:
    - "0.001"
  hostNetwork: true
  restartPolicy: Always
So it looks like we should lock a container's pod before locking the container. Is that a good idea?
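As a toy Go illustration of that ordering rule (not Podman's code): if every code path takes the pod lock before the container lock, the cleanup process and pod rm can no longer each hold one lock while waiting for the other.

```go
package main

import (
	"fmt"
	"sync"
)

// Toy stand-ins for the shared-memory locks in the report above.
type Pod struct{ lock sync.Mutex }

type Ctr struct {
	lock sync.Mutex
	pod  *Pod
}

// withCtrLocked enforces a single ordering: pod lock first, then container
// lock. If both "podman container cleanup" and "podman pod rm -f -a"
// followed this rule, neither could hold the container lock while waiting
// for the pod lock, so the cycle above could not form.
func withCtrLocked(c *Ctr, fn func()) {
	if c.pod != nil {
		c.pod.lock.Lock()
		defer c.pod.lock.Unlock()
	}
	c.lock.Lock()
	defer c.lock.Unlock()
	fn()
}

func main() {
	p := &Pod{}
	c := &Ctr{pod: p}
	withCtrLocked(c, func() { fmt.Println("container work done under pod+ctr locks") })
}
```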
That is definitely a separate issue; please file a new bug for it.
A friendly reminder that this issue had no activity for 30 days.
@mheon Any progress on this?
Negative. I don't think we have a good solution yet.
A friendly reminder that this issue had no activity for 30 days.
I'm using podman play so I don't think there is an option for using systemd with podman play.
@gcs278, running kube play under systemd is working now. The podman-kube@ systemd template works, but I find Quadlet to be better suited.
FWIW, I had another look at the issue. I couldn't see any deadlocks, and ps performs much better than back in October '21. Podman's daemonless architecture makes it subject to lock contention, which hits pretty hard with --restart=always and failing containers.
@rhatdan @mheon I feel like we can close this issue at this point. One thing to consider is changing kube play to stop defaulting to --restart=always for containers. I know it's K8s compat, but I find it less appealing for the Podman use cases.
Cc: @Luap99 @giuseppe
It's funny that we just had a discussion with a BU student where restart always might come in handy. Imagine you have two or more containers in a pod, or multiple pods, that require services from each other. In Compose you can set which containers need to come up first before a second container starts.
In Podman we start the containers sequentially, and if Container A requires Container B, then when Container A fails we fail without ever starting Container B. If they all started simultaneously, then Container A could fail, Container B would succeed, and when Container A restarted, Container B would be running and we would get to a good state. I think the current design is that Container A keeps restarting and Container B never gets a chance. If we fix this to start simultaneously (see the sketch below), then restart always will make some sense.
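A sketch of that simultaneous start, where startContainer is a made-up placeholder rather than anything in Podman:

```go
package main

import (
	"fmt"
	"sync"
)

// startContainer is a made-up placeholder for starting one container; in
// the scenario above, "A" would keep failing until its dependency "B" is
// up, and its restart policy would then retry it.
func startContainer(name string) error {
	fmt.Println("starting", name)
	return nil
}

func main() {
	containers := []string{"A", "B"}

	// Start everything at once instead of sequentially, so a failing
	// Container A does not prevent Container B from ever coming up.
	var wg sync.WaitGroup
	for _, name := range containers {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			if err := startContainer(n); err != nil {
				fmt.Println(n, "failed to start:", err)
			}
		}(name)
	}
	wg.Wait()
}
```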
I'll take a stab and close the issue. As mentioned in https://github.com/containers/podman/issues/11940#issuecomment-1594599927, things have improved considerably since the initial report in Oct '21. Feel free to drop a comment or reopen if you think otherwise.
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
With a restart policy of always or on-failure, podman seems to really struggle and potentially deadlock when it is restarting multiple containers that are constantly exiting. I first noticed this problem when using podman play kube, where a couple of containers were constantly dying and the restart policy was always. I then added a script with just exit 1 as the entrypoint and watched podman commands begin to hang longer.
I started 8 instances of exit 1 and --restart=always containers via podman run, and podman commands took around 60 seconds to return. After about a minute, podman seemed to deadlock. Podman commands weren't returning and I couldn't stop any of the dying containers. I rm -f /dev/shm/libpod_lock and did a pkill podman to release the deadlock.
This is a big problem for us, as we can't trust podman to restart containers without deadlocking. This seems related to #11589, but I thought it would be better to track it separately since it's a different situation.
Steps to reproduce the issue:
Describe the results you received: Podman gets extremely sluggish and then deadlocks
Describe the results you expected: Podman wouldn't deadlock
Additional information you deem important (e.g. issue happens only occasionally):
Output of podman version:
Output of podman info --debug:
Package info (e.g. output of rpm -q podman or apt list podman):
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)
Yes
Additional environment details (AWS, VirtualBox, physical, etc.):