
Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

StopSignal SIGTERM failed to stop container in 10 seconds #20196

Closed: edsantiago closed this issue 4 months ago

edsantiago commented 1 year ago

Weird new flake, where "new" means "since ExitCleanly() started checking stderr":

# podman [options] run --http-proxy=false -d quay.io/libpod/alpine:latest top
89a002f65e0abc37edb7dd74e7402aa5d3c8e44b348be9b8173b0b4b62c84e7e
# podman [options] stop --ignore foobar 89a002f65e0abc37edb7dd74e7402aa5d3c8e44b348be9b8173b0b4b62c84e7e
time="2023-09-28T10:44:05-05:00" level=warning msg="StopSignal SIGTERM failed to stop container lucid_dubinsky in 10 seconds, resorting to SIGKILL"

Happens almost exclusively in this one test. (The other one, the "pod something" test, is probably because the entrypoint defaults to sh. Probably.)

This shouldn't happen, because top is quick to exit upon signal. And okay, maybe not always, maybe 2-3 seconds, but ten??

I've tried reproducing, with no luck:

$ while :;do cid=$(bin/podman --events-backend=file  run --http-proxy=false  -d quay.io/libpod/alpine:latest top);bin/podman stop --ignore foobar $cid;bin/podman rm $cid;done

Anything obvious I've missed?

Seen in: fedora-37/fedora-38/fedora-39/fedora-39β/rawhide root/rootless boltdb/sqlite

rhatdan commented 1 year ago

Could it be that the system is so slow and overburdened that it is taking more than 10 seconds?

edsantiago commented 1 year ago

It could be, but by my count there are >20 other podman stop commands in e2e tests that do not have -t, and I would expect to see took-too-long errors in at least one of those. What I'm seeing is consistently in the --ignore test.
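
(For reference: without -t, podman stop uses its default 10-second grace period. A quick sketch with a made-up container name, matching the warning in the title:)

$ podman stop mycontainer        # send StopSignal (SIGTERM by default), wait up to 10s, then SIGKILL
$ podman stop -t 10 mycontainer  # the same, with the default timeout spelled out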

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

edsantiago commented 1 year ago

Seen in: int podman fedora-37+fedora-38+fedora-39+fedora-39β+rawhide root+rootless host boltdb+sqlite

edsantiago commented 11 months ago

Still happening

Seen in: int podman fedora-37+fedora-38+fedora-39+fedora-39β+rawhide root+rootless host boltdb+sqlite

Luap99 commented 11 months ago
[+1003s] not ok 298 [160] podman volume rm --force
         # (from function `bail-now' in file test/system/[helpers.bash, line 227](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L227),
         #  from function `die' in file test/system/[helpers.bash, line 790](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L790),
         #  from function `run_podman' in file test/system/[helpers.bash, line 435](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L435),
         #  in test file test/system/[160-volumes.bats, line 158](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/160-volumes.bats#L158))
         #   `run_podman volume rm myvol --force' failed
         #
<+     > # # podman rm -t 0 --all --force --ignore
         #
<+051ms> # # podman ps --all --external --format {{.ID}} {{.Names}}
         #
<+052ms> # # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
<+046ms> # quay.io/libpod/testimage:20221018 f5a99120db64
         #
<+026ms> # # podman volume rm -a
         #
<+050ms> # # podman run -d --volume myvol:/myvol quay.io/libpod/testimage:20221018 top
<+376ms> # 7e8d417666f6f695e575285be058534f41ec6f0f3858fdda7cd632da9bda66fe
         #
<+011ms> # # podman volume rm myvol
<+045ms> # Error: volume myvol is being used by the following container(s): 7e8d417666f6f695e575285be058534f41ec6f0f3858fdda7cd632da9bda66fe: volume is being used
<+005ms> # [ rc=2 (expected) ]
         #
<+014ms> # # podman volume rm myvol --force
<+0010s> # time="2023-12-05T11:50:40-06:00" level=warning msg="StopSignal SIGTERM failed to stop container frosty_hopper in 10 seconds, resorting to SIGKILL"
         # myvol
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         #
<+011ms> # # podman rm -a --volumes

Also seen in sys tests: https://api.cirrus-ci.com/v1/artifact/task/6106357493923840/html/sys-podman-fedora-38-root-host-boltdb.log.html

edsantiago commented 11 months ago

Thanks. I have one other system test failure, on November 22. Here is the catalog so far:

Seen in: int(29)+sys(1) podman(30) fedora-37(1)+fedora-38(20)+fedora-39(3)+fedora-39β(2)+rawhide(4) root(19)+rootless(11) host(30) boltdb(16)+sqlite(14)

That is: never (yet) seen in remote or containerized; no difference between boltdb and sqlite; and mostly VFS, but not all.

Luap99 commented 11 months ago

Keep in mind the logrus errors/warnings are on the server side (unless they are logged on the client, which most of them aren't), so it makes sense that you do not see these in remote CI logs. Maybe we can teach CI to capture server-side logs and error on unexpected stderr there as well. At least in e2e that should be possible, as we have one server per test. System tests are unlikely to work, as the server is spawned outside of the testing setup.

edsantiago commented 11 months ago

Eek. Yes, e2e tests run one server per test, but most tests run a number of podman commands, some with ExitCleanly(), some with Exit(N). I could envision setting up a per-test tracker of those calls, checking it in teardown, and doing a warning check iff there are zero Exit() checks... but I'm not sure it's worth the effort or complexity? Worth discussing, though. Thanks for the idea.

Luap99 commented 11 months ago

I have an idea what could be wrong: as pid 1, the program must register signal handlers for SIGTERM, otherwise it will ignore the signal by default. This is what top does, but because signal handlers are of course part of the program and can only be installed after it has started, it could mean that podman stop was run before top was given enough time to install said handlers. It is a very small race, but given we have a bunch of weird CI flakes, I could imagine that this might be the cause.
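
A minimal illustration of the pid-1 behaviour, assuming busybox sleep never installs a SIGTERM handler (container name made up):

$ podman run -d --name sigdemo quay.io/libpod/alpine:latest sleep 300
$ podman stop sigdemo   # pid 1 ignores SIGTERM, so this waits the full 10s and falls back to SIGKILL
$ podman rm sigdemo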

edsantiago commented 11 months ago

Good hypothesis. Oh how I hate podman run -d. Oh how I wish someone would take #18952 seriously.

Luap99 commented 11 months ago

A naive reproducer:

podman run -d --name test alpine sh -c 'sleep 0.05; top' && podman stop test && podman rm test

Of course it's not the same thing, but the sleep 0.05 alone is enough to make it flaky on my laptop.

I guess we need the same podman logs fix to make sure top has already printed output before we run podman stop.
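
Roughly this idea, purely as a sketch (not necessarily how the actual fix is written): poll podman logs until top has produced some output, then stop.

$ cid=$(podman run -d quay.io/libpod/alpine:latest top)
$ timeout 10 sh -c "until podman logs $cid 2>/dev/null | grep -q .; do sleep 0.1; done"
$ podman stop $cid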

Luap99 commented 11 months ago

> I guess we need the same podman logs fix to make sure top has already printed output before we run podman stop.

Or in cases where we really do not care about the stop behaviour, we could just create the container with --stop-signal KILL to always kill right away.
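
For example (a sketch, with a made-up container name):

$ podman run -d --name quickstop --stop-signal SIGKILL quay.io/libpod/alpine:latest top
$ podman stop quickstop   # SIGKILL goes out immediately, so the 10-second SIGTERM warning can never appear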

edsantiago commented 9 months ago

#21011 has had no effect. This is still one of the most prevalent flakes I'm seeing, and not just in my no-retry PR:

Seen in: int(13)+sys(4) podman(17) fedora-38(17) root(10)+rootless(7) host(17) boltdb(17)

Seems interesting that it's only f38.

I will start writing PRs to run stop -t0.
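
In the tests that would look something like (hypothetical):

$ podman stop -t0 $cid   # no grace period: SIGKILL right away, so the "failed to stop in 10 seconds" warning cannot fire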

Luap99 commented 9 months ago

I wouldn't say https://github.com/containers/podman/pull/21011 has had no effect. I have not looked at all logs, but the ones I have all look like the mentioned race, so the fix was just not complete. If anything, I would argue the fix worked, as none of the old test names are mentioned in your new comment.

mheon commented 8 months ago

Looking to expand/generalize the solution here - I think most of these are fairly obvious fixes, but podman run --replace is not. We're not directly invoking podman stop there; it's getting invoked from within podman run as part of removing the old container. I don't think the logic for doing that is incorrect (we want to obey the usual stop timeout), but that means we have a race we can't close. Maybe we should ignore that warning message just for run --replace?
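
For illustration (a sketch, made-up container name): the second run here stops and removes the old container internally, so the same race can surface without any explicit podman stop in the test.

$ podman run -d --name web quay.io/libpod/alpine:latest top
$ podman run -d --replace --name web quay.io/libpod/alpine:latest top   # the stop of the old "web" happens inside podman run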

edsantiago commented 8 months ago

You mean, ignore in tests? Yes, that's actually my plan. I have a PR in the works to clean up more of these flakes, but (sigh) other more urgent issues keep coming up. Maybe I'll just submit what I have now as a stepping stone.

edsantiago commented 8 months ago

#21661

bonjour-py commented 8 months ago

I have the same thing here. Podman version is 4.3.1. It seems the program cannot receive any StopSignal. (screenshot attached)

rhatdan commented 8 months ago

This just means your script is ignoring the SIGTERM signal.
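
A common fix on the script side is to trap the signal explicitly; a minimal sketch of such an entrypoint (nothing podman-specific about it):

#!/bin/sh
# exit promptly on SIGTERM instead of ignoring it as pid 1
trap 'exit 0' TERM
while :; do
    sleep 1 &    # run sleep in the background ...
    wait $!      # ... so the trap can interrupt the wait
done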

chris42 commented 6 months ago

I am not sure if this helps and is correct here: I just had a similar issue, with a container "ignoring" SIGTERM. However, when checking, the developer figured out that without giving an explicit --stop-signal=SIGTERM, podman assigned "37" as the stop signal to the container, which did nothing for the included scripts. You can find the issue here (in German): https://github.com/jens-maus/RaspberryMatic/issues/2717

Hence I would advise checking which stop signal is actually set in the container before trying to amend the running processes. I had this issue on Podman 4.9.3 in a rootful container. My rootless ones all correctly get "15" by default; no clue where the 37 comes from.
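
One way to check is via inspect; a sketch, assuming the configured signal is exposed under .Config.StopSignal (container name made up):

$ podman inspect --format '{{.Config.StopSignal}}' mycontainer   # prints e.g. 15 or SIGTERM, depending on the podman version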