
Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

StopSignal SIGTERM failed to stop container in 10 seconds #20196

Closed: edsantiago closed this issue 4 months ago

edsantiago commented 1 year ago

Weird new flake, where "new" means "since ExitCleanly() started checking stderr":

# podman [options] run --http-proxy=false -d quay.io/libpod/alpine:latest top
89a002f65e0abc37edb7dd74e7402aa5d3c8e44b348be9b8173b0b4b62c84e7e
# podman [options] stop --ignore foobar 89a002f65e0abc37edb7dd74e7402aa5d3c8e44b348be9b8173b0b4b62c84e7e
time="2023-09-28T10:44:05-05:00" level=warning msg="StopSignal SIGTERM failed to stop container lucid_dubinsky in 10 seconds, resorting to SIGKILL"

Happens almost exclusively in this one test. (The other one, the "pod something" test, is probably because the entrypoint defaults to sh. Probably.)

This shouldn't happen, because top is quick to exit upon signal. And okay, maybe not always, maybe 2-3 seconds, but ten??

I've tried reproducing, with no luck:

$ while :;do cid=$(bin/podman --events-backend=file  run --http-proxy=false  -d quay.io/libpod/alpine:latest top);bin/podman stop --ignore foobar $cid;bin/podman rm $cid;done

Anything obvious I've missed?

Seen in: fedora-37/fedora-38/fedora-39/fedora-39β/rawhide root/rootless boltdb/sqlite

rhatdan commented 1 year ago

Could it be that the system is so slow and overburdened that it is taking more than 10 seconds?

edsantiago commented 1 year ago

It could be, but by my count there are >20 other podman stop commands in e2e tests that do not have -t, and I would expect to see took-too-long errors in at least one of those. What I'm seeing is consistently in the --ignore test.
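
(For reference: without -t, podman stop uses its default 10-second grace period. A quick sketch with a made-up container name, matching the warning in the title:)

$ podman stop mycontainer        # send StopSignal (SIGTERM by default), wait up to 10s, then SIGKILL
$ podman stop -t 10 mycontainer  # the same, with the default timeout spelled out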

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

edsantiago commented 1 year ago

Seen in: int podman fedora-37+fedora-38+fedora-39+fedora-39β+rawhide root+rootless host boltdb+sqlite

edsantiago commented 11 months ago

Still happening

Seen in: int podman fedora-37+fedora-38+fedora-39+fedora-39β+rawhide root+rootless host boltdb+sqlite

Luap99 commented 11 months ago
[+1003s] not ok 298 [160] podman volume rm --force
         # (from function `bail-now' in file test/system/[helpers.bash, line 227](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L227),
         #  from function `die' in file test/system/[helpers.bash, line 790](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L790),
         #  from function `run_podman' in file test/system/[helpers.bash, line 435](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L435),
         #  in test file test/system/[160-volumes.bats, line 158](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/160-volumes.bats#L158))
         #   `run_podman volume rm myvol --force' failed
         #
<+     > # # podman rm -t 0 --all --force --ignore
         #
<+051ms> # # podman ps --all --external --format {{.ID}} {{.Names}}
         #
<+052ms> # # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
<+046ms> # quay.io/libpod/testimage:20221018 f5a99120db64
         #
<+026ms> # # podman volume rm -a
         #
<+050ms> # # podman run -d --volume myvol:/myvol quay.io/libpod/testimage:20221018 top
<+376ms> # 7e8d417666f6f695e575285be058534f41ec6f0f3858fdda7cd632da9bda66fe
         #
<+011ms> # # podman volume rm myvol
<+045ms> # Error: volume myvol is being used by the following container(s): 7e8d417666f6f695e575285be058534f41ec6f0f3858fdda7cd632da9bda66fe: volume is being used
<+005ms> # [ rc=2 (expected) ]
         #
<+014ms> # # podman volume rm myvol --force
<+0010s> # time="2023-12-05T11:50:40-06:00" level=warning msg="StopSignal SIGTERM failed to stop container frosty_hopper in 10 seconds, resorting to SIGKILL"
         # myvol
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         #
<+011ms> # # podman rm -a --volumes

Also seen in sys tests: https://api.cirrus-ci.com/v1/artifact/task/6106357493923840/html/sys-podman-fedora-38-root-host-boltdb.log.html

edsantiago commented 11 months ago

Thanks. I have one other system test failure, on November 22. Here is the catalog so far:

Seen in: int(29)+sys(1) podman(30) fedora-37(1)+fedora-38(20)+fedora-39(3)+fedora-39β(2)+rawhide(4) root(19)+rootless(11) host(30) boltdb(16)+sqlite(14)

That is: never (yet) seen in remote or containerized; no difference between boltdb and sqlite; and mostly VFS, but not all.

Luap99 commented 11 months ago

Keep in mind the logrus errors/warnings are on the server side (unless they are logged on the client, which most of them aren't), so it makes sense that you do not see these in remote CI logs. Maybe we can teach CI to capture server-side logs and error on unexpected stderr there as well. At least in e2e that should be possible, as we have one server per test. System tests are unlikely to work, as the server is spawned outside of the testing setup.

edsantiago commented 11 months ago

Eek. Yes, e2e tests run one server per test, but most tests run a number of podman commands, some with ExitCleanly(), some with Exit(N). I could envision setting up a per-test tracker of those calls, checking it in teardown, and doing a warning check iff there are zero Exit() checks... but I'm not sure it's worth the effort or complexity? Worth discussing, though. Thanks for the idea.

Luap99 commented 11 months ago

I have an idea what could be wrong: as pid 1, the program must register signal handlers for SIGTERM, otherwise it will ignore the signal by default. This is what top does, but because signal handlers are of course part of the program and can only be installed after it has started, it could mean that podman stop was run before top was given enough time to install said handlers. It is a very small race, but given we have a bunch of weird CI flakes, I could imagine that this might be the cause.
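
A minimal illustration of the pid-1 behaviour, assuming busybox sleep never installs a SIGTERM handler (container name made up):

$ podman run -d --name sigdemo quay.io/libpod/alpine:latest sleep 300
$ podman stop sigdemo   # pid 1 ignores SIGTERM, so this waits the full 10s and falls back to SIGKILL
$ podman rm sigdemo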

edsantiago commented 11 months ago

Good hypothesis. Oh how I hate podman run -d. Oh how I wish someone would take #18952 seriously.

Luap99 commented 11 months ago

A naive reproducer:

podman run -d --name test alpine sh -c 'sleep 0.05; top' && podman stop test && podman rm test

Of course it's not the same thing, but the sleep 0.05 alone is enough to make it flaky on my laptop.

I guess we need the same podman logs fix to make sure top has already printed output before we run podman stop.
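
Roughly this idea, purely as a sketch (not necessarily how the actual fix is written): poll podman logs until top has produced some output, then stop.

$ cid=$(podman run -d quay.io/libpod/alpine:latest top)
$ timeout 10 sh -c "until podman logs $cid 2>/dev/null | grep -q .; do sleep 0.1; done"
$ podman stop $cid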

Luap99 commented 11 months ago

> I guess we need the same podman logs fix to make sure top has already printed output before we run podman stop.

Or in cases where we really do not care about the stop behaviour, we could just create the container with --stop-signal KILL to always kill right away.
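
For example (a sketch, with a made-up container name):

$ podman run -d --name quickstop --stop-signal SIGKILL quay.io/libpod/alpine:latest top
$ podman stop quickstop   # SIGKILL goes out immediately, so the 10-second SIGTERM warning can never appear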

edsantiago commented 9 months ago

#21011 has had no effect. This is still one of the most prevalent flakes I'm seeing, and not just in my no-retry PR:

Seen in: int(13)+sys(4) podman(17) fedora-38(17) root(10)+rootless(7) host(17) boltdb(17)

Seems interesting that it's only f38.

I will start writing PRs to run stop -t0.
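
In the tests that would look something like (hypothetical):

$ podman stop -t0 $cid   # no grace period: SIGKILL right away, so the "failed to stop in 10 seconds" warning cannot fire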

Luap99 commented 9 months ago

I wouldn't say https://github.com/containers/podman/pull/21011 has had no effect. I have not looked at all logs, but the ones I have all look like the mentioned race, so the fix was just not complete. If anything, I would argue the fix worked, as none of the old test names are mentioned in your new comment.

mheon commented 8 months ago

Looking to expand/generalize the solution here - I think most of these are fairly obvious fixes, but podman run --replace is not. We're not directly invoking podman stop there; it's getting invoked from within podman run as part of removing the old container. I don't think the logic for doing that is incorrect (we want to obey the usual stop timeout), but that means we have a race we can't close. Maybe we should ignore that warning message just for run --replace?
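
For illustration (a sketch, made-up container name): the second run here stops and removes the old container internally, so the same race can surface without any explicit podman stop in the test.

$ podman run -d --name web quay.io/libpod/alpine:latest top
$ podman run -d --replace --name web quay.io/libpod/alpine:latest top   # the stop of the old "web" happens inside podman run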

edsantiago commented 8 months ago

You mean, ignore in tests? Yes, that's actually my plan. I have a PR in the works to clean up more of these flakes, but (sigh) other more urgent issues keep coming up. Maybe I'll just submit what I have now as a stepping stone.

edsantiago commented 8 months ago

#21661

bonjour-py commented 8 months ago

I have the same thing here. Podman version is 4.3.1. It seems the program cannot receive any StopSignal. (screenshot attached)

rhatdan commented 8 months ago

This just means your script is ignoring the SIGTERM signal.
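
A common fix on the script side is to trap the signal explicitly; a minimal sketch of such an entrypoint (nothing podman-specific about it):

#!/bin/sh
# exit promptly on SIGTERM instead of ignoring it as pid 1
trap 'exit 0' TERM
while :; do
    sleep 1 &    # run sleep in the background ...
    wait $!      # ... so the trap can interrupt the wait
done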

chris42 commented 6 months ago

I am not sure if this helps and is correct here: I just had a similar issue, with a container "ignoring" SIGTERM. However, when checking, the developer figured out that without giving an explicit --stop-signal=SIGTERM, podman assigned "37" as the stop signal to the container, which did nothing for the included scripts. You can find the issue here (in German): https://github.com/jens-maus/RaspberryMatic/issues/2717

Hence I would advise checking which stop signal is actually set in the container before trying to amend the running processes. I had this issue on Podman 4.9.3 in a rootful container. My rootless ones all correctly get "15" by default; no clue where the 37 comes from.
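
One way to check is via inspect; a sketch, assuming the configured signal is exposed under .Config.StopSignal (container name made up):

$ podman inspect --format '{{.Config.StopSignal}}' mycontainer   # prints e.g. 15 or SIGTERM, depending on the podman version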