Closed: edsantiago closed this issue 4 months ago.
Could it be that the system is so slow and overburdened that it is taking more than 10 seconds?
It could be, but by my count there are >20 other `podman stop` commands in e2e tests that do not have `-t`, and I would expect to see took-too-long errors in at least one of those. What I'm seeing is consistently in the `--ignore` test.
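A rough way to reproduce that count (a heuristic sketch only; it matches the usual `[]string{"stop", ...}` convention literally, so multi-line argument lists slip through):

```bash
# Count e2e "podman stop" invocations that don't pass an explicit timeout.
# Assumes the test/e2e layout; treat the result as approximate.
grep -rn '{"stop"' test/e2e/*_test.go | grep -Ev -- '"-t"|"--time"' | wc -l
```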
A friendly reminder that this issue had no activity for 30 days.
Seen in: int podman fedora-37+fedora-38+fedora-39+fedora-39β+rawhide root+rootless host boltdb+sqlite
Still happening
Seen in: int podman fedora-37+fedora-38+fedora-39+fedora-39β+rawhide root+rootless host boltdb+sqlite
[+1003s] not ok 298 [160] podman volume rm --force
# (from function `bail-now' in file test/system/[helpers.bash, line 227](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L227),
# from function `die' in file test/system/[helpers.bash, line 790](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L790),
# from function `run_podman' in file test/system/[helpers.bash, line 435](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L435),
# in test file test/system/[160-volumes.bats, line 158](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/160-volumes.bats#L158))
# `run_podman volume rm myvol --force' failed
#
<+ > # # podman rm -t 0 --all --force --ignore
#
<+051ms> # # podman ps --all --external --format {{.ID}} {{.Names}}
#
<+052ms> # # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
<+046ms> # quay.io/libpod/testimage:20221018 f5a99120db64
#
<+026ms> # # podman volume rm -a
#
<+050ms> # # podman run -d --volume myvol:/myvol quay.io/libpod/testimage:20221018 top
<+376ms> # 7e8d417666f6f695e575285be058534f41ec6f0f3858fdda7cd632da9bda66fe
#
<+011ms> # # podman volume rm myvol
<+045ms> # Error: volume myvol is being used by the following container(s): 7e8d417666f6f695e575285be058534f41ec6f0f3858fdda7cd632da9bda66fe: volume is being used
<+005ms> # [ rc=2 (expected) ]
#
<+014ms> # # podman volume rm myvol --force
<+0010s> # time="2023-12-05T11:50:40-06:00" level=warning msg="StopSignal SIGTERM failed to stop container frosty_hopper in 10 seconds, resorting to SIGKILL"
# myvol
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Command succeeded, but issued unexpected warnings
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
<+011ms> # # podman rm -a --volumes
Also seen in sys tests: https://api.cirrus-ci.com/v1/artifact/task/6106357493923840/html/sys-podman-fedora-38-root-host-boltdb.log.html
Thanks. I have one other system test failure, on November 22. Here is the catalog so far:
Seen in: int(29)+sys(1) podman(30) fedora-37(1)+fedora-38(20)+fedora-39(3)+fedora-39β(2)+rawhide(4) root(19)+rootless(11) host(30) boltdb(16)+sqlite(14)
That is: never (yet) seen in remote or containerized; no difference between boltdb and sqlite; and mostly VFS, but not all.
Keep in mind that the logrus errors/warnings are on the server side (unless they are logged on the client, which most of them aren't), so it makes sense that you do not see these in remote CI logs. Maybe we can teach CI to capture server-side logs and error out on unexpected stderr there as well. At least in e2e that should be possible, since we have one server per test. System tests are unlikely to work, as the server is spawned outside of the testing setup.
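Roughly, per e2e test, something like this (a sketch, not actual CI plumbing; the socket path and log handling are placeholders):

```bash
# Start one API server for the test, capture its stderr, and fail the test
# if the server logged anything at warning level or above.
serverlog=$(mktemp)
podman system service --time=0 unix:///run/user/$UID/podman-test.sock \
    2>"$serverlog" &
server_pid=$!

# ... run the remote test against that socket here ...

kill "$server_pid"; wait "$server_pid" 2>/dev/null

# Same spirit as ExitCleanly(), but for the server side.
if grep -Eq 'level=(warning|error)' "$serverlog"; then
    echo "FAIL: unexpected server-side warnings/errors:" >&2
    grep -E 'level=(warning|error)' "$serverlog" >&2
    exit 1
fi
```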
Eek. Yes, e2e tests run one server per test, but most tests run a number of podman commands, some with `ExitCleanly()`, some with `Exit(N)`. I could envision setting up a per-test tracker of those calls, checking it in teardown, and doing a warning check iff there are zero `Exit()` checks... but I'm not sure it's worth the effort or complexity? Worth discussing, though. Thanks for the idea.
I have an idea what could be wrong: as PID 1, the program must register a signal handler for SIGTERM, otherwise the signal is ignored by default. This is what top does, but because signal handlers are of course part of the program and can only be installed after it has started, it could mean that podman stop was run before top was given enough time to install said handlers. It is a very small race, but given we have a bunch of weird CI flakes I could imagine that this might be the cause.
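The PID 1 part is easy to see on its own (this only demonstrates the default signal disposition, not the race itself): `sleep` never installs a SIGTERM handler, so as PID 1 the stop signal is simply ignored.

```bash
# busybox `sleep` has no SIGTERM handler, so as PID 1 in the container the
# signal is ignored and `podman stop` has to fall back to SIGKILL.
podman run -d --name ignores-sigterm quay.io/libpod/testimage:20221018 sleep 300

# Takes the full timeout and logs the same
# "StopSignal SIGTERM failed to stop container ..." warning as above.
time podman stop -t 3 ignores-sigterm
podman rm ignores-sigterm
```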
Good hypothesis. Oh how I hate `podman run -d`. Oh how I wish someone would take #18952 seriously.
A naive reproducer: `podman run -d --name test alpine sh -c 'sleep 0.05; top' && podman stop test && podman rm test`. Of course it's not the same thing, but the `sleep 0.05` alone is enough to make it flaky on my laptop.
I guess we need the same podman logs fix to make sure top has already printed output before we run podman stop.
> I guess we need the same podman logs fix to make sure top has already printed output before we run podman stop.

Or in cases where we really do not care about the stop behaviour, we could just create the container with `--stop-signal KILL` to always kill right away.
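In test terms, both options might look roughly like this (a sketch, not the fix that actually landed):

```bash
# Option 1: wait until top has produced output (and has therefore finished
# starting up and installed its signal handlers) before stopping it.
podman run -d --name test quay.io/libpod/testimage:20221018 top
timeout 10 sh -c 'until podman logs test 2>/dev/null | grep -q .; do sleep 0.1; done'
podman stop test
podman rm test

# Option 2: if the test doesn't care about graceful shutdown, skip SIGTERM
# entirely so there is nothing to race against.
podman run -d --name test2 --stop-signal KILL quay.io/libpod/testimage:20221018 top
podman stop test2
podman rm test2
```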
| Type | Podman | Distro | Privileges | Environment | DB |
|---|---|---|---|---|---|
| int(13) | podman(17) | fedora-38(17) | root(10) | host(17) | boltdb(17) |
| sys(4) | | | rootless(7) | | |
Seems interesting that it's only f38.
I will start writing PRs to run `stop -t0`.
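For tests that only need the container gone, the change is typically just this (illustrative, not a specific PR):

```bash
# Before: graceful stop; racy if the container's PID 1 hasn't installed its
# SIGTERM handler yet, and burns up to 10 seconds when the race is lost.
podman stop mycontainer

# After: the test doesn't care about graceful shutdown, so go straight to
# SIGKILL and avoid both the race and the timeout.
podman stop -t0 mycontainer
```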
I wouldn't say https://github.com/containers/podman/pull/21011 has had no effect. I have not looked at all the logs, but the ones I have looked at are all suspect for the mentioned race, so the fix was just not complete. If anything, I would argue the fix worked, as none of the old test names are mentioned in your new comment.
Looking to expand/generalize the solution here - I think most of these are fairly obvious fixes, but `podman run --replace` is not. We're not directly invoking `podman stop` there; it's getting invoked from within `podman run` as part of removing the old container. I don't think the logic for doing that is incorrect (we want to obey the usual stop timeout), but that means we have a race we can't close. Maybe we should ignore that warning message just for `run --replace`?
You mean, ignore in tests? Yes, that's actually my plan. I have a PR in the works to clean up more of these flakes, but (sigh) other more urgent issues keep coming up. Maybe I'll just submit what I have now as a stepping stone.
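In a system test that could look something like the following (a sketch; the real `run_podman` helpers handle stderr differently):

```bash
# Exercise `podman run --replace`, tolerating only the known, unavoidable
# StopSignal warning while still failing on any other warning/error.
output=$(podman run --replace -d --name myctr quay.io/libpod/testimage:20221018 top 2>&1)
unexpected=$(grep -v 'StopSignal SIGTERM failed to stop container' <<<"$output" \
             | grep -E 'level=(warning|error)' || true)
if [[ -n "$unexpected" ]]; then
    echo "FAIL: unexpected warnings: $unexpected" >&2
    exit 1
fi
```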
I have the same thing here. Podman version is 4.3.1; it seems the program cannot receive any StopSignal.
This just means your script is ignoring the SIGTERM signal.
I am not sure if this helps and is correct here: I just had a similar issue, with a container "ignoring" SIGTERM. However, when checking, the developer figured out that without an explicit --stop-signal=SIGTERM, podman assigned "37" as the stop signal to the container, which did nothing for the included scripts. You can find the issue here (in German): https://github.com/jens-maus/RaspberryMatic/issues/2717
Hence I would advise checking which stop signal is actually set in the container before trying to amend the running processes. I had this issue on Podman 4.9.3 in a rootful container. My rootless ones are all getting "15" by default correctly; no clue where the 37 comes from.
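Checking what is actually configured is a one-liner. As a side note, 37 is SIGRTMIN+3 on Linux, the signal systemd-style inits shut down on, so it most likely comes from a StopSignal set in the image rather than from podman itself:

```bash
# Print the signal `podman stop` will send for this container.  It defaults
# to the image's StopSignal if one is set, and to SIGTERM (15) otherwise.
podman inspect --format '{{.Config.StopSignal}}' mycontainer
```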
Weird new flake, where "new" means "since ExitCleanly() started checking stderr":
Happens almost exclusively in this one test. (The other one, the "pod something" one, is probably because entrypoint defaults to `sh`. Probably.)

This shouldn't happen, because `top` is quick to exit upon signal. And okay, maybe not always, maybe 2-3 seconds, but ten??

I've tried reproducing, with no luck.

Anything obvious I've missed?
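For anyone who wants to hammer on this locally, a brute-force loop like the following is one option (a sketch, assuming the standard libpod test image; not the exact commands from the attempt above):

```bash
# Hammer on the suspected startup race: start `top` detached, stop it
# immediately, and flag any stop that takes anywhere near the 10s timeout.
for i in $(seq 1 200); do
    podman run -d --name flaketest quay.io/libpod/testimage:20221018 top >/dev/null
    start=$(date +%s)
    podman stop flaketest >/dev/null
    elapsed=$(( $(date +%s) - start ))
    podman rm flaketest >/dev/null
    if (( elapsed >= 5 )); then
        echo "iteration $i: podman stop took ${elapsed}s" >&2
        exit 1
    fi
done
```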
Seen in: fedora-37/fedora-38/fedora-39/fedora-39β/rawhide root/rootless boltdb/sqlite