This error is new with my c/common network rework. Although I am very sure the bug is not in that rework itself, I have suspected for a long time (with hints from other flakes as well) that it is possible we clean up twice. I assume the ENOENT flakes from https://github.com/containers/podman/issues/19721 can now look like this.
https://api.cirrus-ci.com/v1/artifact/task/6507554541404160/html/int-podman-fedora-39-rootless-host-sqlite.log.html#t--Podman-kube-play-test-with-reserved-init-annotation-in-yaml--1 also shows it together with the other errors that indicate the double cleanup.
All quiet since initial filing, but just saw this one, f38. The "No such device" error is new:
<+025ms> # $ podman pod rm test_pod
<+316ms> # time="2024-02-27T14:03:06-06:00" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
# time="2024-02-27T14:03:06-06:00" level=error msg="Unable to clean up network for container e8108e2498248fbcea80089edf4917ba133c8c2f4579212fde2509c9382f12cd: \"1 error occurred:\\n\\t* netavark: netavark encountered multiple errors:\\n\\t- remove aardvark entries: IO error: No such file or directory (os error 2)\\n\\t- failed to delete container veth eth0: Netlink error: No such device (os error 19)\\n\\n\""
After the merge of #18442, this is now happening in AfterEach
Yeah, once I have a bit more time I will give you a diff for your flake-retry PR to capture full stack traces from within the rootless netns code, so we can see how we end up in cleanup twice. That means adding a ton of debugging code that should not be merged into main.
Checking in. Still rootless-only. Happening even with last week's VMs (pasta 2024-03-20).
| x | x | x | x | x | x |
|---|---|---|---|---|---|
| int(8) | podman(13) | fedora-39(5) | rootless(13) | host(13) | sqlite(10) |
| sys(5) | | rawhide(4) | | | boltdb(3) |
| | | fedora-38(3) | | | |
| | | fedora-39β(1) | | | |
Ping, hi. Here's one with two failure messages:
<+010ms> # $ podman pod rm test_pod
<+235ms> # time="2024-05-15T13:02:45Z" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
# time="2024-05-15T13:02:45Z" level=error msg="Unable to clean up network for container 65a4e407b7384fbc341a6c0bf538968e2c1aae528d0c5d635ac2c17cf4397203: \"1 error occurred:\\n\\t* netavark: netavark encountered multiple errors:\\n\\t- IO error: No such file or directory (os error 2)\\n\\t- failed to delete container veth eth0: Netlink error: No such device (os error 19)\\n\\n\""
The real failure is probably "unable to clean up network", which could be #19721 or could be yet another something new.
Current flake list (may be incomplete, because some of these may be filed under other issues)
| x | x | x | x | x | x |
|---|---|---|---|---|---|
| int(9) | podman(15) | fedora-39(7) | rootless(15) | host(15) | sqlite(12) |
| sys(6) | | rawhide(4) | | | boltdb(3) |
| | | fedora-38(3) | | | |
| | | debian-13(1) | | | |
The real error is that something calls network cleanup twice, AFAICT, which also applies to https://github.com/containers/podman/issues/19721. All the failures linked here are kube play tests, so that at least gives me a point to start with, I guess.
Anyway, I really need full stack traces of both processes calling cleanup, but there is no good way to get them: most likely only one of the two errors out like that, so I cannot include a trace in the error message.
This one is really blowing up lately. Maybe the new CI VMs, or maybe the tmpfs PR, or maybe something else entirely. Last 30 days:
| x | x | x | x | x | x |
|---|---|---|---|---|---|
| sys(10) | podman(11) | debian-13(4) | rootless(11) | host(11) | sqlite(7) |
| int(1) | | fedora-39(4) | | | boltdb(4) |
| | | rawhide(2) | | | |
| | | fedora-40(1) | | | |
This one always seems to happen along with "Unable to clean up network" (#19721). For example:
<+011ms> # $ podman pod rm p-t645-r4ya3vwa
<+322ms> # time="2024-07-23T19:42:19Z" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
# time="2024-07-23T19:42:20Z" level=error msg="Unable to clean up network for container b60e5ea24ca3b2ddba89ffaf606c7dce5defe6dfc3a20f4c2fa8c836969a24aa: \"1 error occurred:\\n\\t* netavark: IO error: No such file or directory (os error 2)\\n\\n\""
# b7c966087f24a711938bcf3cb8e4851dec813f14be5b5973d8da5fd91f4ff6bc
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Command succeeded, but issued unexpected warnings
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Are they the same bug? Should I close one in favor of the other, and assign them all to the same issue?
| x | x | x | x | x | x |
|---|---|---|---|---|---|
| sys(24) | podman(40) | fedora-39(12) | rootless(40) | host(40) | sqlite(31) |
| int(16) | | debian-13(10) | | | boltdb(9) |
| | | rawhide(8) | | | |
| | | fedora-40(7) | | | |
| | | fedora-38(3) | | | |
My guess is that they have the same cause. That said, I also don't understand what is going on, other than that we are trying to clean up the networks twice, so I cannot be sure.
Anyhow, it looks like this has gotten much worse, so I am putting it on my priority list for next week.
Still happening frequently, even in non-Ed PRs. Here's one new error message I hadn't seen before (the middle one, aardvark netns):
<+024ms> # $ podman pod rm p-t115-2c0yuhvq
<+294ms> # time="2024-07-19T07:08:01-05:00" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
# [ERROR netavark::dns::aardvark] aardvark-dns runs in a different netns, dns will not work for this container. To resolve please stop all containers, kill the aardvark-dns process, remove the /run/user/2415/containers/networks/aardvark-dns directory and then start the containers again
# time="2024-07-19T07:08:01-05:00" level=error msg="Unable to clean up network for container 6b7b957f005d00b8d66cfd93b3fe7393c871ce858708411cdba8e77d394855c4: \"1 error occurred:\\n\\t* netavark: failed to get bridge interface: Netlink error: No such device (os error 19)\\n\\n\""
This was in my parallel-bats PR
I have been looking at the errors for hours today without having any solid idea what is causing this stuff.
I found one issue around `podman unshare --rootless-netns` handling when the command there exits > 0:
ERRO[0005] rootless netns ref counter out of sync, counter is at -1, resetting it back to 0
ERRO[0005] Failed to decrement ref count: rootless netns: write ref counter: open /run/user/1000/containers/networks/rootless-netns/ref-count: no such file or directory
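(For reference, a rough way to poke at that code path locally might be to run a failing command inside the rootless netns and then look at the ref counter the error complains about. This is only a sketch; the exact trigger conditions are an assumption.)

```bash
# Run any command that exits non-zero inside the rootless netns
# (illustrative; "false" just stands in for a failing command).
podman unshare --rootless-netns false

# Then inspect the ref counter state mentioned in the error above.
cat /run/user/$UID/containers/networks/rootless-netns/ref-count
```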
But this is not directly related to the symptoms here. It is very clear from the error sequence that something is calling cleanupNetwork() twice, but this should be practically impossible unless the cleanup process was killed after it called netavark but before it wrote the netns to the db (and I don't see that happening).
Your new error looks even weirder, because netavark tries to delete the veth interface first (no error is logged about that, so it worked) and only then the bridge (which failed). Given the aardvark-dns warning, what seems to have happened is that something deleted the rootless-netns even though it was still in use; the cleanup then created a new one and no longer found the expected interfaces there. So in short, something is decrementing the ref counter more than it should, which is what we are seeing in the other errors as well.
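(As an aside, it is possible to peek inside the rootless netns while containers are running, to check whether the expected interfaces are actually there. Illustrative only; the interface names depend on the network config.)

```bash
# List the interfaces inside the rootless network namespace; with a rootless
# container on the default network one would expect the podman1 bridge plus
# one veth per container.
podman unshare --rootless-netns ip link show
```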
It would be great if we manage to get a local reproducer, I tried running the commands from the failed tests without luck.
It only fails in kube-related tests. Eliminating int tests and my bats-parallel PR, I see failures only in 700.bats, in the `podman play --service-container` test. When I look there, I see a `timeout $PODMAN ... &>/dev/null &`. I will look into capturing and displaying that log.
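(Roughly what the capture could look like; this is illustrative, not the actual test code. `$PODMAN`, `$PODMAN_TMPDIR`, `$yamlfile`, and the timeout value are assumed from the test harness.)

```bash
# Redirect the background kube-play output to a file instead of /dev/null...
logfile="$PODMAN_TMPDIR/podman-kube-bg.log"
timeout 60 $PODMAN play kube --service-container=true "$yamlfile" &>"$logfile" &

# ...and dump it on failure / in teardown so it ends up in the test log.
echo "# cat $logfile:"
cat "$logfile"
```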
That sounds like a good idea
Well, that was useless:
# cat /tmp/podman-kube-bg.log:
# Pod:
# 8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063
# Container:
# 66f79a26bb2ba4110fc21dfd0bfac8fe5a07bda18e2c6aaf7524952a33d14152
#
#
<+015ms> # $ podman pod rm p-t645-xt1p2kpr
<+125ms> # time="2024-07-29T19:15:19Z" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
# time="2024-07-29T19:15:19Z" level=error msg="Unable to clean up network for container 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73: \"1 error occurred:\\n\\t* netavark: netavark encountered multiple errors:\\n\\t- IO error: No such file or directory (os error 2)\\n\\t- failed to delete container veth eth0: Netlink error: No such device (os error 19)\\n\\n\""
# 8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063
Looking at the journal (search in-page for 8b00d6) I see this sequence:
Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: podman1: port 1(veth0) entered disabled state
Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: veth0 (unregistering): left allmulticast mode
Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: veth0 (unregistering): left promiscuous mode
Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: podman1: port 1(veth0) entered disabled state
Jul 29 19:15:19 cirrus-task-4883090584109056 systemd[3945]: Started podman-187478.scope.
....
Jul 29 19:15:19 cirrus-task-4883090584109056 podman[187507]: 2024-07-29 19:15:19.708555188 +0000 UTC m=+0.071896479 container remove 66f79a26bb2ba4110fc21dfd0bfac8fe5a07bda18e2c6aaf7524952a33d14152 (image=quay.io/libpod/testimage:20240123, name=p-t645-xt1p2kpr-c-t645-xt1p2kpr, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, app=test, created_at=2024-01-24T18:38:28Z, created_by=test/system/build-testimage, io.buildah.version=1.34.1-dev)
Jul 29 19:15:19 cirrus-task-4883090584109056 podman[187507]: 2024-07-29 19:15:19.731089306 +0000 UTC m=+0.094430598 container remove 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
Jul 29 19:15:19 cirrus-task-4883090584109056 podman[187507]: 2024-07-29 19:15:19.738002418 +0000 UTC m=+0.101343711 pod remove 8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063 (image=, name=p-t645-xt1p2kpr)
...that is, shut down veth0 before the pod-rm. And some systemd stuff nearby. Is it possible that systemd is engaging in some unwanted cleanup/skullduggery?
Well, cleanup happens when the ctr stops, not when it is removed, so it is expected that the interfaces are removed before the actual container/pod is removed.
What is interesting here is to look for 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 in the journal, which is the infra ctr id.
podman[187012]: 2024-07-29 19:15:16.973147745 +0000 UTC m=+0.320669136 container create 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
systemd[3945]: Started libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope.
systemd[3945]: Started libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope - libcrun container.
podman[187012]: 2024-07-29 19:15:17.273375993 +0000 UTC m=+0.620897395 container init 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187012]: 2024-07-29 19:15:17.27937793 +0000 UTC m=+0.626899335 container start 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
conmon[187127]: conmon 38ac0841b08c00a8ab84 <nwarn>: Failed to open cgroups file: /sys/fs/cgroup/user.slice/user-5174.slice/user@5174.service/user.slice/user-libpod_pod_8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063.slice/libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope/container/memory.events
podman[187198]: 2024-07-29 19:15:17.860726101 +0000 UTC m=+0.125663852 container died 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, io.buildah.version=1.37.0-dev)
podman[187198]: 2024-07-29 19:15:18.132708377 +0000 UTC m=+0.397646117 container cleanup 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187306]: 2024-07-29 19:15:18.647994364 +0000 UTC m=+0.109348535 container restart 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
systemd[3945]: Started libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope.
systemd[3945]: Started libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope - libcrun container.
podman[187306]: 2024-07-29 19:15:18.856200067 +0000 UTC m=+0.317554231 container init 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187306]: 2024-07-29 19:15:18.860850626 +0000 UTC m=+0.322204787 container start 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187440]: 2024-07-29 19:15:19.342933574 +0000 UTC m=+0.070542825 container kill 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
conmon[187374]: conmon 38ac0841b08c00a8ab84 <nwarn>: Failed to open cgroups file: /sys/fs/cgroup/user.slice/user-5174.slice/user@5174.service/user.slice/user-libpod_pod_8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063.slice/libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope/container/memory.events
podman[187440]: 2024-07-29 19:15:19.381872905 +0000 UTC m=+0.109482177 container died 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, io.buildah.version=1.37.0-dev)
systemd[3945]: Stopping libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope...
systemd[3945]: Stopped libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope.
podman[187507]: 2024-07-29 19:15:19.731089306 +0000 UTC m=+0.094430598 container remove 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
The important bit here is that there is no cleanup event after it got restarted and killed. Generally we expect:
container died
container cleanup
But in this case there is no cleanup event at all, suggesting that something went wrong during cleanup (possibly the process was killed, or we hit some error, but neither is logged anywhere in the journal, so it is impossible to know). The pod rm command then sees the container, thinks cleanup wasn't done yet, and tries again, which triggers the error.
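(A quick way to check that pairing for a given container is to grep the journal for both events; sketch only, using the infra ctr id from above.)

```bash
# Infra container ID taken from the journal excerpt above.
cid=38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73

# Every "container died" event should normally be followed by a matching
# "container cleanup" event; here the cleanup after the second "died" is missing.
journalctl --user | grep "$cid" | grep -E 'container (died|cleanup)'
```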
Reproducer! Not a great one, complicated to set up yadda yadda, but here goes:

- In `test/system/helpers.bash`, add `return` to the very top of `leak_check()` and `clean_setup()`.
- In `test/system/700*.bats`, rename `teardown()` to `cecinestpasteardown()` or whatever, just so it does not get called.
- Then run: `$ while :;do hack/bats --rootless 700:service-conta || break;done`

Reproduces the `out of sync` error within minutes for me. Then instrument podman to your heart's content. HTH
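(With teardown and leak_check disabled as above, the failed state is left behind, so once the loop breaks it can be inspected directly. Sketch only; adjust the grep pattern and time window as needed.)

```bash
# Run the test in a loop until it fails.
while :; do hack/bats --rootless 700:service-conta || break; done

# The leftover pod/containers are still around...
podman pod ps
podman ps -a

# ...and the relevant errors can be pulled out of the recent journal.
journalctl --user --since "10 minutes ago" | grep -E 'ref counter|clean up network'
```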
Seen mostly in system tests, BUT ALSO IN E2E TESTS WHERE IT IS IMPOSSIBLE TO DETECT, because it happens in test cleanup, where `rm` does not trigger a test failure. So I caught all these by accident. THIS IS HAPPENING A LOT MORE THAN WE ARE AWARE OF.