This error is new with my c/common network rework. Although I am very sure the bug is not in that rework itself, I have suspected for a long time (with hints from other flakes as well) that it is possible we clean up twice. I assume the ENOENT flakes from https://github.com/containers/podman/issues/19721 can now look like this.
https://api.cirrus-ci.com/v1/artifact/task/6507554541404160/html/int-podman-fedora-39-rootless-host-sqlite.log.html#t--Podman-kube-play-test-with-reserved-init-annotation-in-yaml--1 also shows it together with the other errors that indicate the double cleanup.
All quiet since initial filing, but just saw this one, f38. The "No such device" error is new:
<+025ms> # $ podman pod rm test_pod
<+316ms> # time="2024-02-27T14:03:06-06:00" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
# time="2024-02-27T14:03:06-06:00" level=error msg="Unable to clean up network for container e8108e2498248fbcea80089edf4917ba133c8c2f4579212fde2509c9382f12cd: \"1 error occurred:\\n\\t* netavark: netavark encountered multiple errors:\\n\\t- remove aardvark entries: IO error: No such file or directory (os error 2)\\n\\t- failed to delete container veth eth0: Netlink error: No such device (os error 19)\\n\\n\""
After the merge of #18442, this is now happening in AfterEach
Yeah, once I have a bit more time I will give you a diff for your flake-retry PR to capture full stack traces from within the rootless netns code, so we can see how we end up in cleanup twice. That means adding a ton of debugging code that should not be merged into main.
Checking in. Still rootless-only. Happening even with last week's VMs (pasta 2024-03-20).
| x | x | x | x | x | x |
|---|---|---|---|---|---|
| int(8) | podman(13) | fedora-39(5) | rootless(13) | host(13) | sqlite(10) |
| sys(5) | | rawhide(4) | | | boltdb(3) |
| | | fedora-38(3) | | | |
| | | fedora-39β(1) | | | |
Ping, hi. Here's one with two failure messages:
<+010ms> # $ podman pod rm test_pod
<+235ms> # time="2024-05-15T13:02:45Z" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
# time="2024-05-15T13:02:45Z" level=error msg="Unable to clean up network for container 65a4e407b7384fbc341a6c0bf538968e2c1aae528d0c5d635ac2c17cf4397203: \"1 error occurred:\\n\\t* netavark: netavark encountered multiple errors:\\n\\t- IO error: No such file or directory (os error 2)\\n\\t- failed to delete container veth eth0: Netlink error: No such device (os error 19)\\n\\n\""
The real failure is probably "unable to clean up network", which could be #19721 or could be yet another something new.
Current flake list (may be incomplete, because some of these may be filed under other issues)
| x | x | x | x | x | x |
|---|---|---|---|---|---|
| int(9) | podman(15) | fedora-39(7) | rootless(15) | host(15) | sqlite(12) |
| sys(6) | | rawhide(4) | | | boltdb(3) |
| | | fedora-38(3) | | | |
| | | debian-13(1) | | | |
The real error is that something calls network cleanup twice, AFAICT, which also applies to https://github.com/containers/podman/issues/19721. All the failures linked here are kube play tests, so that at least gives me a point to start with, I guess.
Anyway, I really need full stack traces of both processes calling cleanup, but there is no good way to get them: most likely only one of the two errors out like that, so I cannot include a trace in the error message.
This one is really blowing up lately. Maybe the new CI VMs, or maybe the tmpfs PR, or maybe something else entirely. Last 30 days:
| x | x | x | x | x | x |
|---|---|---|---|---|---|
| sys(10) | podman(11) | debian-13(4) | rootless(11) | host(11) | sqlite(7) |
| int(1) | | fedora-39(4) | | | boltdb(4) |
| | | rawhide(2) | | | |
| | | fedora-40(1) | | | |
This one always seems to happen along with "Unable to clean up network" (#19721). For example:
<+011ms> # $ podman pod rm p-t645-r4ya3vwa
<+322ms> # time="2024-07-23T19:42:19Z" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
# time="2024-07-23T19:42:20Z" level=error msg="Unable to clean up network for container b60e5ea24ca3b2ddba89ffaf606c7dce5defe6dfc3a20f4c2fa8c836969a24aa: \"1 error occurred:\\n\\t* netavark: IO error: No such file or directory (os error 2)\\n\\n\""
# b7c966087f24a711938bcf3cb8e4851dec813f14be5b5973d8da5fd91f4ff6bc
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Command succeeded, but issued unexpected warnings
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Are they the same bug? Should I close one in favor of the other, and assign them all to the same issue?
| x | x | x | x | x | x |
|---|---|---|---|---|---|
| sys(24) | podman(40) | fedora-39(12) | rootless(40) | host(40) | sqlite(31) |
| int(16) | | debian-13(10) | | | boltdb(9) |
| | | rawhide(8) | | | |
| | | fedora-40(7) | | | |
| | | fedora-38(3) | | | |
My guess is that they have the same cause. That said, I also don't understand what is going on, other than that we are trying to clean up the networks twice, so I cannot be sure.
Anyhow, it looks like this has gotten much worse, so I am putting it on my priority list for next week.
Still happening frequently, even in non-Ed PRs. Here's one new error message I hadn't seen before (the middle one, aardvark netns):
<+024ms> # $ podman pod rm p-t115-2c0yuhvq
<+294ms> # time="2024-07-19T07:08:01-05:00" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
# [ERROR netavark::dns::aardvark] aardvark-dns runs in a different netns, dns will not work for this container. To resolve please stop all containers, kill the aardvark-dns process, remove the /run/user/2415/containers/networks/aardvark-dns directory and then start the containers again
# time="2024-07-19T07:08:01-05:00" level=error msg="Unable to clean up network for container 6b7b957f005d00b8d66cfd93b3fe7393c871ce858708411cdba8e77d394855c4: \"1 error occurred:\\n\\t* netavark: failed to get bridge interface: Netlink error: No such device (os error 19)\\n\\n\""
This was in my parallel-bats PR
I have been looking at the errors for hours today without having any solid idea what is causing this stuff.
I found one issue around `podman unshare --rootless-netns` handling when the command there exits > 0:
ERRO[0005] rootless netns ref counter out of sync, counter is at -1, resetting it back to 0
ERRO[0005] Failed to decrement ref count: rootless netns: write ref counter: open /run/user/1000/containers/networks/rootless-netns/ref-count: no such file or directory
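(For reference, a rough way to poke at that code path locally might be to run a failing command inside the rootless netns and then look at the ref counter the error complains about. This is only a sketch; the exact trigger conditions are an assumption.)

```bash
# Run any command that exits non-zero inside the rootless netns
# (illustrative; "false" just stands in for a failing command).
podman unshare --rootless-netns false

# Then inspect the ref counter state mentioned in the error above.
cat /run/user/$UID/containers/networks/rootless-netns/ref-count
```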
But this is not directly related to the symptoms here. It is very clear from the error sequence that something is calling cleanupNetwork() twice, but this should be practically impossible unless the cleanup process was killed after it called netavark but before it wrote the netns to the db (and I don't see that happening).
Your new error looks even weirder, because netavark tries to delete the veth interface first (no error is logged about that, so it worked) and only then the bridge (which failed). Given the aardvark-dns warning, what seems to have happened is that something deleted the rootless-netns even though it was still in use; the cleanup then created a new one and no longer found the expected interfaces there. So in short, something is decrementing the ref counter more than it should, which is what we are seeing in the other errors as well.
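(As an aside, it is possible to peek inside the rootless netns while containers are running, to check whether the expected interfaces are actually there. Illustrative only; the interface names depend on the network config.)

```bash
# List the interfaces inside the rootless network namespace; with a rootless
# container on the default network one would expect the podman1 bridge plus
# one veth per container.
podman unshare --rootless-netns ip link show
```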
It would be great if we manage to get a local reproducer, I tried running the commands from the failed tests without luck.
It only fails in kube-related tests. Eliminating int tests and my bats-parallel PR, I see failures only in 700.bats, in the `podman play --service-container` test. When I look there, I see a `timeout $PODMAN ... &>/dev/null &`. I will look into capturing and displaying that log.
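(Roughly what the capture could look like; this is illustrative, not the actual test code. `$PODMAN`, `$PODMAN_TMPDIR`, `$yamlfile`, and the timeout value are assumed from the test harness.)

```bash
# Redirect the background kube-play output to a file instead of /dev/null...
logfile="$PODMAN_TMPDIR/podman-kube-bg.log"
timeout 60 $PODMAN play kube --service-container=true "$yamlfile" &>"$logfile" &

# ...and dump it on failure / in teardown so it ends up in the test log.
echo "# cat $logfile:"
cat "$logfile"
```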
That sounds like a good idea
Well, that was useless:
# cat /tmp/podman-kube-bg.log:
# Pod:
# 8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063
# Container:
# 66f79a26bb2ba4110fc21dfd0bfac8fe5a07bda18e2c6aaf7524952a33d14152
#
#
<+015ms> # $ podman pod rm p-t645-xt1p2kpr
<+125ms> # time="2024-07-29T19:15:19Z" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
# time="2024-07-29T19:15:19Z" level=error msg="Unable to clean up network for container 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73: \"1 error occurred:\\n\\t* netavark: netavark encountered multiple errors:\\n\\t- IO error: No such file or directory (os error 2)\\n\\t- failed to delete container veth eth0: Netlink error: No such device (os error 19)\\n\\n\""
# 8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063
Looking at the journal (search in-page for 8b00d6) I see this sequence:
Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: podman1: port 1(veth0) entered disabled state
Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: veth0 (unregistering): left allmulticast mode
Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: veth0 (unregistering): left promiscuous mode
Jul 29 19:15:19 cirrus-task-4883090584109056 kernel: podman1: port 1(veth0) entered disabled state
Jul 29 19:15:19 cirrus-task-4883090584109056 systemd[3945]: Started podman-187478.scope.
....
Jul 29 19:15:19 cirrus-task-4883090584109056 podman[187507]: 2024-07-29 19:15:19.708555188 +0000 UTC m=+0.071896479 container remove 66f79a26bb2ba4110fc21dfd0bfac8fe5a07bda18e2c6aaf7524952a33d14152 (image=quay.io/libpod/testimage:20240123, name=p-t645-xt1p2kpr-c-t645-xt1p2kpr, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, app=test, created_at=2024-01-24T18:38:28Z, created_by=test/system/build-testimage, io.buildah.version=1.34.1-dev)
Jul 29 19:15:19 cirrus-task-4883090584109056 podman[187507]: 2024-07-29 19:15:19.731089306 +0000 UTC m=+0.094430598 container remove 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
Jul 29 19:15:19 cirrus-task-4883090584109056 podman[187507]: 2024-07-29 19:15:19.738002418 +0000 UTC m=+0.101343711 pod remove 8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063 (image=, name=p-t645-xt1p2kpr)
...that is, shut down veth0 before the pod-rm. And some systemd stuff nearby. Is it possible that systemd is engaging in some unwanted cleanup/skullduggery?
Well, cleanup happens when the ctr stops, not when it is removed, so it is expected that the interfaces are removed before the actual container/pod is removed.
What is interesting here is to look for 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 in the journal, which is the infra ctr id.
podman[187012]: 2024-07-29 19:15:16.973147745 +0000 UTC m=+0.320669136 container create 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
systemd[3945]: Started libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope.
systemd[3945]: Started libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope - libcrun container.
podman[187012]: 2024-07-29 19:15:17.273375993 +0000 UTC m=+0.620897395 container init 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187012]: 2024-07-29 19:15:17.27937793 +0000 UTC m=+0.626899335 container start 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
conmon[187127]: conmon 38ac0841b08c00a8ab84 <nwarn>: Failed to open cgroups file: /sys/fs/cgroup/user.slice/user-5174.slice/user@5174.service/user.slice/user-libpod_pod_8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063.slice/libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope/container/memory.events
podman[187198]: 2024-07-29 19:15:17.860726101 +0000 UTC m=+0.125663852 container died 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, io.buildah.version=1.37.0-dev)
podman[187198]: 2024-07-29 19:15:18.132708377 +0000 UTC m=+0.397646117 container cleanup 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187306]: 2024-07-29 19:15:18.647994364 +0000 UTC m=+0.109348535 container restart 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
systemd[3945]: Started libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope.
systemd[3945]: Started libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope - libcrun container.
podman[187306]: 2024-07-29 19:15:18.856200067 +0000 UTC m=+0.317554231 container init 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187306]: 2024-07-29 19:15:18.860850626 +0000 UTC m=+0.322204787 container start 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
podman[187440]: 2024-07-29 19:15:19.342933574 +0000 UTC m=+0.070542825 container kill 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
conmon[187374]: conmon 38ac0841b08c00a8ab84 <nwarn>: Failed to open cgroups file: /sys/fs/cgroup/user.slice/user-5174.slice/user@5174.service/user.slice/user-libpod_pod_8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063.slice/libpod-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope/container/memory.events
podman[187440]: 2024-07-29 19:15:19.381872905 +0000 UTC m=+0.109482177 container died 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, io.buildah.version=1.37.0-dev)
systemd[3945]: Stopping libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope...
systemd[3945]: Stopped libpod-conmon-38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73.scope.
podman[187507]: 2024-07-29 19:15:19.731089306 +0000 UTC m=+0.094430598 container remove 38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73 (image=localhost/podman-pause:5.2.0-dev-1722278401, name=8b00d6fa17b6-infra, pod_id=8b00d6fa17b684c234e804ba462894bee356649d3ea82e4c3289a85f62b3f063, io.buildah.version=1.37.0-dev)
The important bit here is that there is no cleanup event after it got restarted and killed. Generally we expect:
container died
container cleanup
But in this case there is no cleanup event at all, suggesting that something went wrong during cleanup (possibly the process was killed, or we hit some error, but neither is logged anywhere in the journal, so it is impossible to know). The pod rm command then sees the container, thinks cleanup wasn't done yet, and tries again, which triggers the error.
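(A quick way to check that pairing for a given container is to grep the journal for both events; sketch only, using the infra ctr id from above.)

```bash
# Infra container ID taken from the journal excerpt above.
cid=38ac0841b08c00a8ab849f1fb84dd496ebfc7534a8879bda57c6c3796673ab73

# Every "container died" event should normally be followed by a matching
# "container cleanup" event; here the cleanup after the second "died" is missing.
journalctl --user | grep "$cid" | grep -E 'container (died|cleanup)'
```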
Reproducer! Not a great one, complicated to set up yadda yadda, but here goes:

- In `test/system/helpers.bash`, add `return` to the very top of `leak_check()` and `clean_setup()`.
- In `test/system/700*.bats`, rename `teardown()` to `cecinestpasteardown()` or whatever, just so it does not get called.
- Then run: `$ while :;do hack/bats --rootless 700:service-conta || break;done`

Reproduces the `out of sync` error within minutes for me. Then instrument podman to your heart's content. HTH
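(With teardown and leak_check disabled as above, the failed state is left behind, so once the loop breaks it can be inspected directly. Sketch only; adjust the grep pattern and time window as needed.)

```bash
# Run the test in a loop until it fails.
while :; do hack/bats --rootless 700:service-conta || break; done

# The leftover pod/containers are still around...
podman pod ps
podman ps -a

# ...and the relevant errors can be pulled out of the recent journal.
journalctl --user --since "10 minutes ago" | grep -E 'ref counter|clean up network'
```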
Seen mostly in system tests, BUT ALSO IN E2E TESTS WHERE IT IS IMPOSSIBLE TO DETECT, because it happens in test cleanup, where `rm` does not trigger a test failure. So I caught all these by accident. THIS IS HAPPENING A LOT MORE THAN WE ARE AWARE OF.