containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

podman stop: Unable to clean up network: unmounting network namespace: device or resource busy #19721

Closed: edsantiago closed this issue 1 month ago

edsantiago commented 1 year ago

Seeing this quite often in my PR with #18442 cherrypicked:

? Enter [AfterEach] TOP-LEVEL
$ podman [options] stop --all -t 0
time="2023-08-22T18:28:28-05:00" level=error msg="Unable to clean up network for container XYZ:
    "unmounting network namespace for container XYZ:
        failed to remove ns path /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365:
            remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy\""

Example: int f38 rootless.

It almost always happens together with the "Storage ... removed" flake (#19702), e.g.:

So I mostly file under that issue, because my flake tool has no provision for multiple buckets.

No pattern yet (that I can see) in when it fails, which tests, etc.

Luap99 commented 1 year ago

EBUSY should only be the error if the netns is still mounted. However, we unmount directly before the remove call, so that should not cause problems.

The umount call uses MNT_DETACH, so it may not actually unmount right away. I have no idea why this flag is used or how it interacts with the nsfs mount points.
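
For context, here is a minimal sketch of the cleanup pattern under discussion (not the exact containers/common code; the error handling, including the EINVAL special case, is a sketch-level assumption): lazily detach the bind mount, then unlink the mount point. The suspicion above is that the lazy unmount may not have finished by the time the remove runs.

package netnsutil

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// unmountAndRemove mirrors the unmount-then-unlink sequence described above.
func unmountAndRemove(nsPath string) error {
	// MNT_DETACH performs a lazy unmount: the mount point is detached
	// immediately, but the kernel finishes tearing it down only once it
	// is no longer busy. EINVAL ("not a mount point") is ignored here,
	// a simplification of the real error handling.
	if err := unix.Unmount(nsPath, unix.MNT_DETACH); err != nil && err != unix.EINVAL {
		return fmt.Errorf("failed to unmount NS: at %s: %w", nsPath, err)
	}
	// Unlink the bind-mount target; if the lazy unmount has not completed,
	// this is roughly where an EBUSY like the one above could surface.
	if err := os.Remove(nsPath); err != nil {
		return fmt.Errorf("failed to remove ns path %s: %w", nsPath, err)
	}
	return nil
}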

edsantiago commented 1 year ago

Seeing this a lot, but again, usually with #19702 ("Storage ... removed") and I can't dual-assign flakes. So since that one already has a ton of sample failures, I'm going to start assigning flakes here. Maybe alternating. Here are last night's failures:

One of those also includes this fun set of errors, does this help?

time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to get ips for container ID 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="tearing down network namespace configuration for container 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce: netavark: open container netns: open /run/netns/netns-0de355c1-cf59-05e0-c2cd-279f59a296f3: IO error: No such file or directory (os error 2)"

edsantiago commented 1 year ago

I've got a little more data. There seem to be two not-quite-identical failure modes, depending on root or rootless:

f38 rootless:

Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to remove ns path /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365:
    remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy

f38 root:

IPAM error: failed to get ips for container ID XYZ on network podman
IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman
tearing down network namespace configuration for container XYZ:
    netavark: open container netns:
    open /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: IO error: No such file or directory (os error 2)
Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to unmount NS: at /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: no such file or directory

That is:

  1. root has two IPAM errors
  2. root has a "tearing down ... open container netns" error
  3. rootless says "failed to remove ns path ... EBUSY", root says "failed to unmount NS ... ENOENT"

They're too similar for me to split this into two separate issues, but I'll listen to the opinion of experts.

HTH.

rhatdan commented 1 year ago

Do you think you can remove the MNT_DETACH?

Luap99 commented 1 year ago

I have no idea why it was added in the first place, maybe it is needed?

Git blame goes all the way back to 8c52aa15f0e4927c0e570102efaa34dbe93d6156, which claims it needs MNT_DETACH but provides no explanation of why. @mheon, this code was forked from the CNI upstream, and that one never used it, so...


f38 root:

IPAM error: failed to get ips for container ID XYZ on network podman
IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman
tearing down network namespace configuration for container XYZ:
    netavark: open container netns:
    open /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: IO error: No such file or directory (os error 2)
Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to unmount NS: at /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: no such file or directory

That is:

1. root has two IPAM errors

2. root has a "tearing down ... open container netns" error

The root issue must be something else entirely; the symptoms look like we try to clean up twice.

edsantiago commented 1 year ago

WHEW! After much suffering, I removed MNT_DETACH. That causes absolutely everything to fail hard, even system tests which so far have been immune to this flake.

mheon commented 1 year ago

I think I originally added the MNT_DETACH flag because we were seeing intermittent failures during cleanup due to the namespace still being in use, and I was expecting that making the unmount lazy would resolve things.

edsantiago commented 1 year ago

I'm giving up on this: I am pulling the stderr-on-teardown checks from my flake-check PR. It's too much, costing me way too much time between this and #19702. Until these two are fixed, I can't justify the time it takes me to sort through these flakes.

FWIW, here is the catalog so far:

Seen in: int podman fedora-37/fedora-38/rawhide root/rootless container/host boltdb/sqlite

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

edsantiago commented 1 year ago

Seen in: int/sys fedora-37/fedora-38/rawhide root/rootless container/host boltdb/sqlite

Luap99 commented 1 year ago

I see different errors posted here that are not the same!

The original report says: remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy\""

Just to confirm: I looked at ip netns, which also does a simple bind mount for the netns paths, and to delete it they likewise call umount with MNT_DETACH followed by unlink(), which is exactly what our code does as well.

https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/ip/ipnetns.c#n735 https://github.com/containers/common/blob/6856d56252121a665cb820777982cc3f61f815af/pkg/netns/netns_linux.go#L191-L197

So this is not something I can understand; the unlink should not fail with EBUSY if the path is unmounted.

But I also see: failed to unmount NS: at /run/user/3138/netns/netns-fa758949-40a2-a5e2-2226-5523b2a4c0e7: no such file or directory

These are two different things. "no such file or directory" means we are trying to clean up again after something else has already cleaned it up. This matches the other error that is logged when this happens: "Storage for container 86154f69fa1feff30896274cc265e37fa78745adb1bd2778927053f8bbe7be36 has been removed"

The same goes for errors like this:

time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to get ips for container ID 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="tearing down network namespace configuration for container 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce: netavark: open container netns: open /run/netns/netns-0de355c1-cf59-05e0-c2cd-279f59a296f3: IO error: No such file or directory (os error 2)"

Here there must be a way that we for whatever reason end up in the cleanup path twice.

mhoran commented 1 year ago

I do see the device or resource busy error every now and then. Oddly, just recently it seemed to clear itself up, eventually. When I podman unshare I cannot umount the file, nor remove it: mount reports not mounted; rm reports device or resource busy. However, I believe that from outside the namespace I can delete the file. lsns --type=net does not show any corresponding NSFS entry. Quite odd indeed.
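
For reference, the diagnostic sequence described above looks roughly like this (the netns filename is a placeholder):

# inside the user namespace
podman unshare umount /run/user/$UID/netns/netns-<id>   # reports "not mounted"
podman unshare rm /run/user/$UID/netns/netns-<id>       # reports "device or resource busy"
# from the host side the file can reportedly be deleted, and
lsns --type=net                                          # shows no corresponding nsfs entry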

edsantiago commented 1 year ago

I see different errors posted here that are not the same!

That is my fault, and I'm sorry. When scanning flakes to assign them to buckets, I look for common patterns but don't always compare everything exactly. I will be more careful.

edsantiago commented 11 months ago

New flake. Not quite the same error message, but similar enough that I'm assigning here. From f39 rootless:

[+1802s] not ok 599 podman kube --network
...
<+010ms> # $ podman pod rm -t 0 -f test_pod
<+197ms> # time="2023-10-25T08:47:25-05:00" level=error msg="Unable to clean up network for container [infractrid]: \"unmounting network namespace for container [infractrid]: failed to unmount NS: at /run/user/6452/netns/netns-whatever: no such file or directory\""
         # 6b2f9b7e780b9986feb72921c5d8d059687b831537aa1a14d0456b3d67b3d3ee
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

edsantiago commented 10 months ago

Another ENOENT, in f39 rootless:

[+1763s] not ok 607 [700] podman kube play --replace external storage
...
<+011ms> # $ podman stop -a -t 0
<+451ms> # 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec
         # b48e32245140e2fbe17564503b67b1ff5c1263e2fd4aa3fcdea8afde28c5a626
         #
<+011ms> # $ podman pod rm -t 0 -f test_pod
<+190ms> # time="2023-11-10T13:36:19-06:00" level=error msg="getting rootless network namespace: failed to Statfs \"/run/user/2815/netns/rootless-netns-bfe0fe1f8f170aff795c\": no such file or directory"
         # time="2023-11-10T13:36:19-06:00" level=error msg="Unable to clean up network for container 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec: \"unmounting network namespace for container 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec: failed to unmount NS: at /run/user/2815/netns/netns-6ae41715-c698-4a92-8658-020349c94f6f: no such file or directory\""
         # 19faf1a389d250f9a4a71fcdae449d6f2c38a89e2e51acc9bde5ec912db1093f
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

edsantiago commented 10 months ago

The latest list. Note that some of these are ENOENT, and some are EBUSY. Until given a reason to treat these as different bugs, I will continue to lump them together.

Seen in: int(29)/sys(1) podman(30) fedora-38(20)/rawhide(4)/fedora-39(3)/fedora-39β(2)/fedora-37(1) root(19)/rootless(11) host(30) boltdb(16)/sqlite(14)
piotr-kubiak commented 8 months ago

The root issue must be something else entirely; the symptoms look like we try to clean up twice.

If that provides any clues, I am observing similar behaviour when the container is configured with the restart policy unless-stopped.
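
For reference, a container with that restart policy is started with something along these lines (container name and image are placeholders, not the reporter's actual setup):

podman run -d --name mycontainer --restart unless-stopped <image>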

edsantiago commented 7 months ago

With #18442, this is now blowing up. Different error messages, but I'm pretty sure it's all the same bug.

New EACCES variant:

# podman [options] stop --all -t 0
time="2024-03-04T06:21:40-06:00" level=error msg="Unable to clean up network for container 4f387d5ee492fa77fa287eadbbdb6725aa4e24b879de087cd6d89b6f59014e84: \"netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\""

and the old ENOENT variant:

# podman [options] stop --all -t 0
time="2024-03-04T06:26:19-06:00" level=error msg="IPAM error: failed to get ips for container ID f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe on network podman"
time="2024-03-04T06:26:19-06:00" level=error msg="IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman"
time="2024-03-04T06:26:19-06:00" level=error msg="netavark: open container netns: open /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: IO error: No such file or directory (os error 2)"
time="2024-03-04T06:26:19-06:00" level=error msg="Unable to clean up network for container f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe: \"unmounting network namespace for container f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe: failed to remove ns path: remove /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: no such file or directory, failed to unmount NS: at /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: no such file or directory\""

...and an ENOENT variant with a shorter error message:

$ podman [options] stop --all -t 0
time="2024-03-04T06:27:52-06:00" level=error msg="Unable to clean up network for container 52fd74d45cfc07d19cd49cbda438f9e57f07b37bc80627064cffb1bdc02461ad: \"unmounting network namespace for container 52fd74d45cfc07d19cd49cbda438f9e57f07b37bc80627064cffb1bdc02461ad: failed to remove ns path: remove /run/user/2221/netns/netns-db173238-0a50-83c3-bb71-ca4e7d30a28b: no such file or directory, failed to unmount NS: at /run/user/2221/netns/netns-db173238-0a50-83c3-bb71-ca4e7d30a28b: no such file or directory\""

Here are today's failures, plus one from January

Seen in: int(3)/sys(1) podman(4) fedora-39(2)/rawhide(2) root(2)/rootless(2) host(4) sqlite(4)
edsantiago commented 6 months ago

Still happening with brand-new (March 19) VMs. Three failures in just one CI run:

int podman fedora-38 root host boltdb

int podman fedora-39 rootless host sqlite

int podman rawhide root host sqlite

edsantiago commented 6 months ago

Is this (f39 rootless) the same error???

$ podman [options] stop --all -t 0
time="2024-03-20T12:25:48-05:00" \
    level=error \
    msg="Unable to clean up network for container SHA: \
        \"1 error occurred:\\n
        \\t* netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\\n\\n\""

I'm going to assume it is, and file it as such, unless told otherwise.

Luap99 commented 6 months ago

Is this (f39 rootless) the same error???

$ podman [options] stop --all -t 0
time="2024-03-20T12:25:48-05:00" \
    level=error \
    msg="Unable to clean up network for container SHA: \
        \"1 error occurred:\\n
        \\t* netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\\n\\n\""

I'm going to assume it is, and file it as such, unless told otherwise.

No, that is most likely something different.

edsantiago commented 6 months ago

My periodic ping. This seems to be happening a lot more with recent VMs

Seen in: int(18) podman(18) rawhide(10)/fedora-39(6)/fedora-38(2) rootless(10)/root(8) host(18) sqlite(16)/boltdb(2)
edsantiago commented 4 months ago

Where are we on this one? I just saw this failure, f40 root, on VMs from https://github.com/containers/automation_images/pull/349 with netavark 1.10.3-3:

# podman [options] stop --all -t 0
time="2024-05-08T08:40:48-05:00" level=error msg="IPAM error: failed to get ips for container ID 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69 on network podman"
time="2024-05-08T08:40:48-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman"
time="2024-05-08T08:40:48-05:00" level=error msg="netavark: open container netns: open /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: IO error: No such file or directory (os error 2)"
time="2024-05-08T08:40:48-05:00" level=error msg="Unable to clean up network for container 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69: \"unmounting network namespace for container 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69: failed to remove ns path: remove /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: no such file or directory, failed to unmount NS: at /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: no such file or directory\""

(error in this one is ENOENT, not EBUSY)

Luap99 commented 4 months ago

FYI: The reason you see this more is that I enabled the warnings check in AfterEach(): https://github.com/containers/podman/pull/18442

So previously we just didn't see them; in the logs above they all failed in AfterEach. As mentioned before and in other issues, the problem is that something tries to clean up twice, but I cannot see why or where this would be happening.
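
For readers not familiar with the test suite, the check enabled there is conceptually something like the following Ginkgo AfterEach (a hypothetical sketch using plain os/exec, not the actual e2e helpers):

package e2e_test

import (
	"bytes"
	"os/exec"

	ginkgo "github.com/onsi/ginkgo/v2"
	gomega "github.com/onsi/gomega"
)

// Teardown check: any error/warning printed by the cleanup command fails the
// test, which is what turned these previously silent cleanup errors into
// visible flakes.
var _ = ginkgo.AfterEach(func() {
	cmd := exec.Command("podman", "stop", "--all", "-t", "0")
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	gomega.Expect(cmd.Run()).To(gomega.Succeed())
	gomega.Expect(stderr.String()).ToNot(gomega.ContainSubstring("level=error"))
	gomega.Expect(stderr.String()).ToNot(gomega.ContainSubstring("level=warning"))
})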

edsantiago commented 4 months ago

Here's one with a lot more context, does that help? (Three total failures in this log, so be sure to Page-End then click on each individual failure)

Luap99 commented 4 months ago

Here's one with a lot more context, does that help? (Three total failures in this log, so be sure to Page-End then click on each individual failure)

Not really; all these messages are just complaining that they do not find what they would expect when asked to do cleanup. The weird part is the sequence: why does it fail only on pod rm? Before that we run stop --all -t 0, which should stop all the containers and do the cleanup. So why does pod rm do the cleanup again?

Luap99 commented 4 months ago

@edsantiago Can you spin this patch through your testing PR:

diff --git a/libpod/container_internal.go b/libpod/container_internal.go
index 7f909b78d..ce5dbe0ef 100644
--- a/libpod/container_internal.go
+++ b/libpod/container_internal.go
@@ -1396,13 +1396,13 @@ func (c *Container) stop(timeout uint) error {
        if cannotStopErr != nil {
                return cannotStopErr
        }
-       if !c.batched {
+       if !c.batched && timeout > 0 {
                c.lock.Unlock()
        }

        stopErr := c.ociRuntime.StopContainer(c, timeout, all)

-       if !c.batched {
+       if !c.batched && timeout > 0 {
                c.lock.Lock()
                if err := c.syncContainer(); err != nil {
                        if errors.Is(err, define.ErrNoSuchCtr) || errors.Is(err, define.ErrCtrRemoved) {

Not really a fix, as I don't understand the cause, but this is one place where we unlock during a critical section (with this patch the container lock stays held across the stop call when the timeout is 0). If we no longer see the issues with that, then at least I know where it starts to go wrong.

edsantiago commented 4 months ago

Good timing. I just patched & pushed it out. If I see ~5 runs without any of these flakes, I will report back. (Or, of course, if I see the flake even with this patch).

edsantiago commented 4 months ago

Sorry

Luap99 commented 4 months ago

Sorry

No need to be sorry; knowing that is good enough for me. And the fact that this fails on a simple podman create + start test is very good: I was suspecting some weird pod/container dependency issue or something with the restart policy always, but this seems to exclude such possibilities.
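
(For context, the kind of minimal sequence meant here is roughly the following; the image and container name are placeholders, not the actual test:)

podman create --name ctr <image> top
podman start ctr
podman stop --all -t 0     # the teardown step where the cleanup warnings appear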

edsantiago commented 4 months ago

The "pid file blah" message is new, does it help? In f39 rootless:

   $ podman [options] pod rm -fa -t 0
   time="2024-05-23T07:10:51-05:00" level=error msg="IPAM error: failed to get ips for container ID 5d282f67ae0c245f323779c5a57c579bd6de58bc7fbb688da46ffec96e8cc7b9 on network podman-default-kube-network"
   time="2024-05-23T07:10:51-05:00" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
>> time="2024-05-23T07:10:51-05:00" level=warning msg="failed to read rootless netns program pid: open /tmp/CI_E7Th/podman-e2e-1332300349/subtest-2709937137/runroot/networks/rootless-netns/rootless-netns-conn.pid: no such file or directory"
   time="2024-05-23T07:10:51-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
   time="2024-05-23T07:10:51-05:00" level=error msg="1 error occurred:\n\t* netavark: open container netns: open /run/user/4848/netns/netns-496a4e9f-0424-fab8-8bd4-9ffb106b0d86: IO error: No such file or directory (os error 2)\n\n"
   time="2024-05-23T07:10:51-05:00" level=error msg="Unable to clean up network for container 5d282f67ae0c245f323779c5a57c579bd6de58bc7fbb688da46ffec96e8cc7b9: \"unmounting network namespace for container 5d282f67ae0c245f323779c5a57c579bd6de58bc7fbb688da46ffec96e8cc7b9: failed to remove ns path: remove /run/user/4848/netns/netns-496a4e9f-0424-fab8-8bd4-9ffb106b0d86: no such file or directory, failed to unmount NS: at /run/user/4848/netns/netns-496a4e9f-0424-fab8-8bd4-9ffb106b0d86: no such file or directory\""

Luap99 commented 4 months ago

No not really

francoism90 commented 3 months ago

Any workarounds? I have the same issue.

Luap99 commented 3 months ago

Any workarounds? I have the same issue.

This is (mostly) a flake tracker for CI. If you have a reproducer for this issue, I would be very happy if you could share it; then maybe I can understand the issue and see if there is a workaround/fix.

francoism90 commented 3 months ago

@Luap99 It does indeed seem to happen when using a wrong configuration.

Sometimes the only workaround is to fully reboot the machine, as the rootlessport process isn't being killed or even killable.

Luap99 commented 3 months ago

@Luap99 It does indeed seem to happen when using a wrong configuration.

Sometimes the only workaround is to fully reboot the machine, as the rootlessport process isn't being killed or even killable.

Can you share podman commands? What does "wrong configuration" mean? I also don't see how this is related to rootlessport. The rootlessport process should exit on its own after conmon exits. If it does not do that, then it is a bug; however, it should always be killable via SIGKILL, I assume.

francoism90 commented 3 months ago

@Luap99 I'm using Podman Quadlet.

The problem occurs when a container fails to run at some point during startup, e.g. because you accidentally made a configuration mistake.

It's possible to stop the container, but for some reason the network process doesn't know what to do anymore, and you'll end up with the IPAM error or the "cannot clean up network" error.

The only way out is to kill all the network-related processes for Podman, but even that may not be enough. After rebooting the machine, the container and its configured network start just fine.

I haven't used Docker rootless on Linux for a long time, but I can remember the same problem. It just seems to lose track of the networking and doesn't know what is actually killed or not. As I said before, the rootlessport processes stay open, even when you stop the container.

francoism90 commented 3 months ago

An example that triggers this is using a webserver image, e.g. nginx-alpine:latest. Make sure it uses a network, runs rootless, and that other containers that communicate with it use requires=nginx as a service option.

Add a site/server block in the nginx container and make a mistake in its configuration, for example listen 8080 instead of listen 8080; (note the missing trailing ;). Something like that can easily be overlooked. I don't know whether an invalid proxy address also triggers it.

When you now start the container(s), the main process will crash because of the configuration error. So you stop the container and fix the configuration, but when you start it again, you'll end up with network errors.

Sometimes it can be fixed by killing all the processes, like rootlessport and things connected to it, but sometimes you really have to reboot to recover. After a reboot everything works as expected.
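
To illustrate the kind of typo described above, a hypothetical nginx server block (names and ports are examples only):

server {
    listen 8080            # broken: missing trailing ';', nginx fails to start
    # listen 8080;         # correct form
    server_name example.local;

    location / {
        proxy_pass http://127.0.0.1:3000;
    }
}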

This is my full config, if you're interested: https://github.com/foxws/foxws/tree/main/podman

edsantiago commented 2 months ago

This one has been happening A LOT in my no-retry PR. And, in case it helps, also in my parallel-system-tests PR.

Seen in: int(76)/sys(2) podman(78) rawhide(26)/fedora-39(24)/debian-13(12)/fedora-40(10)/fedora-38(6) rootless(40)/root(38) host(71)/container(7) sqlite(65)/boltdb(13)
edsantiago commented 2 months ago

Just as an FYI, this is the past week's worth of this bug, counting only the instances where "netns ref counter out of sync" does not appear in the error message:

Seen in: int(12) podman(11)/remote(1) fedora-40(6)/rawhide(3)/debian-13(3) root(9)/rootless(3) host(9)/container(3) sqlite(12)
edsantiago commented 2 months ago

Funny observation about this one, and I'm not 100% certain, but I think that when I see this one (the one without "netns sync"), (1) it happens in pairs on the same CI job, and (2) I see failed to get ips. See here and compare the two failures. HTH.

Luap99 commented 2 months ago

Note that netns ref counter out of sync is a rootless-only thing, so you should never see that as root regardless. I see some issues around the ref counter going sideways on errors, but that needs some root-cause error first; this one error seems like a good candidate for that, which then cascades into further errors such as the missing IPAM entries, etc...

Luap99 commented 1 month ago

Fixed in https://github.com/containers/podman/pull/23519