containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

podman stop: Unable to clean up network: unmounting network namespace: device or resource busy #19721

Closed: edsantiago closed this issue 1 month ago

edsantiago commented 1 year ago

Seeing this quite often in my PR with #18442 cherrypicked:

? Enter [AfterEach] TOP-LEVEL
$ podman [options] stop --all -t 0
time="2023-08-22T18:28:28-05:00" level=error msg="Unable to clean up network for container XYZ:
    "unmounting network namespace for container XYZ:
        failed to remove ns path /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365:
            remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy\""

Example: int f38 rootless.

It almost always happens together with the "Storage ... removed" flake (#19702), e.g.:

So I mostly file under that issue, because my flake tool has no provision for multiple buckets.

No pattern yet (that I can see) in when it fails, which tests, etc.

Luap99 commented 1 year ago

EBUSY should only be the error if the netns is still mounted. However, we unmount directly before the remove call, so that should not cause problems.

The umount call uses MNT_DETACH, so it may not actually unmount right away. I have no idea why this flag is used or how it interacts with the nsfs mount points.
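
For context, here is a minimal sketch of the cleanup pattern under discussion (not the exact containers/common code; the error handling, including the EINVAL special case, is a sketch-level assumption): lazily detach the bind mount, then unlink the mount point. The suspicion above is that the lazy unmount may not have finished by the time the remove runs.

package netnsutil

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// unmountAndRemove mirrors the unmount-then-unlink sequence described above.
func unmountAndRemove(nsPath string) error {
	// MNT_DETACH performs a lazy unmount: the mount point is detached
	// immediately, but the kernel finishes tearing it down only once it
	// is no longer busy. EINVAL ("not a mount point") is ignored here,
	// a simplification of the real error handling.
	if err := unix.Unmount(nsPath, unix.MNT_DETACH); err != nil && err != unix.EINVAL {
		return fmt.Errorf("failed to unmount NS: at %s: %w", nsPath, err)
	}
	// Unlink the bind-mount target; if the lazy unmount has not completed,
	// this is roughly where an EBUSY like the one above could surface.
	if err := os.Remove(nsPath); err != nil {
		return fmt.Errorf("failed to remove ns path %s: %w", nsPath, err)
	}
	return nil
}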

edsantiago commented 1 year ago

Seeing this a lot, but again, usually with #19702 ("Storage ... removed") and I can't dual-assign flakes. So since that one already has a ton of sample failures, I'm going to start assigning flakes here. Maybe alternating. Here are last night's failures:

One of those also includes this fun set of errors, does this help?

time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to get ips for container ID 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="tearing down network namespace configuration for container 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce: netavark: open container netns: open /run/netns/netns-0de355c1-cf59-05e0-c2cd-279f59a296f3: IO error: No such file or directory (os error 2)"

edsantiago commented 1 year ago

I've got a little more data. There seem to be two not-quite-identical failure modes, depending on root or rootless:

f38 rootless:

Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to remove ns path /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365:
    remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy

f38 root:

IPAM error: failed to get ips for container ID XYZ on network podman
IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman
tearing down network namespace configuration for container XYZ:
    netavark: open container netns:
    open /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: IO error: No such file or directory (os error 2)
Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to unmount NS: at /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: no such file or directory

That is:

  1. root has two IPAM errors
  2. root has a "tearing down ... open container netns" error
  3. rootless says "failed to remove ns path ... EBUSY", root says "failed to unmount NS ... ENOENT"

They're too similar for me to split this into two separate issues, but I'll listen to the opinion of experts.

HTH.

rhatdan commented 1 year ago

Do you think you can remove the MNT_DETACH?

Luap99 commented 1 year ago

I have no idea why it was added in the first place, maybe it is needed?

Git blame goes all the way back to 8c52aa15f0e4927c0e570102efaa34dbe93d6156, which claims it needs MNT_DETACH but provides no explanation of why. @mheon, this code was forked from the CNI upstream, and that one never used it, so...


f38 root:

IPAM error: failed to get ips for container ID XYZ on network podman
IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman
tearing down network namespace configuration for container XYZ:
    netavark: open container netns:
    open /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: IO error: No such file or directory (os error 2)
Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to unmount NS: at /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: no such file or directory

That is:

1. root has two IPAM errors

2. root has a "tearing down ... open container netns" error

The root issue must be something else entirely; the symptoms look like we try to clean up twice.

edsantiago commented 1 year ago

WHEW! After much suffering, I removed MNT_DETACH. That causes absolutely everything to fail hard, even system tests which so far have been immune to this flake.

mheon commented 1 year ago

I think I originally added the MNT_DETACH flag because we were seeing intermittent failures during cleanup due to the namespace still being in use, and I was expecting that making the unmount lazy would resolve things.

edsantiago commented 1 year ago

I'm giving up on this: I am pulling the stderr-on-teardown checks from my flake-check PR. It's too much, costing me way too much time between this and #19702. Until these two are fixed, I can't justify the time it takes me to sort through these flakes.

FWIW, here is the catalog so far:

Seen in: int podman fedora-37/fedora-38/rawhide root/rootless container/host boltdb/sqlite

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

edsantiago commented 1 year ago

Seen in: int/sys fedora-37/fedora-38/rawhide root/rootless container/host boltdb/sqlite

Luap99 commented 1 year ago

I see different errors posted here that are not the same!

The original report says: remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy\""

Just to confirm: I looked at ip netns, which also does a simple bind mount for the netns paths, and to delete it they likewise call umount with MNT_DETACH followed by unlink(), which is exactly what our code does as well.

https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/ip/ipnetns.c#n735 https://github.com/containers/common/blob/6856d56252121a665cb820777982cc3f61f815af/pkg/netns/netns_linux.go#L191-L197

So this is not something I can understand; the unlink should not fail with EBUSY if the path is unmounted.

But I also see: failed to unmount NS: at /run/user/3138/netns/netns-fa758949-40a2-a5e2-2226-5523b2a4c0e7: no such file or directory

These are two different things. "no such file or directory" means we are trying to clean up again after something else has already cleaned it up. This matches the other error that is logged when this happens: "Storage for container 86154f69fa1feff30896274cc265e37fa78745adb1bd2778927053f8bbe7be36 has been removed"

The same goes for errors like this:

time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to get ips for container ID 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="tearing down network namespace configuration for container 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce: netavark: open container netns: open /run/netns/netns-0de355c1-cf59-05e0-c2cd-279f59a296f3: IO error: No such file or directory (os error 2)"

Here there must be a way that we for whatever reason end up in the cleanup path twice.

mhoran commented 1 year ago

I do see the device or resource busy error every now and then. Oddly, just recently it seemed to clear itself up, eventually. When I podman unshare I cannot umount the file, nor remove it: mount reports not mounted; rm reports device or resource busy. However, I believe that from outside the namespace I can delete the file. lsns --type=net does not show any corresponding NSFS entry. Quite odd indeed.
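
For reference, the diagnostic sequence described above looks roughly like this (the netns filename is a placeholder):

# inside the user namespace
podman unshare umount /run/user/$UID/netns/netns-<id>   # reports "not mounted"
podman unshare rm /run/user/$UID/netns/netns-<id>       # reports "device or resource busy"
# from the host side the file can reportedly be deleted, and
lsns --type=net                                          # shows no corresponding nsfs entry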

edsantiago commented 1 year ago

I see different errors posted here that are not the same!

That is my fault, and I'm sorry. When scanning flakes to assign them to buckets, I look for common patterns but don't always compare everything exactly. I will be more careful.

edsantiago commented 11 months ago

New flake. Not quite the same error message, but similar enough that I'm assigning here. From f39 rootless:

[+1802s] not ok 599 podman kube --network
...
<+010ms> # $ podman pod rm -t 0 -f test_pod
<+197ms> # time="2023-10-25T08:47:25-05:00" level=error msg="Unable to clean up network for container [infractrid]: \"unmounting network namespace for container [infractrid]: failed to unmount NS: at /run/user/6452/netns/netns-whatever: no such file or directory\""
         # 6b2f9b7e780b9986feb72921c5d8d059687b831537aa1a14d0456b3d67b3d3ee
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

edsantiago commented 10 months ago

Another ENOENT, in f39 rootless:

[+1763s] not ok 607 [700] podman kube play --replace external storage
...
<+011ms> # $ podman stop -a -t 0
<+451ms> # 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec
         # b48e32245140e2fbe17564503b67b1ff5c1263e2fd4aa3fcdea8afde28c5a626
         #
<+011ms> # $ podman pod rm -t 0 -f test_pod
<+190ms> # time="2023-11-10T13:36:19-06:00" level=error msg="getting rootless network namespace: failed to Statfs \"/run/user/2815/netns/rootless-netns-bfe0fe1f8f170aff795c\": no such file or directory"
         # time="2023-11-10T13:36:19-06:00" level=error msg="Unable to clean up network for container 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec: \"unmounting network namespace for container 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec: failed to unmount NS: at /run/user/2815/netns/netns-6ae41715-c698-4a92-8658-020349c94f6f: no such file or directory\""
         # 19faf1a389d250f9a4a71fcdae449d6f2c38a89e2e51acc9bde5ec912db1093f
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

edsantiago commented 10 months ago

The latest list. Note that some of these are ENOENT, and some are EBUSY. Until given a reason to treat these as different bugs, I will continue to lump them together.

Seen in: int(29)/sys(1) podman(30) fedora-38(20)/rawhide(4)/fedora-39(3)/fedora-39β(2)/fedora-37(1) root(19)/rootless(11) host(30) boltdb(16)/sqlite(14)
piotr-kubiak commented 8 months ago

The root issue must be something else entirely; the symptoms look like we try to clean up twice.

If that provides any clues, I am observing similar behaviour when the container is configured with the restart policy unless-stopped.
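
For reference, a container with that restart policy is started with something along these lines (container name and image are placeholders, not the reporter's actual setup):

podman run -d --name mycontainer --restart unless-stopped <image>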

edsantiago commented 7 months ago

With #18442, this is now blowing up. Different error messages, but I'm pretty sure it's all the same bug.

New EACCES variant:

# podman [options] stop --all -t 0
time="2024-03-04T06:21:40-06:00" level=error msg="Unable to clean up network for container 4f387d5ee492fa77fa287eadbbdb6725aa4e24b879de087cd6d89b6f59014e84: \"netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\""

and the old ENOENT variant:

# podman [options] stop --all -t 0
time="2024-03-04T06:26:19-06:00" level=error msg="IPAM error: failed to get ips for container ID f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe on network podman"
time="2024-03-04T06:26:19-06:00" level=error msg="IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman"
time="2024-03-04T06:26:19-06:00" level=error msg="netavark: open container netns: open /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: IO error: No such file or directory (os error 2)"
time="2024-03-04T06:26:19-06:00" level=error msg="Unable to clean up network for container f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe: \"unmounting network namespace for container f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe: failed to remove ns path: remove /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: no such file or directory, failed to unmount NS: at /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: no such file or directory\""

...and an ENOENT variant with a shorter error message:

$ podman [options] stop --all -t 0
time="2024-03-04T06:27:52-06:00" level=error msg="Unable to clean up network for container 52fd74d45cfc07d19cd49cbda438f9e57f07b37bc80627064cffb1bdc02461ad: \"unmounting network namespace for container 52fd74d45cfc07d19cd49cbda438f9e57f07b37bc80627064cffb1bdc02461ad: failed to remove ns path: remove /run/user/2221/netns/netns-db173238-0a50-83c3-bb71-ca4e7d30a28b: no such file or directory, failed to unmount NS: at /run/user/2221/netns/netns-db173238-0a50-83c3-bb71-ca4e7d30a28b: no such file or directory\""

Here are today's failures, plus one from January

Seen in: int(3)/sys(1) podman(4) fedora-39(2)/rawhide(2) root(2)/rootless(2) host(4) sqlite(4)
edsantiago commented 6 months ago

Still happening with brand-new (March 19) VMs. Three failures in just one CI run:

int podman fedora-38 root host boltdb

int podman fedora-39 rootless host sqlite

int podman rawhide root host sqlite

edsantiago commented 6 months ago

Is this (f39 rootless) the same error???

$ podman [options] stop --all -t 0
time="2024-03-20T12:25:48-05:00" \
    level=error \
    msg="Unable to clean up network for container SHA: \
        \"1 error occurred:\\n
        \\t* netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\\n\\n\""

I'm going to assume it is, and file it as such, unless told otherwise.

Luap99 commented 6 months ago

Is this (f39 rootless) the same error???

$ podman [options] stop --all -t 0
time="2024-03-20T12:25:48-05:00" \
    level=error \
    msg="Unable to clean up network for container SHA: \
        \"1 error occurred:\\n
        \\t* netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\\n\\n\""

I'm going to assume it is, and file it as such, unless told otherwise.

No, that is most likely something different.

edsantiago commented 6 months ago

My periodic ping. This seems to be happening a lot more with recent VMs

Seen in: int(18) podman(18) rawhide(10)/fedora-39(6)/fedora-38(2) rootless(10)/root(8) host(18) sqlite(16)/boltdb(2)
edsantiago commented 4 months ago

Where are we on this one? I just saw this failure, f40 root, on VMs from https://github.com/containers/automation_images/pull/349 with netavark 1.10.3-3:

# podman [options] stop --all -t 0
time="2024-05-08T08:40:48-05:00" level=error msg="IPAM error: failed to get ips for container ID 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69 on network podman"
time="2024-05-08T08:40:48-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman"
time="2024-05-08T08:40:48-05:00" level=error msg="netavark: open container netns: open /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: IO error: No such file or directory (os error 2)"
time="2024-05-08T08:40:48-05:00" level=error msg="Unable to clean up network for container 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69: \"unmounting network namespace for container 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69: failed to remove ns path: remove /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: no such file or directory, failed to unmount NS: at /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: no such file or directory\""

(error in this one is ENOENT, not EBUSY)

Luap99 commented 4 months ago

FYI: The reason you see this more is that I enabled the warnings check in AfterEach(): https://github.com/containers/podman/pull/18442

So previously we just didn't see them; in the logs above they all failed in AfterEach. As mentioned before and in other issues, the problem is that something tries to clean up twice, but I cannot see why or where this would be happening.
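
For readers not familiar with the test suite, the check enabled there is conceptually something like the following Ginkgo AfterEach (a hypothetical sketch using plain os/exec, not the actual e2e helpers):

package e2e_test

import (
	"bytes"
	"os/exec"

	ginkgo "github.com/onsi/ginkgo/v2"
	gomega "github.com/onsi/gomega"
)

// Teardown check: any error/warning printed by the cleanup command fails the
// test, which is what turned these previously silent cleanup errors into
// visible flakes.
var _ = ginkgo.AfterEach(func() {
	cmd := exec.Command("podman", "stop", "--all", "-t", "0")
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	gomega.Expect(cmd.Run()).To(gomega.Succeed())
	gomega.Expect(stderr.String()).ToNot(gomega.ContainSubstring("level=error"))
	gomega.Expect(stderr.String()).ToNot(gomega.ContainSubstring("level=warning"))
})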

edsantiago commented 4 months ago

Here's one with a lot more context, does that help? (Three total failures in this log, so be sure to Page-End then click on each individual failure)

Luap99 commented 4 months ago

Here's one with a lot more context, does that help? (Three total failures in this log, so be sure to Page-End then click on each individual failure)

Not really; all these messages are just complaining that they do not find what they would expect when asked to do cleanup. The weird part is the sequence: why does it fail only on pod rm? Before that we run stop --all -t 0, which should stop all the containers and do the cleanup. So why does pod rm do the cleanup again?

Luap99 commented 4 months ago

@edsantiago Can you spin this patch through your testing PR:

diff --git a/libpod/container_internal.go b/libpod/container_internal.go
index 7f909b78d..ce5dbe0ef 100644
--- a/libpod/container_internal.go
+++ b/libpod/container_internal.go
@@ -1396,13 +1396,13 @@ func (c *Container) stop(timeout uint) error {
        if cannotStopErr != nil {
                return cannotStopErr
        }
-       if !c.batched {
+       if !c.batched && timeout > 0 {
                c.lock.Unlock()
        }

        stopErr := c.ociRuntime.StopContainer(c, timeout, all)

-       if !c.batched {
+       if !c.batched && timeout > 0 {
                c.lock.Lock()
                if err := c.syncContainer(); err != nil {
                        if errors.Is(err, define.ErrNoSuchCtr) || errors.Is(err, define.ErrCtrRemoved) {

Not really a fix, as I don't understand the cause, but this is one place where we unlock during a critical section (with this patch the container lock stays held across the stop call when the timeout is 0). If we no longer see the issues with that, then at least I know where it starts to go wrong.

edsantiago commented 4 months ago

Good timing. I just patched & pushed it out. If I see ~5 runs without any of these flakes, I will report back. (Or, of course, if I see the flake even with this patch).

edsantiago commented 4 months ago

Sorry

Luap99 commented 4 months ago

Sorry

No need to be sorry; knowing that is good enough for me. And the fact that this fails on a simple podman create + start test is very good: I was suspecting some weird pod/container dependency issue or something with the restart policy always, but this seems to exclude such possibilities.
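
(For context, the kind of minimal sequence meant here is roughly the following; the image and container name are placeholders, not the actual test:)

podman create --name ctr <image> top
podman start ctr
podman stop --all -t 0     # the teardown step where the cleanup warnings appear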

edsantiago commented 4 months ago

The "pid file blah" message is new, does it help? In f39 rootless:

   $ podman [options] pod rm -fa -t 0
   time="2024-05-23T07:10:51-05:00" level=error msg="IPAM error: failed to get ips for container ID 5d282f67ae0c245f323779c5a57c579bd6de58bc7fbb688da46ffec96e8cc7b9 on network podman-default-kube-network"
   time="2024-05-23T07:10:51-05:00" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
>> time="2024-05-23T07:10:51-05:00" level=warning msg="failed to read rootless netns program pid: open /tmp/CI_E7Th/podman-e2e-1332300349/subtest-2709937137/runroot/networks/rootless-netns/rootless-netns-conn.pid: no such file or directory"
   time="2024-05-23T07:10:51-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
   time="2024-05-23T07:10:51-05:00" level=error msg="1 error occurred:\n\t* netavark: open container netns: open /run/user/4848/netns/netns-496a4e9f-0424-fab8-8bd4-9ffb106b0d86: IO error: No such file or directory (os error 2)\n\n"
   time="2024-05-23T07:10:51-05:00" level=error msg="Unable to clean up network for container 5d282f67ae0c245f323779c5a57c579bd6de58bc7fbb688da46ffec96e8cc7b9: \"unmounting network namespace for container 5d282f67ae0c245f323779c5a57c579bd6de58bc7fbb688da46ffec96e8cc7b9: failed to remove ns path: remove /run/user/4848/netns/netns-496a4e9f-0424-fab8-8bd4-9ffb106b0d86: no such file or directory, failed to unmount NS: at /run/user/4848/netns/netns-496a4e9f-0424-fab8-8bd4-9ffb106b0d86: no such file or directory\""

Luap99 commented 4 months ago

No not really

francoism90 commented 3 months ago

Any workarounds? I have the same issue.

Luap99 commented 3 months ago

Any workarounds? I have the same issue.

This is (mostly) a flake tracker for CI. If you have a reproducer for this issue, I would be very happy if you could share it; then maybe I can understand the issue and see if there is a workaround/fix.

francoism90 commented 3 months ago

@Luap99 It does indeed seem to happen when using a wrong configuration.

Sometimes the only workaround is to fully reboot the machine, as the rootlessport process isn't being killed or even killable.

Luap99 commented 3 months ago

@Luap99 It does indeed seem to happen when using a wrong configuration.

Sometimes the only workaround is to fully reboot the machine, as the rootlessport process isn't being killed or even killable.

Can you share podman commands? What does "wrong configuration" mean? I also don't see how this is related to rootlessport. The rootlessport process should exit on its own after conmon exits. If it does not do that, then it is a bug; however, it should always be killable via SIGKILL, I assume.

francoism90 commented 3 months ago

@Luap99 I'm using Podman Quadlet.

The problem occurs when a container fails to run at some point during startup, e.g. because you accidentally made a configuration mistake.

It's possible to stop the container, but for some reason the network process doesn't know what to do anymore, and you'll end up with the IPAM error or the "cannot clean up network" error.

The only way out is to kill all the network-related processes for Podman, but even that may not be enough. After rebooting the machine, the container and its configured network start just fine.

I haven't used Docker rootless on Linux for a long time, but I can remember the same problem. It just seems to lose track of the networking and doesn't know what is actually killed or not. As I said before, the rootlessport processes stay open, even when you stop the container.

francoism90 commented 3 months ago

An example that triggers this is using a webserver image, e.g. nginx-alpine:latest. Make sure it uses a network, runs rootless, and that other containers that communicate with it use requires=nginx as a service option.

Add a site/server block in the nginx container and make a mistake in its configuration, for example listen 8080 instead of listen 8080; (note the missing trailing ;). Something like that can easily be overlooked. I don't know whether an invalid proxy address also triggers it.

When you now start the container(s), the main process will crash because of the configuration error. So you stop the container and fix the configuration, but when you start it again, you'll end up with network errors.

Sometimes it can be fixed by killing all the processes, like rootlessport and things connected to it, but sometimes you really have to reboot to recover. After a reboot everything works as expected.
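
To illustrate the kind of typo described above, a hypothetical nginx server block (names and ports are examples only):

server {
    listen 8080            # broken: missing trailing ';', nginx fails to start
    # listen 8080;         # correct form
    server_name example.local;

    location / {
        proxy_pass http://127.0.0.1:3000;
    }
}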

This is my full config, if you're interested: https://github.com/foxws/foxws/tree/main/podman

edsantiago commented 2 months ago

This one has been happening A LOT in my no-retry PR. And, in case it helps, also in my parallel-system-tests PR.

Seen in: int(76)/sys(2) podman(78) rawhide(26)/fedora-39(24)/debian-13(12)/fedora-40(10)/fedora-38(6) rootless(40)/root(38) host(71)/container(7) sqlite(65)/boltdb(13)
edsantiago commented 2 months ago

Just as an FYI, this is the past week's worth of this bug, counting only the instances where "netns ref counter out of sync" does not appear in the error message:

Seen in: int(12) podman(11)/remote(1) fedora-40(6)/rawhide(3)/debian-13(3) root(9)/rootless(3) host(9)/container(3) sqlite(12)
edsantiago commented 2 months ago

Funny observation about this one, and I'm not 100% certain, but I think that when I see this one (the one without "netns sync"), (1) it happens in pairs on the same CI job, and (2) I see failed to get ips. See here and compare the two failures. HTH.

Luap99 commented 2 months ago

Note that netns ref counter out of sync is a rootless-only thing, so you should never see that as root regardless. I see some issues around the ref counter going sideways on errors, but that needs some root-cause error first; this one error seems like a good candidate for that, which then cascades into further errors such as the missing IPAM entries, etc...

Luap99 commented 1 month ago

Fixed in https://github.com/containers/podman/pull/23519