containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

podman stop: Unable to clean up network: netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied #22103

Closed edsantiago closed 7 months ago

edsantiago commented 8 months ago

This is one of those nasty ones that hides in logs, making it impossible for me to get full data.

Best I can tell, the first instance was Feb 9, in rawhide rootless. Seen also in f39 root.

$ podman [options] stop --all -t 0
time="2024-03-20T12:25:48-05:00" level=error msg="Unable to clean up network for container SHA: \"1 error occurred:\\n\\t* netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\\n\\n\""

Incomplete list below. There are maybe 3-4 others, but it is way too hard to get a complete list.

x       x          x             x            x        x
int(3)  podman(3)  fedora-39(2)  root(2)      host(3)  sqlite(3)
                   rawhide(1)    rootless(1)

Luap99 commented 8 months ago

I don't get why it would fail with EACCES even as root. These are the only two lines that could fail: https://github.com/containers/netavark/blob/cc3f35d2e87defa2e12d0ffeb59a57035e8a5902/src/dns/aardvark.rs#L131-L132

And I really do not see why this would fail with anything other than ENOENT, which is already ignored by the code. I can see that EACCES might happen as rootless in the case where the aardvark pid was already reused by another process that we do not have privileges on, but as root that can never be the case.
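To make the policy described above concrete, here is a minimal sketch in Python. It is not the netavark code (that is Rust, at the link above). It assumes the check amounts to reading the `/proc/<pid>/ns/net` link, and the names `probe_netns` and `only_enoent_ignored` are hypothetical.

```python
import errno
import os


def probe_netns(pid, readlink=os.readlink):
    """Return None if /proc/<pid>/ns/net is readable, else the errno raised.

    `readlink` is injectable so the policy can be exercised without a real race.
    """
    try:
        readlink(f"/proc/{pid}/ns/net")
    except OSError as exc:
        return exc.errno
    return None


def only_enoent_ignored(err):
    """Assumed policy: a vanished process (ENOENT) is the only ignorable failure."""
    return err == errno.ENOENT
```

Under this policy an EACCES from the probe is not ignored and ends up as the error quoted in the report, while ENOENT is silently dropped.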

Luap99 commented 8 months ago

OK, I guess we need to ignore more errors. I am using something like this to reproduce the logic easily:

$ while :; do sleep 10 & kill -HUP $! && ls -l /proc/$!/ns/net 2>&1 | tee /dev/stderr | grep -E "No such file or directory|net:" || break ; done

I wrongly assumed that the only possible error was ENOENT. However, while running this several times I also saw ESRCH and, importantly, the EACCES reported here.

So at this point I wonder whether we should simply ignore all errors. This check is only a nice-to-have that makes us aware of an inconsistent aardvark-dns vs rootless-netns state: https://github.com/containers/podman/issues/20396.
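A rough sketch of what "ignore all errors" could look like, again in Python rather than netavark's Rust. It assumes the check compares the `/proc/<pid>/ns/net` link against `/proc/self/ns/net`. The function name `same_netns_best_effort` and the logger name are hypothetical.

```python
import logging
import os

log = logging.getLogger("netns-check")


def same_netns_best_effort(pid, readlink=os.readlink):
    """Compare our netns link with the one belonging to `pid`.

    Returns True or False when both links can be read, and None when the
    probe itself fails for any reason (ENOENT, ESRCH, EACCES, ...).
    """
    try:
        theirs = readlink(f"/proc/{pid}/ns/net")
        ours = readlink("/proc/self/ns/net")
    except OSError as exc:
        # The check is only a nice-to-have, so a failed probe is logged and skipped.
        log.debug("skipping netns check for pid %d: %s", pid, exc)
        return None
    return theirs == ours
```

With this shape a caller would only report an inconsistency when the result is `False`. A `None` result means the race was lost and cleanup simply continues.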

edsantiago commented 7 months ago

ping

x       x          x             x            x        x
int(3)  podman(3)  rawhide(2)    rootless(2)  host(3)  sqlite(3)
                   fedora-39(1)  root(1)

Luap99 commented 7 months ago

https://github.com/containers/netavark/pull/956

edsantiago commented 2 months ago

Looks like the same bug, except ENOENT instead of EACCES:

# podman [options] stop --all -t 0
[cid1]
Error: removing container [cid2] network: netavark: remove aardvark entries: failed to get aardvark pid: IO error: No such file or directory (os error 2)

In f40 root. File a new bug, or reopen this one?

Luap99 commented 2 months ago

I saw that earlier. We can reopen this, but on stop it works differently, and I very much fear that there is no way around these races until https://github.com/containers/aardvark-dns/issues/338 is addressed.