Is the aardvark-dns process still running when the DNS stops working? Does ss -tulpn show it listening on port 53?
Yes in both cases:
❯ ss -tulpn | grep :53
udp UNCONN 213248 0 10.89.0.1:53 0.0.0.0:* users:(("aardvark-dns",pid=42323,fd=11))
udp UNCONN 0 0 127.0.0.54:53 0.0.0.0:* users:(("systemd-resolve",pid=973,fd=19))
udp UNCONN 0 0 127.0.0.53%lo:53 0.0.0.0:* users:(("systemd-resolve",pid=973,fd=17))
udp UNCONN 0 0 0.0.0.0:5355 0.0.0.0:* users:(("systemd-resolve",pid=973,fd=10))
udp UNCONN 0 0 [::]:5355 [::]:* users:(("systemd-resolve",pid=973,fd=12))
tcp LISTEN 0 4096 127.0.0.54:53 0.0.0.0:* users:(("systemd-resolve",pid=973,fd=20))
tcp LISTEN 0 1024 10.89.0.1:53 0.0.0.0:* users:(("aardvark-dns",pid=42323,fd=12))
tcp LISTEN 0 4096 127.0.0.53%lo:53 0.0.0.0:* users:(("systemd-resolve",pid=973,fd=18))
tcp LISTEN 0 4096 0.0.0.0:5355 0.0.0.0:* users:(("systemd-resolve",pid=973,fd=11))
tcp LISTEN 0 4096 [::]:5355 [::]:* users:(("systemd-resolve",pid=973,fd=13))
❯ ps -aux |grep aardvark
root 42323 0.0 0.0 276424 3384 ? Ssl Aug27 0:00 /usr/libexec/podman/aardvark-dns --config /run/containers/networks/aardvark-dns -p 53 run
udp UNCONN 213248 0 10.89.0.1:53 0.0.0.0:* users:(("aardvark-dns",pid=42323,fd=11))
The Recv-Q value seems way too high, which suggests we no longer read anything off the socket, so this looks like some form of logic bug.
If you drop the -l from the ss call, so ss -tupn, what open connections do you see for aardvark-dns?
There is only one active connection:
❯ ss -tupn | grep aardvark
tcp ESTAB 0 0 10.89.0.1:53 10.89.0.23:40292 users:(("aardvark-dns",pid=42323,fd=5))
Can you check what container uses 10.89.0.23? podman network inspect <network_name> should show you the attached containers with their IP addresses listed. It seems odd that this TCP connection stays open for so long.
This makes sense to me then: we wait asynchronously for either an incoming UDP or TCP connection, so we never process two connections at the same time. As long as the TCP connection doesn't send any data, we simply do nothing. That is most likely something we should fix.
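Roughly, the pattern looks like this (a minimal sketch, assuming a tokio-style server; the names and buffer sizes are illustrative, not the actual aardvark-dns source):

use tokio::io::AsyncReadExt;
use tokio::net::{TcpListener, UdpSocket};

async fn serve(udp: UdpSocket, tcp: TcpListener) -> std::io::Result<()> {
    let mut udp_buf = [0u8; 512];
    loop {
        tokio::select! {
            res = udp.recv_from(&mut udp_buf) => {
                let (_len, _peer) = res?;
                // ... parse the query and answer over UDP ...
            }
            res = tcp.accept() => {
                let (mut stream, _peer) = res?;
                let mut tcp_buf = [0u8; 512];
                // BUG: awaiting the read inline stalls this whole loop
                // (UDP included) until the client sends data, so a client
                // that connects and then stays silent wedges the server.
                let _n = stream.read(&mut tcp_buf).await?;
                // ... parse the query and answer over TCP ...
            }
        }
    }
}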
But it would be good to know where it hangs. Can you run gdb -p <aardvark-dns-pid> -ex="thread apply all bt" -batch and show me the output?
Also, it sounds like there is a tcpkill tool that you could try to use to close the open TCP connection; I would expect that to make things work again.
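For example (hedged: tcpkill comes from the dsniff package, and the bridge interface name podman1 here is an assumption you would need to adjust to your network):

❯ tcpkill -i podman1 host 10.89.0.23 and port 53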
10.89.0.23 is occupied by the STALWART container:
"6e83fc72587b6d61da0f907672570afb29f7ccd1275ff9454033646926f8fc11": {
"name": "stalwart",
"interfaces": {
"eth0": {
"subnets": [
{
"ipnet": "10.89.0.23/24",
"gateway": "10.89.0.1"
}
],
"mac_address": "8a:6c:1b:4c:d8:23"
}
}
},
I don't have GDB installed on the machine. I will install it, which requires a machine restart (rpm-ostree), and come back with the output after it "hangs" again.
You can start a container with --pid=host --privileged and install/use gdb there.
podman run --pid=host --privileged haggaie/gdb gdb -p 42323 -ex="thread apply all bt" -batch
[New LWP 42324]
[New LWP 42325]
[New LWP 42326]
[New LWP 42327]
warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.
0x00007fbd86e7a3dd in ?? ()
Thread 5 (LWP 42327):
#0 0x00007fbd86e7a3dd in ?? ()
#1 0x0000558c7ec353c2 in ?? ()
#2 0x00000000ffffffff in ?? ()
#3 0x0000000000000000 in ?? ()
Thread 4 (LWP 42326):
#0 0x00007fbd86e7a3dd in ?? ()
#1 0x0000558c7ec353c2 in ?? ()
#2 0x00000000ffffffff in ?? ()
#3 0x0000000000000000 in ?? ()
Thread 3 (LWP 42325):
#0 0x00007fbd86e7ca32 in ?? ()
#1 0x00007fbd867ff890 in ?? ()
#2 0xffffffff7ebb93b2 in ?? ()
#3 0x0000558cab1380e0 in ?? ()
#4 0x0000000400000400 in ?? ()
#5 0x00007fbd867ff7f0 in ?? ()
#6 0x0000558c7ebc3a2f in ?? ()
#7 0x0000000000000000 in ?? ()
Thread 2 (LWP 42324):
#0 0x00007fbd86e7a3dd in ?? ()
#1 0x0000558c7ec353c2 in ?? ()
#2 0x00000000ffffffff in ?? ()
#3 0x0000000000000000 in ?? ()
Thread 1 (LWP 42323):
#0 0x00007fbd86e7a3dd in ?? ()
#1 0x0000558c7ec353c2 in ?? ()
#2 0x00007ffeffffffff in ?? ()
#3 0x0000000000000000 in ?? ()
[Inferior 1 (process 42323) detached]
I've stopped the stalwart container, and that immediately fixed the DNS issues. Now I'm wondering why it hangs after some time. It used to work flawlessly.
podman run --pid=host --privileged haggaie/gdb gdb -p 42323 -ex="thread apply all bt" -batch ...
Oh sorry, I think you must make sure to use the exact same Fedora version image (fedora:40) and then install gdb there, so that the linker and such match.
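Something like this should work (assuming the stock fedora:40 image, which ships bash and dnf by default):

❯ podman run --rm --pid=host --privileged fedora:40 bash -c 'dnf install -y gdb && gdb -p 42323 -ex="thread apply all bt" -batch'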
I've stopped the stalwart container, and that immediately fixed the DNS issues. Now I'm wondering why it hangs after some time. It used to work flawlessly.
Large parts of aardvark-dns were rewritten by me for 1.12; most importantly, aardvark-dns didn't even support TCP connections at all before. It is not clear to me why the TCP connection stays open, and it may be our fault or the client's, but either way we need to fix this in aardvark-dns, because a single client should never be allowed to make the server non-functional.
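Conceptually, the kind of fix this implies is to hand each accepted TCP connection to its own task with a read timeout, so one idle client can no longer stall the accept loop. A minimal sketch (illustrative only; the names and the 3-second timeout are assumptions, not the actual patch):

use std::time::Duration;
use tokio::io::AsyncReadExt;
use tokio::net::TcpListener;

async fn accept_loop(tcp: TcpListener) -> std::io::Result<()> {
    loop {
        let (mut stream, _peer) = tcp.accept().await?;
        // Each connection gets its own task; the accept loop keeps running.
        tokio::spawn(async move {
            let mut buf = [0u8; 512];
            // Bound how long a client may sit silent before we drop it.
            match tokio::time::timeout(Duration::from_secs(3), stream.read(&mut buf)).await {
                Ok(Ok(_n)) => { /* parse the query and answer over TCP */ }
                _ => { /* read error or timeout: drop the connection */ }
            }
        });
    }
}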
I know how to reproduce the TCP hang myself, so I do not need the full stack trace from you. I will move the issue to the aardvark-dns repo, as it is a bug there.
Thank you very much for your help. Let me know if I can assist further in any way. Any workaround for now would be more than welcome :)
Issue Description
The issue I have is that my podman containers stop resolving internal and external DNS after some time (~1h). If I restart podman entirely or reboot the system, I can resolve all of the DNS records and ping between containers or to the external network. After ~1h I can no longer resolve DNS, ping between containers, or reach the outside network.
I'm running a named bridge network.
The issue started to show up after upgrading podman 5:5.1.2-1.fc40 -> 5:5.2.1-1.fc40 and, perhaps most importantly, netavark 1.11.0-1.fc40 -> 2:1.12.1-1.fc40 and aardvark-dns 1.11.0-1.fc40 -> 2:1.12.1-1.fc40.
Steps to reproduce the issue
Describe the results you received
Right after container start:
After ~1h:
journalctl contains the following entries:
Describe the results you expected
Network working all the time
podman info output
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
No
Additional environment details
Running Fedora IoT 40 (latest). Running through compose.
Additional information