Maybe let's take a step back first and try to reproduce with something simple. Ideally reboot first to prevent any temporary state from interfering.
After the reboot, does the following work?
$ podman network create testnet
testnet
$ podman run --network testnet --name c1 -d alpine sleep inf
dc517d9f188e51479e127776549d120fc117443e0920b1b74cd86da3e310b0de
$ podman run --network testnet --rm alpine nslookup c1
Server: 10.89.0.1
Address: 10.89.0.1:53
Non-authoritative answer:
Name: c1.dns.podman
Address: 10.89.0.2
Non-authoritative answer:
$ podman run --network testnet --rm alpine nslookup google.com
Server: 10.89.0.1
Address: 10.89.0.1:53
Non-authoritative answer:
Name: google.com
Address: 142.250.184.206
Non-authoritative answer:
Name: google.com
Address: 2a00:1450:4001:812::200e
If you suspect namespace issues then you should check this:
$ podman unshare --rootless-netns readlink /proc/self/ns/net
net:[4026536046]
$ readlink /proc/$(pgrep aardvark-dns)/ns/net
net:[4026536046]
The values must match; otherwise it would be totally expected that DNS fails.
In general, adding --log-level debug to all your podman commands should give many more clues about what is going on during network setup. aardvark-dns logs to the syslog, so if you run systemd you should be able to see its messages with journalctl --user SYSLOG_IDENTIFIER=aardvark-dns.
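For example, something along these lines (just a sketch; the names match the repro above, and the log file is simply a convenient place to collect the debug output):
$ podman --log-level=debug run --network testnet --rm alpine nslookup c1 2> podman-debug.log
$ journalctl --user -f SYSLOG_IDENTIFIER=aardvark-dns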
Thanks for engaging, sometimes it gives enough of a boost when someone asks. :)
I don't want to reset / reboot my system so we can hopefully actually figure out what's going on here, but it may be irrelevant.
2. terminal 1:
$ podman network create testnet
testnet
$ podman run --rm --network testnet --name c1 quay.io/libpod/alpine sleep 1h
...
3. terminal 2:
$ podman run -it --rm --network testnet --rm quay.io/libpod/alpine /bin/sh
/ # nslookup c1
nslookup: can't resolve '(null)': Name does not resolve
nslookup: can't resolve 'c1': Try again
/ # cat /etc/resolv.conf
search dns.podman
nameserver 10.89.2.1
options edns0
$ lsns -T -t net
NS TYPE NPROCS PID USER NETNSID NSFS COMMAND
4026531837 user       0         root
├─4026531840 net     81     904 nixos unassigned      /nix/store/q182qd78kvlijc25fas2v1bwn3gh670k-shell2http-1.16.0/bin/shell2http -include-stderr -show-errors -port 55555 -form /v1/msg
└─4026532470 user     0         nixos
  ├─4026532471 net    1 1390375 nixos unassigned      sleep 1h
  ├─4026532537 net    2 3926339 nixos unassigned      /nix/store/kxkdrxvc3da2dpsgikn8s2ml97h88m46-bash-interactive-5.2-p15/bin/bash
  ├─4026532598 net    3  171765 nixos unassigned      /nix/store/kxkdrxvc3da2dpsgikn8s2ml97h88m46-bash-interactive-5.2-p15/bin/bash
  ├─4026532661 net    1 4065741 nixos unassigned      /nix/store/z56sdzjn2s70204pc66bspdp3xi21pfr-podman-4.5.0/libexec/podman/aardvark-dns --config /tmp/containers-user-1000/containers/
  └─4026532797 net    1 1394610 nixos unassigned      /bin/sh
$ podman unshare --rootless-netns /bin/sh
sh-5.2# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: tap0: <BROADCAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 6e:3f:b1:6a:cb:2a brd ff:ff:ff:ff:ff:ff
3: podman3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 4e:7f:07:a6:d7:c8 brd ff:ff:ff:ff:ff:ff
4: veth0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master podman3 state UP mode DEFAULT group default qlen 1000
    link/ether 7e:b4:61:d6:f5:67 brd ff:ff:ff:ff:ff:ff link-netnsid 0
12: veth1@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master podman3 state UP mode DEFAULT group default qlen 1000
    link/ether 4e:7f:07:a6:d7:c8 brd ff:ff:ff:ff:ff:ff link-netnsid 1
$ podman unshare
[root@nixos:~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 96:00:02:86:80:f6 brd ff:ff:ff:ff:ff:ff
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4. terminal 2 continued (executing workaround):
/ # exit
$ pkill -9 aardvark-dns
$ podman run -it --rm --network testnet --rm quay.io/libpod/alpine nslookup c1
nslookup: can't resolve '(null)': Name does not resolve
Name:      c1
Address 1: 10.89.2.11 c21753cadc57
5. terminal 3 continued (as you can see, aardvark now has interfaces):
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: tap0: <BROADCAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 6e:3f:b1:6a:cb:2a brd ff:ff:ff:ff:ff:ff
3: podman3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 7e:b4:61:d6:f5:67 brd ff:ff:ff:ff:ff:ff
4: veth0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master podman3 state UP mode DEFAULT group default qlen 1000
    link/ether 7e:b4:61:d6:f5:67 brd ff:ff:ff:ff:ff:ff link-netnsid 0
OK, I checked the systemd --user journal output that you suggested, and it probably tells us what we'd expect given the above, but it's something! (It spammed a decent amount.)
Oct 18 14:31:49 nixos aardvark-dns[4065741]: Unable to start server unable to start CoreDns server: Cannot assign requested address (os error 99)
OK, your environment looks wrong then; somehow it mismatches namespaces, which shouldn't be possible.
At the very least I see /tmp/containers-user-1000/containers/..., which never looks right if you run on a systemd distro. This really should be /run/user/$UID.... You need a proper systemd login session if you run rootless. Then run podman system reset
to clean everything, after which podman info should report /run/user/$UID under runroot.
This isn't to say that this is the bug or helps here, but in general I have seen so many systems with an incorrect runroot where things simply break.
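A quick way to verify (the --format path is an assumption about your podman version's info template):
$ podman info --format '{{.Store.RunRoot}}'
On a systemd distro with a proper login session this should print /run/user/$UID/containers.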
Also add --log-level debug to all the podman commands; there must be a "Cleaning up rootless network namespace" message somewhere if it removed the rootless-netns, but that should never trigger if at least one container is still running.
Yes, I'm running a relatively minimal setup. I'm not familiar with all the details, but I may have "had" to run loginctl enable-linger myuser manually (I don't remember if that was in one of the nested contexts I was working on). For whatever reason, $XDG_RUNTIME_DIR isn't set, which I assume is what points to /run/user/....
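(For reference, assuming systemd-logind is in use, the session/linger state can be checked with something like:
$ loginctl show-user "$USER" --property=Linger --property=RuntimePath
$ echo "$XDG_RUNTIME_DIR"
where RuntimePath and $XDG_RUNTIME_DIR would normally both point at /run/user/$UID.)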
Sometimes there do seem to be problems with taking down network namespaces. Given what you said, I expect it's related.
Also add --log-level debug to all the podman commands, there must be a Cleaning up rootless network namespace message somewhere if it removed the rootless-netns but that should never trigger if at least one container is still running.
So where are you suggesting I should run this? On the workaround restart command?
If nothing else comes out of this, if some additional diagnostics end up added to podman that can help with such situations, I think that would be a good outcome.
Also add --log-level debug to all the podman commands, there must be a Cleaning up rootless network namespace message somewhere if it removed the rootless-netns but that should never trigger if at least one container is still running.
So where are you suggesting I should run this? On the workaround restart command?
On every single podman command; you need to piece the full picture together. The most important one would be the first podman command that ends up creating aardvark-dns in the wrong namespace. But you really need to see where that happens.
Having an environment where XDG_RUNTIME_DIR is sometimes set and sometimes not can certainly confuse things.
The reason I didn't post any debug logs is that I didn't find anything useful in them earlier. Nevertheless it's likely I missed something.
In the meantime, here is the network section of bringup of the first container after stopping all containers, I can give you more of the preceding logs if you need it.
My problem is that, to my very limited understanding, aardvark basically runs in its own container, right? How can we see the aardvark container's bringup?
time="2023-10-18T16:07:08Z" level=debug msg="Made network namespace at /run/user/1000/netns/netns-6cc27332-91e6-e96d-3109-514fee50bda0 for container 03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601"
time="2023-10-18T16:07:08Z" level=debug msg="Cached value indicated that volatile is being used"
time="2023-10-18T16:07:08Z" level=debug msg="overlay: mount_data=lowerdir=/home/nixos/.local/share/containers/storage/overlay/l/E4NHQJKSZABVS6PNDXVQC2RUC5,upperdir=/home/nixos/.local/share/containers/storage/overlay/7aaa2baf0081f35a80c67ae3447d09eece1d46f235f7c554b1c973fef6afaea8/diff,workdir=/home/nixos/.local/share/containers/storage/overlay/7aaa2baf0081f35a80c67ae3447d09eece1d46f235f7c554b1c973fef6afaea8/work,,userxattr,volatile"
time="2023-10-18T16:07:08Z" level=debug msg="creating rootless network namespace with name \"rootless-netns-241895256a9b790b13e2\""
time="2023-10-18T16:07:08Z" level=debug msg="Mounted container \"03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601\" at \"/home/nixos/.local/share/containers/storage/overlay/7aaa2baf0081f35a80c67ae3447d09eece1d46f235f7c554b1c973fef6afaea8/merged\""
time="2023-10-18T16:07:08Z" level=debug msg="Created root filesystem for container 03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601 at /home/nixos/.local/share/containers/storage/overlay/7aaa2baf0081f35a80c67ae3447d09eece1d46f235f7c554b1c973fef6afaea8/merged"
time="2023-10-18T16:07:08Z" level=debug msg="slirp4netns command: /nix/store/czkc4zy7f213q8a0la8r59yranl7yw8q-slirp4netns-1.2.0/bin/slirp4netns --disable-host-loopback --mtu=65520 --enable-sandbox --enable-seccomp --enable-ipv6 -c -r 3 --netns-type=path /run/user/1000/netns/rootless-netns-241895256a9b790b13e2 tap0"
time="2023-10-18T16:07:08Z" level=debug msg="The path of /etc/resolv.conf in the mount ns is \"/etc/resolv.conf\""
[DEBUG netavark::network::validation] "Validating network namespace..."
[DEBUG netavark::commands::setup] "Setting up..."
[INFO netavark::firewall] Using iptables firewall driver
[DEBUG netavark::network::bridge] Setup network testnet
[DEBUG netavark::network::bridge] Container interface name: eth0 with IP addresses [10.89.2.27/24]
[DEBUG netavark::network::bridge] Bridge name: podman3 with IP addresses [10.89.2.1/24]
[DEBUG netavark::network::core_utils] Setting sysctl value for net.ipv4.ip_forward to 1
[DEBUG netavark::network::core_utils] Setting sysctl value for /proc/sys/net/ipv6/conf/eth0/autoconf to 0
[INFO netavark::network::netlink] Adding route (dest: 0.0.0.0/0 ,gw: 10.89.2.1, metric 100)
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-C5964C6794A11 created on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK_FORWARD created on table filter
[DEBUG netavark::firewall::varktables::helpers] rule -d 10.89.2.0/24 -j ACCEPT created on table nat and chain NETAVARK-C5964C6794A11
[DEBUG netavark::firewall::varktables::helpers] rule ! -d 224.0.0.0/4 -j MASQUERADE created on table nat and chain NETAVARK-C5964C6794A11
[DEBUG netavark::firewall::varktables::helpers] rule -s 10.89.2.0/24 -j NETAVARK-C5964C6794A11 created on table nat and chain POSTROUTING
[DEBUG netavark::firewall::varktables::helpers] rule -d 10.89.2.0/24 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT created on table filter and chain NETAVARK_FORWARD
[DEBUG netavark::firewall::varktables::helpers] rule -s 10.89.2.0/24 -j ACCEPT created on table filter and chain NETAVARK_FORWARD
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-SETMARK created on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-MASQ created on table nat
[DEBUG netavark::firewall::varktables::helpers] chain NETAVARK-HOSTPORT-DNAT created on table nat
[DEBUG netavark::firewall::varktables::helpers] rule -j MARK --set-xmark 0x2000/0x2000 created on table nat and chain NETAVARK-HOSTPORT-SETMARK
[DEBUG netavark::firewall::varktables::helpers] rule -j MASQUERADE -m comment --comment 'netavark portfw masq mark' -m mark --mark 0x2000/0x2000 created on table nat and chain NETAVARK-HOSTPORT-MASQ
[DEBUG netavark::firewall::varktables::helpers] rule -j NETAVARK-HOSTPORT-DNAT -m addrtype --dst-type LOCAL created on table nat and chain PREROUTING
[DEBUG netavark::firewall::varktables::helpers] rule -j NETAVARK-HOSTPORT-DNAT -m addrtype --dst-type LOCAL created on table nat and chain OUTPUT
[DEBUG netavark::commands::setup] {
"testnet": StatusBlock {
dns_search_domains: Some(
[
"dns.podman",
],
),
dns_server_ips: Some(
[
10.89.2.1,
],
),
interfaces: Some(
{
"eth0": NetInterface {
mac_address: "96:65:ee:64:b2:57",
subnets: Some(
[
NetAddress {
gateway: Some(
10.89.2.1,
),
ipnet: 10.89.2.27/24,
},
],
),
},
},
),
},
}
[DEBUG netavark::commands::setup] "Setup complete"
time="2023-10-18T16:07:08Z" level=debug msg="Adding nameserver(s) from network status of '[\"10.89.2.1\"]'"
time="2023-10-18T16:07:08Z" level=debug msg="Adding search domain(s) from network status of '[\"dns.podman\"]'"
time="2023-10-18T16:07:08Z" level=debug msg="/etc/system-fips does not exist on host, not mounting FIPS mode subscription"
time="2023-10-18T16:07:08Z" level=debug msg="Setting Cgroups for container 03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601 to user.slice:libpod:03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601"
time="2023-10-18T16:07:08Z" level=debug msg="reading hooks from /usr/share/containers/oci/hooks.d"
time="2023-10-18T16:07:08Z" level=debug msg="Workdir \"/\" resolved to host path \"/home/nixos/.local/share/containers/storage/overlay/7aaa2baf0081f35a80c67ae3447d09eece1d46f235f7c554b1c973fef6afaea8/merged\""
time="2023-10-18T16:07:08Z" level=debug msg="Created OCI spec for container 03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601 at /home/nixos/.local/share/containers/storage/overlay-containers/03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601/userdata/config.json"
time="2023-10-18T16:07:08Z" level=debug msg="/nix/store/1y223352pm0laijhsn7lca66qbnrg78h-conmon-2.1.7/bin/conmon messages will be logged to syslog"
time="2023-10-18T16:07:08Z" level=debug msg="running conmon: /nix/store/1y223352pm0laijhsn7lca66qbnrg78h-conmon-2.1.7/bin/conmon" args="[--api-version 1 -c 03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601 -u 03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601 -r /nix/store/54sfyqak7sn57z9gdhx4dydj7564lh6b-crun-1.8.4/bin/crun -b /home/nixos/.local/share/containers/storage/overlay-containers/03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601/userdata -p /tmp/containers-user-1000/containers/overlay-containers/03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601/userdata/pidfile -n c1 --exit-dir /tmp/podman-run-1000/libpod/tmp/exits --full-attach -s -l journald --log-level debug --syslog --conmon-pidfile /tmp/containers-user-1000/containers/overlay-containers/03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601/userdata/conmon.pid --exit-command /nix/store/z56sdzjn2s70204pc66bspdp3xi21pfr-podman-4.5.0/bin/.podman-wrapped --exit-command-arg --root --exit-command-arg /home/nixos/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /tmp/containers-user-1000/containers --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /tmp/podman-run-1000/libpod/tmp --exit-command-arg --network-config-dir --exit-command-arg --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /home/nixos/.local/share/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg boltdb --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg 03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601]"
time="2023-10-18T16:07:08Z" level=info msg="Running conmon under slice user.slice and unitName libpod-conmon-03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601.scope"
[conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied
time="2023-10-18T16:07:08Z" level=debug msg="Received: 1735264"
time="2023-10-18T16:07:08Z" level=info msg="Got Conmon PID as 1735262"
time="2023-10-18T16:07:08Z" level=debug msg="Created container 03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601 in OCI runtime"
time="2023-10-18T16:07:08Z" level=debug msg="Attaching to container 03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601"
time="2023-10-18T16:07:08Z" level=debug msg="Starting container 03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601 with command [sleep 1h]"
time="2023-10-18T16:07:08Z" level=debug msg="Started container 03769c3dac73954eac5795a72b560605edac7046fc2a6d4838e3b9b4263fe601"
time="2023-10-18T16:07:08Z" level=debug msg="Notify sent successfully"
Also add --log-level debug to all the podman commands, there must be a Cleaning up rootless network namespace message somewhere if it removed the rootless-netns but that should never trigger if at least one container is still running.
Sorry, I missed some of the details on this. If anything is doing any removal, then hopefully it's the command after the kill, because not much seems to show up after the first one. I'll take a look.
The only other process that manages this is the lingering "podman pause" one, which is what does the background management, I think? That's probably what we would need to look at, but I don't know how to monitor it.
The log says (a) it created the rootless-netns and then we see the netavark setup, but (b) I never see the log line where it says it is starting aardvark-dns. That means that aardvark-dns was still running and kept running in the old, invalid namespace. This is likely the result of incorrect aardvark-dns config files. Once your containers are stopped, aardvark-dns should exit on its own. This will however not happen if there are still entries in the config file.
So you should make sure that, when all the containers are stopped, aardvark-dns exits. If it does not, remove everything in
/run/user/$UID/containers/networks/aardvark-dns/
(or, I guess in your case, /tmp/containers-user-1000/containers/networks/aardvark-dns/),
then kill it; that should make it work again.
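A minimal sketch of that check/cleanup, using the runroot from your environment (adjust the path for a proper setup):
$ podman ps -q                # should print nothing if all containers are stopped
$ pgrep -a aardvark-dns       # ideally also nothing; if a process is still listed:
$ rm -r /tmp/containers-user-1000/containers/networks/aardvark-dns/
$ pkill aardvark-dns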
Yes that is correct, an aardvark process remains running. I thought this was deliberate.
$ podman stop c1
WARN[0010] StopSignal SIGTERM failed to stop container c1 in 10 seconds, resorting to SIGKILL
c1
$ podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
$ ps aux | grep aardvark
...
nixos 1732299 0.0 0.0 551560 932 pts/53 S+ 16:06 0:00 journalctl --user -f SYSLOG_IDENTIFIER=aardvark-dns
...
nixos 1804003 0.0 0.0 614260 1840 ? Ssl 16:27 0:00 /nix/store/z56sdzjn2s70204pc66bspdp3xi21pfr-podman-4.5.0/libexec/podman/aardvark-dns --config /tmp/containers-user-1000/containers/networks/aardvark-dns -p 53 run
nixos 1822545 0.0 0.0 6476 2380 pts/51 S+ 16:33 0:00 grep aardvark
$ ls /tmp/containers-user-1000/containers/networks/aardvark-dns
aardvark.pid infra
$ podman network inspect infra
[
{
"name": "infra",
"id": "5e631c60894333a590d08a3b8a340bcd20b2b675e6e0809dc3b99d8ce35e82bd",
"driver": "bridge",
"network_interface": "podman2",
"created": "2023-09-30T07:26:21.117031605Z",
"subnets": [
{
"subnet": "10.89.1.0/24",
"gateway": "10.89.1.1"
}
],
"ipv6_enabled": false,
"internal": false,
"dns_enabled": true,
"labels": {
"com.docker.compose.project": "infra",
"io.podman.compose.project": "infra"
},
"ipam_options": {
"driver": "host-local"
}
}
]
$ podman network inspect testnet
[
{
"name": "testnet",
"id": "3f22fc652a2f266448c85b05d135794847214851d95934ad60b64caa548c66c8",
"driver": "bridge",
"network_interface": "podman3",
"created": "2023-10-18T14:17:38.333271696Z",
"subnets": [
{
"subnet": "10.89.2.0/24",
"gateway": "10.89.2.1"
}
],
"ipv6_enabled": false,
"internal": false,
"dns_enabled": true,
"ipam_options": {
"driver": "host-local"
}
}
]
infra is the network I was originally having issues with.
I moved the aardvark config directory aside to keep its contents (pointless, I suppose, since I ended up killing the process anyway), and then killed any aardvark processes (not unexpectedly, not killing the process resulted in a second aardvark running, probably because the old PID file wasn't there anymore).
Now after stopping all containers the lingering aardvark process seems to go away.
I don't see anything suspicious in what I pasted regarding the broken aardvark configuration. Can you see what the issue was? It was odd that it only had the config for infra in it, and not at least testnet, but then I would have expected that it doesn't cause issues in infra, which was my original problem.
Regarding the DNS configurations, they look correct to me, and, as mentioned, when entering the aardvark namespace I was able to query DNS properly. So I expect you mean something regarding the namespace configuration?
The contents of the old aardvark PID file, just in case that was somehow the problem, since I may or may not have pasted enough process information to tell whether it is wrong:
$ cat /tmp/containers-user-1000/containers/networks/aardvark-dns/aardvark.pid
1804003
The contents of the PID file look wrong compared with both the previously pasted systemd log and the lsns -T output:
├─4026532661 net 1 4065741 nixos unassigned /nix/store/z56sdzjn2s70204pc66bspdp3xi21pfr-podman-4.5.0/libexec/podman/aardvark-dns --config /tmp/containers-user-1000/containers/
...
Oct 18 14:31:49 nixos aardvark-dns[4065741]: Unable to start server unable to start CoreDns server: Cannot assign requested address (os error 99)
The PID file contains 1804003 and these logs both contain 4065741.
Do you think this is why an invalid namespace was being used? I understand that a lot of namespace referencing happens through the ns subdirectory of processes in /proc, so if the aardvark PID file is used for namespace management and the PID is wrong, it would make sense that there are problems there.
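(For what it's worth, a quick cross-check of that idea, reusing the PID-file path from above; just a sketch:
$ pid=$(cat /tmp/containers-user-1000/containers/networks/aardvark-dns/aardvark.pid)
$ kill -0 "$pid" && echo "pid $pid exists"
$ readlink "/proc/$pid/ns/net"
$ podman unshare --rootless-netns readlink /proc/self/ns/net
If the last two don't print the same net:[...] inode, the PID file points at an aardvark-dns that is not in the current rootless netns.)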
I knew I would shoot myself in the foot with this. 1804003 was in fact a valid PID at some point in my scroll.
I don't understand why, but I think I'm able to reproduce the issue by copying the presumably broken configuration directory's contents back.
I think I figured out a small bit. The network file in the aardvark config gets removed, presumably when the last user of said network is stopped. If a configuration file remains, the process does not stop. This would, for example, be the "infra" file that was still hanging around in there even though nothing was using the infra network. If I remove the lingering config file, aardvark stops when (presumably) the last container goes down.
I have some questions but this seems to be the proximate cause, so I figure I should get this out there.
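Roughly the checks I used to see this (a sketch, with the paths and names from earlier in the thread):
$ ls /tmp/containers-user-1000/containers/networks/aardvark-dns/   # while a testnet container is up: aardvark.pid plus a per-network file
$ podman stop c1
$ ls /tmp/containers-user-1000/containers/networks/aardvark-dns/   # the testnet entry should be gone once its last container stops
$ pgrep -a aardvark-dns                                            # and with no network entries left, aardvark-dns should exit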
On a didactic note: this means, however, that aardvark is supposed to linger as long as any containers on any network are still running, so I'm not sure why aardvark not shutting down is a problem, unless you were specifically referring to my case, where it stayed up even though no containers were running; in which case, OK.
My question then is: if it's fine for aardvark to linger after all containers in a given network are gone (but there are still containers running on another network), why exactly is it a problem if it lingers after all containers everywhere are gone? (My guess is you mean that there ends up being a namespace with no processes in it, and this somehow causes a problem? But how can I show that?)
That means that aardvark-dns was still running and kept running in the old, invalid namespace
Why does the above make the namespace invalid?
How can I show that the problem is what you suspect - that aardvark / something is trying to use invalid namespaces?
Sidenote: Is the CoreDNS error a/the real problem? If it is, I managed to find it reproduced at https://access.redhat.com/solutions/7008392, though I don't have an account, so it's secondhand information that this has the same workaround as the one I found.
Aardvark-dns leaking means that at some point prior to seeing these problems, the network cleanup failed for some container.
The sequence is: failed network cleanup (the aardvark config file stays around) -> aardvark-dns keeps running in its original namespace -> podman's cleanup logic sees that all containers are exited and removes the rootless-netns -> the next container start has no rootless-netns, so we create a new one -> netavark setup finds the aardvark PID and sees that it is running, so we send SIGHUP to make it read the new config values. Now, because aardvark-dns still runs in the old, now unused netns, it cannot communicate with the new containers created in a different netns. Because the interfaces are created in the new netns, aardvark-dns will log the bind error, as it cannot bind to a non-existent bridge.
This all works as I would expect; the only problem was the failed netns cleanup, which did not correctly remove the aardvark-dns config entry for the container. Unless you can reproduce this exact failure, there is not much that can be done. The only thing we can do is patch netavark to compare namespaces and, on a mismatch, just kill and restart aardvark-dns.
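Until netavark grows such a check, a rough user-side approximation of the same idea (a sketch only, not the actual netavark logic):
#!/bin/sh
# If aardvark-dns is running in a namespace other than the current rootless-netns,
# kill it so that the next container start respawns it in the right place.
aardvark_pid=$(pgrep -n aardvark-dns) || exit 0
rootless_ns=$(podman unshare --rootless-netns readlink /proc/self/ns/net)
aardvark_ns=$(readlink "/proc/$aardvark_pid/ns/net")
if [ "$rootless_ns" != "$aardvark_ns" ]; then
    echo "namespace mismatch ($aardvark_ns vs $rootless_ns); killing aardvark-dns pid $aardvark_pid"
    kill "$aardvark_pid"
fi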
A friendly reminder that this issue had no activity for 30 days.
ENOTIME for two months due to exams.
Issue Description
Sorry for the vague title; it can be made specific if we can figure out something more concrete. I spent a lot of time in the past few weeks fiddling with various podman problems and have gotten a bit tired and burned out on it. It would be nice to have some guidance on how to go forward from here. Sorry for the low-quality issue filing; if I have more energy later I will improve it.
I last tried this with podman 4.5.0.
I've thrown nsenter, pwru, some iptables tracing tools, and podman's debug output at this, and figured some things out.
I have a fairly vanilla rootless podman-compose configuration along with a network configured for it, and I'm having DNS issues: the containers cannot reach aardvark-dns; the port is not there. DNS is enabled on the network.
Basically, I have confirmed that the reason DNS replies are not received is that the packets are either going to the wrong place, or aardvark isn't listening on an interface it should be listening on. aardvark itself works, because I was able to enter its namespace and talk to it.
I managed to figure out a workaround, which consists of bringing up the containers, killing aardvark-dns, and then restarting any container to bring aardvark-dns back up.
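(Concretely, the workaround amounts to something like the following, with c1 standing in for any container on the affected network:
$ pkill aardvark-dns
$ podman restart c1
after which aardvark-dns comes back up in the right namespace.)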
Using one of (I don't currently remember which; I will test it later) podman unshare --rootless-netns, or nsenter to attach to the network namespace of the aardvark process, or something like that, shows that the interfaces used to wire up podman's networking are missing until I kill and restart aardvark.
Is there any way to debug aardvark bringup specifically?
podman info output