amd-isaac closed this issue 10 months ago
Could you please paste the version of the CNI plugins being used as well? I am not sure if CNI is recommended for this version of aardvark/netavark. @Luap99 could confirm this better.
Yes, as far as upstream is concerned, CNI support is deprecated and dnsname is basically EOL. Please test with netavark + aardvark-dns instead. Although I assume this is a simple race condition: if you add a `sleep 1` before the curl, does it work better?
@flouthoc here are the versions:
# /usr/libexec/cni/bridge --help
CNI bridge plugin version unknown
CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0
# /usr/libexec/cni/dnsname --help
CNI dnsname plugin
version: 1.4.0-dev
commit: 6685f68dbc13a95b73b9394b304927c6f518021c
CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0
@Luap99 adding the `sleep 1` before the curl does not fix the issue. Running the command manually within the container multiple times, without restarting the container, shows the same random failures. I also tried switching to the netavark backend and still see the same failure (running with netavark 1.5.1 and aardvark-dns 1.5.0).
Should this be taken to Bugzilla?
I think getting this into the RHEL channels would help with prioritization.
In any case, if this happens for both CNI and netavark, then I would not think it is a problem in the DNS server itself. Did you try to capture the packets and see where they get lost?
@Luap99 here's a tcpdump of a successful (first trace) and an unsuccessful (second trace) connection, from the perspective of the sender.
Successful connection:
20:52:04.835352 IP c3fbc69897c9.51345 > host.containers.internal.domain: 49549+ A? receiver.dns.podman. (37)
20:52:04.835385 IP c3fbc69897c9.51345 > host.containers.internal.domain: 43393+ AAAA? receiver.dns.podman. (37)
20:52:04.835488 IP host.containers.internal.domain > c3fbc69897c9.51345: 49549 1/0/0 A 10.89.0.46 (53)
20:52:04.835506 IP host.containers.internal.domain > c3fbc69897c9.51345: 43393 0/0/0 (37)
20:52:04.835665 IP c3fbc69897c9.35880 > 10.89.0.46.http: Flags [S], seq 3431928952, win 29200, options [mss 1460,sackOK,TS val 1511773294 ecr 0,nop,wscale 8], length 0
20:52:04.835688 IP 10.89.0.46.http > c3fbc69897c9.35880: Flags [S.], seq 1530778552, ack 3431928953, win 28960, options [mss 1460,sackOK,TS val 2943561629 ecr 1511773294,nop,wscale 8], length 0
20:52:04.835694 IP c3fbc69897c9.35880 > 10.89.0.46.http: Flags [.], ack 1, win 115, options [nop,nop,TS val 1511773294 ecr 2943561629], length 0
20:52:04.835715 IP c3fbc69897c9.35880 > 10.89.0.46.http: Flags [P.], seq 1:73, ack 1, win 115, options [nop,nop,TS val 1511773294 ecr 2943561629], length 72: HTTP: GET / HTTP/1.1
20:52:04.835735 IP 10.89.0.46.http > c3fbc69897c9.35880: Flags [.], ack 73, win 114, options [nop,nop,TS val 2943561629 ecr 1511773294], length 0
20:52:04.835918 IP 10.89.0.46.http > c3fbc69897c9.35880: Flags [P.], seq 1:271, ack 73, win 114, options [nop,nop,TS val 2943561630 ecr 1511773294], length 270: HTTP: HTTP/1.1 200 OK
20:52:04.835927 IP c3fbc69897c9.35880 > 10.89.0.46.http: Flags [.], ack 271, win 119, options [nop,nop,TS val 1511773295 ecr 2943561630], length 0
20:52:04.835984 IP c3fbc69897c9.35880 > 10.89.0.46.http: Flags [F.], seq 73, ack 271, win 119, options [nop,nop,TS val 1511773295 ecr 2943561630], length 0
20:52:04.836565 IP 10.89.0.46.http > c3fbc69897c9.35880: Flags [F.], seq 271, ack 74, win 114, options [nop,nop,TS val 2943561630 ecr 1511773295], length 0
20:52:04.836570 IP c3fbc69897c9.35880 > 10.89.0.46.http: Flags [.], ack 272, win 119, options [nop,nop,TS val 1511773295 ecr 2943561630], length 0
Could not resolve host:
20:52:09.140300 IP c3fbc69897c9.54750 > atldns02.amd.com.domain: 64723+ A? receiver.dns.podman. (37)
20:52:09.140349 IP c3fbc69897c9.54750 > atldns02.amd.com.domain: 29399+ AAAA? receiver.dns.podman. (37)
20:52:09.140802 IP atldns02.amd.com.domain > c3fbc69897c9.54750: 29399 NXDomain 0/1/0 (112)
20:52:09.140805 IP atldns02.amd.com.domain > c3fbc69897c9.54750: 64723 NXDomain 0/1/0 (112)
20:52:09.140858 IP c3fbc69897c9.52989 > host.containers.internal.domain: 64669+ A? receiver.amd.com. (34)
20:52:09.140883 IP c3fbc69897c9.52989 > host.containers.internal.domain: 52378+ AAAA? receiver.amd.com. (34)
20:52:09.141498 IP host.containers.internal.domain > c3fbc69897c9.52989: 64669 NXDomain* 0/1/0 (101)
20:52:09.141509 IP host.containers.internal.domain > c3fbc69897c9.52989: 52378 NXDomain* 0/1/0 (101)
20:52:09.141537 IP c3fbc69897c9.47764 > atldns01.amd.com.domain: 40416+ A? receiver. (26)
20:52:09.141556 IP c3fbc69897c9.47764 > atldns01.amd.com.domain: 37857+ AAAA? receiver. (26)
20:52:09.141910 IP atldns01.amd.com.domain > c3fbc69897c9.47764: 40416 NXDomain 0/1/0 (101)
20:52:09.141913 IP atldns01.amd.com.domain > c3fbc69897c9.47764: 37857 NXDomain 0/1/0 (101)
20:52:10.287892 ARP, Request who-has 10.89.0.46 tell c3fbc69897c9, length 28
20:52:10.287892 ARP, Request who-has host.containers.internal tell c3fbc69897c9, length 28
20:52:10.287885 ARP, Request who-has c3fbc69897c9 tell host.containers.internal, length 28
20:52:10.287908 ARP, Request who-has c3fbc69897c9 tell 10.89.0.46, length 28
20:52:10.287911 ARP, Reply c3fbc69897c9 is-at ae:b9:59:5b:e1:8e (oui Unknown), length 28
20:52:10.287912 ARP, Reply c3fbc69897c9 is-at ae:b9:59:5b:e1:8e (oui Unknown), length 28
20:52:10.287915 ARP, Reply host.containers.internal is-at 96:b9:d0:15:d4:97 (oui Unknown), length 28
20:52:10.287917 ARP, Reply 10.89.0.46 is-at f6:5c:9a:66:f2:56 (oui Unknown), length 28
After debugging this issue further, it appears to be similar to #19515. It involves the host's `/etc/resolv.conf` settings being imported into the container's `/etc/resolv.conf` file; the `rotate` option, in conjunction with the additional nameservers, leads to inconsistent DNS resolution failures for inter-container communication. I am currently using a workaround in which the container edits its own `resolv.conf` contents with `sed` to remove the offending `rotate` option.
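For reference, the workaround looks roughly like this (a sketch; the `strip_rotate` helper name is mine, and the pattern assumes `rotate` appears on an `options` line):

```shell
# strip_rotate FILE: remove the "rotate" keyword from any "options" line.
# Inside a container /etc/resolv.conf is usually a bind mount, so the
# contents are rewritten in place; `sed -i` would try to replace the file
# itself, which fails on a bind mount.
strip_rotate() {
    conf="$1"
    tmp=$(mktemp)
    sed 's/^\(options.*\) rotate\(.*\)$/\1\2/' "$conf" > "$tmp"
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# In the container's entrypoint:
#   strip_rotate /etc/resolv.conf
```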
I have been unable to find a podman-specific solution to this problem with version 4.4.1; does this version have a way to prevent podman from importing the host's `resolv.conf` settings?
What if you volume mount in your own `/etc/resolv.conf`?
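A sketch of that approach (the nameserver value is a placeholder; with dnsname or aardvark-dns the container still needs the network's DNS server listed, typically the bridge gateway address, or container-name resolution breaks entirely):

```shell
# Write a resolv.conf without the host's "rotate" option; 10.89.0.1 is a
# placeholder for the container network's DNS server address.
cat > ./container-resolv.conf <<'EOF'
nameserver 10.89.0.1
search dns.podman
EOF

# Bind-mount it read-only over the container's /etc/resolv.conf:
#   podman run -it --rm --net testnet --name sender \
#       -v "$PWD/container-resolv.conf:/etc/resolv.conf:ro" \
#       ubi8/ubi curl receiver:80
```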
We allow `--dns-search=.` to unset the search domains, but `--dns-option=.` does not work currently, so I feel like this is something we could do to allow removing options.
However, what you describe should only apply to CNI and not netavark. With netavark it should only add the aardvark-dns IPs to `resolv.conf`, so even with the `rotate` option in the host's `resolv.conf` it should just work.
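For illustration, the supported form next to the one that does not work yet (a sketch of CLI usage, not runnable without podman):

```shell
# Supported: unset the search domains copied from the host.
#   podman run --rm --dns-search=. --net testnet ubi8/ubi cat /etc/resolv.conf
#
# Not yet supported: the analogous form for options, which would drop
# entries such as "rotate".
#   podman run --rm --dns-option=. ...
```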
After further testing, it appears that with podman version 4.4.1 both the netavark and CNI backends have this issue. However, with version 4.6.1 the netavark backend does not have this issue (the CNI backend still does). Presumably something changed between 4.4.1 and 4.6.1 that fixed this for netavark, as @Luap99 indicated. I would be interested in testing a `--dns-option=.` flag in the future, but for now I've been able to work around the issue.
Issue Description
Running RHEL 8.8 and Podman 4.4.1, using the CNI network backend with dnsname installed. When containers that share the same network attempt to communicate with each other by name, the results are inconsistent.
Steps to reproduce the issue
1. `podman network create testnet`
2. `podman run -dt --net testnet --name receiver docker.io/library/httpd`
3. `podman run -it --rm --net testnet --name sender ubi8/ubi curl receiver:80`
Repeat step 3 20 times.
Describe the results you received
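The repetition can be scripted to tally the inconsistent results, e.g. (a sketch; `run_sender` and `repeat_count` are illustrative helper names):

```shell
# run_sender: one attempt of step 3 (curl exits non-zero with
# "Could not resolve host" on the failing runs).
run_sender() {
    podman run --rm --net testnet ubi8/ubi curl -sS -o /dev/null receiver:80
}

# repeat_count CMD N: run CMD N times and report how many succeeded.
repeat_count() {
    cmd="$1"; n="$2"; ok=0; fail=0; i=1
    while [ "$i" -le "$n" ]; do
        if "$cmd" >/dev/null 2>&1; then ok=$((ok + 1)); else fail=$((fail + 1)); fi
        i=$((i + 1))
    done
    echo "ok=$ok fail=$fail"
}

# Usage: repeat_count run_sender 20
```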
Describe the results you expected
Expected all attempts by `sender` to communicate with `receiver` to succeed.
podman info output
Podman in a container
No
Privileged Or Rootless
None
Upstream Latest Release
Yes
Additional environment details
None
Additional information
The exact order and number of successes/failures appears random; rerunning 20 more times will not show the same pattern. The behavior is also seen on a separate RHEL 8.7 system running Podman 4.4.1 with the same network backend configuration.