Azure / azure-linux-extensions

Linux Virtual Machine Extensions for Azure
Apache License 2.0

Forced IPv6 DNS resolution even if IPv6 fully disabled #1246

Open marinnedea opened 3 years ago

marinnedea commented 3 years ago

The scenario is:

Barracuda image, which requires the IP address to be set to static at the appliance level and DHCP to be disabled. To avoid any issues, the same IP is also configured as static in the Azure Portal, although this does not affect the OS side, since DHCP is disabled at the OS level.

IPv6 is also completely disabled.

Important note: Barracuda relies on a chrooted environment for waagent, which prevents waagent from accessing the /etc/resolv.conf file directly (this is already subject to change on the Barracuda side).

The problem:

Any extension that needs to download a script (Custom Script Extension, or run-command invoke if the command includes a URL for any reason), and therefore needs DNS resolution, fails with the following error:

time=2020-10-27T11:14:40Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 event="download start"
time=2020-10-27T11:14:40Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=0 error="http request failed: Get [REDACTED] dial tcp: lookup barracudanm1.blob.core.windows.net on [::1]:53: dial udp [::1]:53: socket: address family not supported by protocol"
time=2020-10-27T11:14:40Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=0 sleep=3s
time=2020-10-27T11:14:43Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=1 error="http request failed: Get [REDACTED] dial tcp: lookup barracudanm1.blob.core.windows.net on [::1]:53: dial udp [::1]:53: socket: address family not supported by protocol"
time=2020-10-27T11:14:43Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=1 sleep=6s
time=2020-10-27T11:14:49Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=2 error="http request failed: Get [REDACTED] dial tcp: lookup barracudanm1.blob.core.windows.net on [::1]:53: dial udp [::1]:53: socket: address family not supported by protocol"
time=2020-10-27T11:14:49Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=2 sleep=12s
time=2020-10-27T11:15:01Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=3 error="http request failed: Get [REDACTED] dial tcp: lookup barracudanm1.blob.core.windows.net on [::1]:53: dial udp [::1]:53: socket: address family not supported by protocol"
time=2020-10-27T11:15:01Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=3 sleep=24s
time=2020-10-27T11:15:25Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=4 error="http request failed: Get [REDACTED] dial tcp: lookup barracudanm1.blob.core.windows.net on [::1]:53: dial udp [::1]:53: socket: address family not supported by protocol"
time=2020-10-27T11:15:25Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=4 sleep=48s
time=2020-10-27T11:16:13Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=5 error="http request failed: Get [REDACTED] dial tcp: lookup barracudanm1.blob.core.windows.net on [::1]:53: dial udp [::1]:53: socket: address family not supported by protocol"
time=2020-10-27T11:16:13Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=5 sleep=1m36s
time=2020-10-27T11:17:49Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 retry=6 error="http request failed: Get [REDACTED] dial tcp: lookup barracudanm1.blob.core.windows.net on [::1]:53: dial udp [::1]:53: socket: address family not supported by protocol"
time=2020-10-27T11:17:49Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 file=0 event="download failed" error="failed to download file: http request failed: Get [REDACTED] dial tcp: lookup barracudanm1.blob.core.windows.net on [::1]:53: dial udp [::1]:53: socket: address family not supported by protocol"
time=2020-10-27T11:17:49Z version=v2.1.3/git@4cd2b9f-clean operation=enable seq=1 event="failed to handle" error="processing file downloads failed: failed to download file[0]: failed to download file: http request failed: Get [REDACTED] dial tcp: lookup barracudanm1.blob.core.windows.net on [::1]:53: dial udp [::1]:53: socket: address family not supported by protocol"
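The sleep values in the log above double on every retry (3s, 6s, 12s, 24s, 48s, 1m36s). As a rough illustration of how long a single download is retried before giving up, the schedule can be reproduced with a short sketch. The base delay and retry count are read off the log, not taken from the extension source, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def backoff_schedule(base_seconds, retries):
    """Return the sleep before each retry, doubling every time."""
    return [base_seconds * 2 ** attempt for attempt in range(retries)]


# Assumptions taken from the log: the first sleep is 3s and six sleeps
# are logged (retry=0 through retry=5).
delays = backoff_schedule(base_seconds=3, retries=6)
total = timedelta(seconds=sum(delays))

# Timestamp of the "download start" event in the log.
start = datetime.fromisoformat("2020-10-27T11:14:40")

print(delays)          # [3, 6, 12, 24, 48, 96]
print(total)           # 0:03:09
print(start + total)   # 2020-10-27 11:17:49
```

The computed end time matches the "download failed" timestamp in the log, so each failed download spends a little over three minutes in retries.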

The problem is that, for some reason, when WaLinuxAgent tries to download the file, the DNS resolver switches to IPv6, even though IPv6 is not enabled and the IPv4 resolv.conf file is missing/inaccessible (!?). See the "on [::1]:53: dial udp [::1]:53: socket: address family not supported by protocol" part of the errors received. Normally, it should just raise an error saying that the name could not be resolved or that no DNS server is configured, or anything else a bit more meaningful.

Found the following https://access.redhat.com/solutions/15863 (requires RedHat account to access it). Essentially, the above says:

Applications like ssh and telnet use the getaddrinfo() function with AF_UNSPEC and this function invokes both AAAA (ipv6) and A (ipv4) lookups one after the other. This can delay the connection time when DNS servers block or don't handle IPV6 correctly. Most application that are part of Red Hat Enterprise Linux offer a configuration option to disable IPv6 (or IPv4 for that matter) completely. It is advisable that any third-party application provides similar solutions. getaddrinfo() can specify if IPv4, IPv6 or both should be used as explained in man getaddrinfo:

 ai_family  This field specifies the desired  address  family  for  the
         returned  addresses.   Valid  values for this field include
         AF_INET and AF_INET6.  The value AF_UNSPEC  indicates  that
         getaddrinfo()   should  return  socket  addresses  for  any
         address family (either IPv4 or IPv6, for example) that  can
         be used with node and service.

getaddrinfo will perform IPv4 and IPv6 lookups when using AF_UNSPEC.
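To make the ai_family behaviour concrete, here is a minimal sketch using Python's socket.getaddrinfo, which wraps the same C function. The helper name resolve is hypothetical, and a numeric address is used so the example runs without any DNS server; it only shows how a caller restricts results to one address family and is not how WaLinuxAgent or the extensions currently resolve names.

```python
import socket


def resolve(host, port, family=socket.AF_UNSPEC):
    """Return the distinct (family, address) pairs that getaddrinfo yields."""
    infos = socket.getaddrinfo(host, port, family=family, type=socket.SOCK_STREAM)
    seen = []
    for fam, _type, _proto, _canonname, sockaddr in infos:
        pair = (fam, sockaddr[0])
        if pair not in seen:
            seen.append(pair)
    return seen


# AF_INET restricts the answer to IPv4; AF_UNSPEC (the default) allows
# both IPv4 and IPv6 results for a real host name.
print(resolve("127.0.0.1", 443, family=socket.AF_INET))
```

Passing socket.AF_INET6 instead would restrict the result to IPv6 addresses.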

The reason this is not disabled by default in RedHat is due to a conflict between the RFC and the requirement for IPv4-only lookups.

RedHat also provides a library that will unset LD_PRELOAD=libwgetaddrinfo.so, but notes that this is an unsupported solution because of the RFC conflict mentioned above.

Considering the above, this should be implemented at the WaLinuxAgent level or in the extensions that download files, as per RedHat's advice. Also, please consider adding INFO/WARNING/ERROR messages to the extension handler logs if DNS fails on IPv4, and lowering the time-outs.
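As one possible shape for such a message, the sketch below parses the "lookup <host> on <server>:<port>" fragment of the logged error and reports when the query was sent to a loopback resolver. The function name, the regular expression, and the wording of the messages are all hypothetical, and treating a loopback resolver as a sign of a missing or unreadable /etc/resolv.conf is an assumption based on the scenario described in this issue.

```python
import ipaddress
import re

# Matches the "lookup <host> on <server>:<port>" fragment of the logged error.
LOOKUP_RE = re.compile(
    r"lookup (?P<host>\S+) on (?P<server>\[[^\]]+\]|[^\s:]+):(?P<port>\d+)"
)


def diagnose(error_text):
    """Return a more descriptive message for a DNS lookup error, or None."""
    match = LOOKUP_RE.search(error_text)
    if match is None:
        return None
    host = match.group("host")
    server = match.group("server").strip("[]")
    try:
        address = ipaddress.ip_address(server)
    except ValueError:
        return f"WARNING: lookup of {host} failed against resolver {server}"
    if address.is_loopback:
        # Assumption: a loopback resolver means no usable nameserver was found.
        return (
            f"ERROR: lookup of {host} was sent to loopback resolver {address} "
            f"(IPv{address.version}); check that /etc/resolv.conf is readable "
            "and lists a nameserver"
        )
    return f"WARNING: lookup of {host} failed against resolver {address}"


logged = (
    "dial tcp: lookup barracudanm1.blob.core.windows.net on [::1]:53: "
    "dial udp [::1]:53: socket: address family not supported by protocol"
)
print(diagnose(logged))
```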

Currently, if you unset LD_PRELOAD=libwgetaddrinfo.so, WaLinuxAgent keeps trying to query the IPv4 DNS for 90 minutes, until the extension deployment times out, which is not OK.

marinnedea commented 3 years ago

I was able to reproduce the problem on all RedHat/CentOS by simply disabling IPv6 and removing the IPv4 DNS nameservers from /etc/resolv.conf:

Steps to reproduce: append the lines below to /etc/sysctl.conf:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1

and then just run:

sudo sysctl -p

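Remove the IPv4 entries in /etc/resolv.conf (no need to back up; a simple "systemctl restart network" will restore the file). Note that the redirect has to run as root, so pipe through tee rather than using "sudo echo":

echo "" | sudo tee /etc/resolv.conf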

At this point, try to run CustomScriptForLinux Extension.
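Before running the extension, it can help to confirm that the VM is really in the state described by the steps above. The following is a minimal sketch with hypothetical helper names; it reads /etc/resolv.conf and the /proc entry behind the net.ipv6.conf.all.disable_ipv6 sysctl. Treating a missing /proc entry as "IPv6 disabled" is an assumption.

```python
from pathlib import Path


def nameservers(resolv_conf_text):
    """Return the addresses listed on 'nameserver' lines of resolv.conf text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def ipv6_disabled(sysctl_path="/proc/sys/net/ipv6/conf/all/disable_ipv6"):
    """Return True if the kernel reports IPv6 as disabled."""
    path = Path(sysctl_path)
    if not path.exists():
        # Assumption: no such entry means the kernel has no IPv6 support loaded.
        return True
    return path.read_text().strip() == "1"


if __name__ == "__main__":
    resolv = Path("/etc/resolv.conf")
    text = resolv.read_text() if resolv.exists() else ""
    print("nameservers:", nameservers(text))
    print("IPv6 disabled:", ipv6_disabled())
```

For the reproduction to be valid, the nameserver list should be empty and IPv6 should be reported as disabled.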