balena-io-modules / device-diagnostics

on-device diagnostics tool

Diagnostics check_networking test fails on IPv6 only networks. #298

Closed ab77 closed 2 years ago

ab77 commented 2 years ago

Expected Behaviour

Diagnostics check_networking passes on IPv6-only networks by attempting to resolve using public IPv6 DNS servers (e.g. https://developers.google.com/speed/public-dns/docs/using) instead of hard-coding IPv4 addresses.
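
A minimal sketch of that approach, assuming Google's public IPv6 resolvers (2001:4860:4860::8888 and 2001:4860:4860::8844) and an illustrative hostname; this is not the current check_networking code:

  # Fall back to well-known public IPv6 resolvers when no IPv4 upstream exists.
  # The hostname and resolver list below are illustrative placeholders.
  IPV6_DNS="2001:4860:4860::8888 2001:4860:4860::8844"
  for server in ${IPV6_DNS}; do
      if nslookup "api.balena-cloud.com" "${server}" >/dev/null 2>&1; then
          echo "DNS lookup succeeded via IPv6 upstream: ${server}"
          break
      fi
  done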

Actual Behaviour

Diagnostics check_networking test fails on IPv6-only networks.


Steps to Reproduce the Problem

  1. block IPv4 DHCP traffic (67-68/udp) on a dual IPv4/IPv6 network with public access on both address families (see the sketch after this list)
  2. provision a device
  3. observe actual behaviour
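
One way to simulate step 1 on the upstream router, assuming Linux with iptables; the rules are illustrative, not a tested recipe:

  # Drop DHCP (ports 67-68/udp) so devices cannot obtain an IPv4 lease,
  # while leaving IPv6 router advertisements and DHCPv6 untouched.
  iptables -A FORWARD -p udp --dport 67:68 -j DROP
  iptables -A INPUT -p udp --dport 67:68 -j DROP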

References

https://jel.ly.fish/brainstorm-topic-c3f8670a-8981-42ae-bcbf-1f57e959122d

klutchell commented 2 years ago

This is only somewhat related to the diagnostic script.

When testing DNS resolution we read the upstream servers directly from dnsmasq on the host, and by default in balenaOS we include 8.8.8.8 in the list of IPv4 DNS servers. https://www.balena.io/docs/reference/OS/configuration/#dnsservers

So in my mind the correct solution is to remove 8.8.8.8 from the server list on balenaOS if IPv4 networking is not available. This can be done by setting "dnsServers": "null" in config.json.
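
A hedged sketch of that first option, editing config.json in place with jq; the boot partition path is the usual host OS mount point but should be verified on the device, and "null" here is a literal string:

  # Drop the default 8.8.8.8 fallback by setting dnsServers to the string "null".
  # /mnt/boot is where balenaOS normally mounts the boot partition on the host.
  CONFIG=/mnt/boot/config.json
  jq '.dnsServers = "null"' "${CONFIG}" > /tmp/config.json && mv /tmp/config.json "${CONFIG}"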

OR, we change the diagnostics to report both successful and failed DNS servers so it's clear that the IPv6 servers from DHCP are working.

OR, we change the diagnostics to only report a failure if none of the servers can resolve, but this is removing some helpful context IMO.
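
For comparison, a rough sketch of that last option, where a failure is only reported if every upstream fails; DNS_SERVERS and TEST_HOSTNAME are illustrative names, not the script's real variables:

  # Count failures across all configured upstreams and only report when
  # none of them could resolve the test hostname.
  failures=0
  total=0
  failed_servers=""
  for server in ${DNS_SERVERS}; do
      total=$((total + 1))
      if ! nslookup "${TEST_HOSTNAME}" "${server}" >/dev/null 2>&1; then
          failures=$((failures + 1))
          failed_servers="${failed_servers} ${server}"
      fi
  done
  if [ "${total}" -gt 0 ] && [ "${failures}" -eq "${total}" ]; then
      echo "test_upstream_dns: DNS lookup failed for ${TEST_HOSTNAME} via all upstreams:${failed_servers}"
  fi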

klutchell commented 2 years ago

We agreed in an arch call brainstorm to change the wording to indicate that some of the servers worked and only some of them failed. Maybe raise a warning instead of an error in this case.

klutchell commented 2 years ago

I tried printing both successful and failed DNS endpoints, and I think it makes the result look cluttered.

  Some networking issues detected:
  test_upstream_dns: DNS lookup succeeded for 0.resinio.pool.ntp.org via upstream: 192.168.1.55
  test_upstream_dns: DNS lookup succeeded for api.6fd7efae1b8505c1a8b47da01934f844.bob.local via upstream: 192.168.1.55
  test_upstream_dns: DNS lookup failed for 0.resinio.pool.ntp.org via upstream: 8.8.8.8
  test_upstream_dns: DNS lookup succeeded for api.6fd7efae1b8505c1a8b47da01934f844.bob.local via upstream: 8.8.8.8


It's not immediately obvious to me that 3/4 of those results are good.

@myarmolinsky @chrisys Any suggestions here or should I bring it to a product call? We want to indicate in a clean way that some of the configured upstream DNS servers failed (but not necessarily all of them).

The current test only prints the failed ones, and it's unclear that the other servers are working fine. The test above prints both successful and failed endpoints, but it's pretty cluttered.

I might try to break each DNS endpoint into its own pass/fail test like we do for user app healthchecks?
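
A sketch of what those per-endpoint results might look like, one pass/fail line per upstream server (variable names are again illustrative):

  # Emit one named result per upstream server instead of a single aggregate.
  for server in ${DNS_SERVERS}; do
      if nslookup "${TEST_HOSTNAME}" "${server}" >/dev/null 2>&1; then
          echo "test_upstream_dns_${server}: ok"
      else
          echo "test_upstream_dns_${server}: DNS lookup failed for ${TEST_HOSTNAME}"
      fi
  done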

myarmolinsky commented 2 years ago

@klutchell Generally when I see an error I don't assume everything is broken; I just start investigating that individual error. So unless you split up the health check into multiple health checks to display successes and errors, I don't think it is necessary to tell the user which ones were successful. Perhaps that's just me though; definitely interested in @chrisys's thoughts.

klutchell commented 2 years ago

@myarmolinsky I think I agree with you. In fact, clicking on the failed test will link you to the documentation describing the test, where I can make it more clear that successful servers are not shown. https://www.balena.io/docs/reference/diagnostics/#test_upstream_dns