Open everhardt opened 1 year ago
By the way, these are the settings NetworkManager reports when it starts:
Feb 09 09:36:19 27eb737 NetworkManager[1358]: <info> [1675935379.4474] dns-mgr: init: dns=default,systemd-resolved rc-manager=resolvconf
IIUC balena-os does not have systemd-resolved but dnsmasq?
The contents of resolv.dnsmasq -> /var/run/resolvconf/interface/NetworkManager in my case are:
nameserver 192.168.1.1
nameserver 8.8.8.8
nameserver 8.8.4.4
The first line is coming from the dhcp of eth0, the other two from wwan0.
I might be off on a tangent here, but I suspect that the DNS resolver uses the eth0 interface for all three name servers, even if eth0 cannot reach them. I think this could be solved if resolvconf would instead write
nameserver 192.168.1.1@eth0
nameserver 8.8.8.8@wwan0
nameserver 8.8.4.4@wwan0
so that the DNS resolver would know which interface to use.
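To make the idea concrete: as far as I know resolv.conf itself has no per-interface syntax, but dnsmasq's own `server=` option does accept an `@interface` suffix. Below is a minimal sketch that turns a per-interface list of name servers into such lines. The function name and the input mapping are made up for illustration, and whether balenaOS's dnsmasq setup would actually read a file generated this way is an assumption I have not verified.

```python
def dnsmasq_server_lines(servers_by_iface):
    """Build dnsmasq 'server=' lines that pin each upstream DNS server
    to the interface it was learned from (hypothetical helper)."""
    lines = []
    for iface, servers in servers_by_iface.items():
        for ip in servers:
            # dnsmasq syntax: server=<ipaddr>@<interface>
            lines.append(f"server={ip}@{iface}")
    return lines


if __name__ == "__main__":
    # Name servers as reported above: one from eth0's DHCP, two from wwan0.
    servers = {"eth0": ["192.168.1.1"], "wwan0": ["8.8.8.8", "8.8.4.4"]}
    print("\n".join(dnsmasq_server_lines(servers)))
```

With the name servers from this report, that yields one `server=` line per resolver, each bound to the interface that provided it.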
[mpous] This has attached https://jel.ly.fish/a235fd1d-49fb-44bb-aa4f-496d04dbb20b
Summarizing here some of the discussion we had on the other support thread. We found out that this happens when NetworkManager's CheckConnectivity is called explicitly from a container after the connection to some remote server is lost, with the aim of regaining connectivity sooner instead of relying only on the check interval. With the default behavior, which relies only on the connectivity check interval, everything works well, although we still saw similar DNS errors.
Those errors are hinted at here, and reportedly the connectivity check works better when systemd-resolved is present:
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/57d226d3f08d8a904a554367e799c9c367032b0d
Although the per-interface connectivity check was removed later on: https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/e6dac4f0b67e5abd10e0f8a82e040d8374f607a8
We do not currently use systemd-resolved; instead we use dnsmasq as a DNS forwarder and cache. Possibly the connectivity check will work better if we switch to systemd-resolved, but this still has to be investigated.
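For anyone trying to reproduce this, here is a rough sketch of what an explicit connectivity re-check from a container might look like. It shells out to `nmcli networking connectivity check`, which asks NetworkManager to re-run the check and prints the result. It assumes nmcli is installed in the container and that the host's D-Bus system socket is reachable from it; neither is guaranteed by anything in this thread. The numeric states in the table are the NMConnectivityState values that the D-Bus CheckConnectivity method returns.

```python
import subprocess

# NMConnectivityState values as returned by NetworkManager's CheckConnectivity.
NM_CONNECTIVITY = {
    0: "unknown",
    1: "none",
    2: "portal",
    3: "limited",
    4: "full",
}


def connectivity_name(state):
    """Map a numeric NMConnectivityState to its name; unknown values map to 'unknown'."""
    return NM_CONNECTIVITY.get(state, "unknown")


def check_connectivity():
    """Ask NetworkManager to re-check connectivity right now.

    Returns the state name printed by nmcli, e.g. 'full' or 'limited'.
    Assumes nmcli can reach the host's NetworkManager over D-Bus.
    """
    result = subprocess.run(
        ["nmcli", "networking", "connectivity", "check"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `check_connectivity()` in a tight loop after a remote server drops is roughly the pattern described above, so it may be a useful way to trigger the problem on purpose.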
Workarounds can be implemented on the container side for this as well - like adjusting metrics on the secondary interface from a container if both interfaces fail. Another possibility is disabling the NM connectivity check completely and doing a custom container solution that fits specific use-cases.
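To illustrate why the "both interfaces fail" case needs special handling, here is a small model of how the route-metric penalty plays out. It is only a sketch: the 20000 penalty is what NetworkManager adds to an interface that fails the connectivity check, while the base metrics of 100 for Ethernet and 700 for the modem are assumed typical defaults and may differ on a given device. The function names are made up for this example.

```python
# Penalty NetworkManager adds to the route metric of an interface
# that does not have full connectivity.
CONNECTIVITY_PENALTY = 20000


def effective_metric(base_metric, has_full_connectivity):
    """Return the route metric after applying the connectivity penalty."""
    if has_full_connectivity:
        return base_metric
    return base_metric + CONNECTIVITY_PENALTY


def preferred_interface(interfaces):
    """Pick the interface whose default route wins (lowest effective metric).

    interfaces: {name: (base_metric, has_full_connectivity)}
    """
    return min(interfaces, key=lambda name: effective_metric(*interfaces[name]))


if __name__ == "__main__":
    # Assumed base metrics: eth0 = 100, wwan0 = 700.
    print(preferred_interface({"eth0": (100, True), "wwan0": (700, True)}))
    print(preferred_interface({"eth0": (100, False), "wwan0": (700, True)}))
    print(preferred_interface({"eth0": (100, False), "wwan0": (700, False)}))
```

When both interfaces are marked as failing, both get the same penalty, so eth0 stays the preferred route even if wwan0 actually works. That is why a container-side workaround would have to change the base metric of one of the interfaces rather than rely on the penalty alone.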
@everhardt I think you're right, and the workaround here is to set specific DNS resolvers for the extra interfaces using the dnsServers entry in config.json.
I have a Compulab IOT-gate-imx8 device and tested balenaOS 2.108.29 as it includes the fix for #2964. The good news is that connectivity checks are performed again, the bad news is that it doesn't work properly.
The device is running with both eth0 and wwan0 (also called cdc-wdm0) connected to the internet (both with state "FULL" in NetworkManager terms). If I now break the eth0 connection, I would expect NetworkManager to detect at the next connectivity check that eth0 has no internet connection ("LIMITED" in NetworkManager terms), increase the route-metric of eth0 by 20000, and switch routing to wwan0. I tried this and it correctly detects that the eth0 connection is broken, but it also thinks wwan0 is broken; see these (filtered) journalctl logs: