balena-os / balena-supervisor

Balena Supervisor: balena's agent on devices.
https://balena.io

Supervisor fails to resolve DNS on v4, v5 in offline/air-gapped setup using open-balena #2237

Open compiaffe opened 8 months ago

compiaffe commented 8 months ago

We deploy open-balena to an air-gapped network where the router resolves all the required balena domains (e.g. api.aivero.lan) and advertises itself as the DNS server via DHCP.

We run balena os configure on RaspberryPi3 images with balenaOS v2.80.3, and these connect nicely even in an air-gapped network.

However, these old images don’t have the fixed/updated HQ camera sensor-mode 5 1 so we need a newer version.

The newest versions, v5.0.8 and v2.115.18+rev2, do not connect to open-balena: the supervisor cannot resolve the domain, although the host OS can.

The supervisor errors with getaddrinfo EAI_AGAIN api.aivero.lan:

root@9dc1123:~# balena ps
CONTAINER ID   IMAGE                                                            COMMAND                  CREATED          STATUS                             PORTS     NAMES
c699ff174f56   registry2.balena-cloud.com/v2/c5636e5430e2762232e60e19e79c773f   "/usr/src/app/entry.…"   49 seconds ago   Up 41 seconds (health: starting)             balena_supervisor
root@9dc1123:~# balena logs c699ff174f56 -f
INFO: Found device /dev/mmcblk0p1 on current boot device mmcblk0, using as mount for '(resin|balena)-boot'.
INFO: Found device /dev/mmcblk0p5 on current boot device mmcblk0, using as mount for '(resin|balena)-state'.
INFO: Found device /dev/mmcblk0p6 on current boot device mmcblk0, using as mount for '(resin|balena)-data'.
find: /mnt/root/tmp/balena-supervisor/services: No such file or directory
[info]    Supervisor v15.0.4 starting up...
[info]    Setting host to discoverable
[debug]   Starting systemd unit: avahi-daemon.service
[debug]   Starting systemd unit: avahi-daemon.socket
[debug]   Starting logging infrastructure
[info]    Starting firewall
[warn]    Invalid firewall mode: . Reverting to state: off
[info]    Applying firewall mode: off
[success] Firewall mode applied
[debug]   Starting api binder
[debug]   Performing database cleanup for container log timestamps
[info]    Previous engine snapshot was not stored. Skipping cleanup.
[debug]   Handling of local mode switch is completed
(node:1) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
[info]    API Binder bound to: https://api.aivero.lan/v6/
[event]   Event: Supervisor start {}
[info]    Starting API server
[info]    Supervisor API successfully started on port 48484
[debug]   Ensuring device is provisioned
[debug]   Connectivity check enabled: true
[debug]   Starting periodic check for IP addresses
[event]   Event: Device bootstrap {}
[info]    Waiting for connectivity...
[info]    VPN connection is not active.
[info]    New device detected. Provisioning...
[success] Initialised splash image backend
[info]    Reporting initial state, supervisor version and API info
[info]    Attempting to load any preloaded applications
[error]   LogBackend: unexpected error: Error: getaddrinfo EAI_AGAIN api.aivero.lan
[error]         at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:119:26)
[event]   Event: Device bootstrap failed, retrying {"delay":30000,"error":{"cause":{},"isOperational":true,"errno":-3001,"code":"EAI_AGAIN","syscall":"getaddrinfo","hostname":"api.aivero.lan"}}
^C
root@9dc1123:~# ^C
root@9dc1123:~# ping api.aivero.lan
PING api.aivero.lan (192.168.88.243): 56 data bytes
64 bytes from 192.168.88.243: seq=0 ttl=64 time=1.528 ms
64 bytes from 192.168.88.243: seq=1 ttl=64 time=1.777 ms
^C
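For anyone collecting these failures across several devices, here is a minimal sketch of pulling the error code and hostname out of the "Device bootstrap failed, retrying" lines shown above. The helper name and types are hypothetical and not part of balena-supervisor; the sketch only assumes the line ends with the JSON payload seen in these logs.

```typescript
// Hypothetical helper, not part of balena-supervisor: extracts the JSON
// payload from a "Device bootstrap failed, retrying {...}" log line.
interface BootstrapFailure {
  delayMs: number | undefined;
  code: string | undefined;
  hostname: string | undefined;
}

function parseBootstrapFailure(line: string): BootstrapFailure | null {
  const marker = "Device bootstrap failed, retrying ";
  const idx = line.indexOf(marker);
  if (idx === -1) return null;

  let payload: unknown;
  try {
    payload = JSON.parse(line.slice(idx + marker.length));
  } catch {
    return null; // line was truncated or is not valid JSON
  }
  if (typeof payload !== "object" || payload === null) return null;

  const { delay, error } = payload as {
    delay?: unknown;
    error?: { code?: string; hostname?: string; message?: string };
  };

  // Some supervisor versions log only a message such as
  // "getaddrinfo EAI_AGAIN api.aivero.lan", so fall back to parsing that.
  const fromMessage = /getaddrinfo (\w+) (\S+)/.exec(error?.message ?? "");

  return {
    delayMs: typeof delay === "number" ? delay : undefined,
    code: error?.code ?? fromMessage?.[1],
    hostname: error?.hostname ?? fromMessage?.[2],
  };
}
```

Fed the failure line from the log above, this yields code EAI_AGAIN and hostname api.aivero.lan, with a retry delay of 30000 ms.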

We also tried adding a dnsServers: "null" entry to config.json to disable the automatic injection of 8.8.8.8 into the list of DNS servers. In certain cases, having 8.8.8.8 in the list caused a timeout while waiting for a response from that server, which is unreachable from our air-gapped network. However, this had no effect here.
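To make that config.json change concrete, here is a minimal sketch of applying it programmatically. The dnsServers key and the "null" string value are taken from the paragraph above; the helper names are hypothetical, and where config.json lives on the device and how it is written back are left out.

```typescript
// Sketch only: `dnsServers` and the "null" string value come from the post;
// the helper names below are hypothetical.
type DeviceConfig = Record<string, unknown>;

function withDnsServersDisabled(config: DeviceConfig): DeviceConfig {
  // Return a copy so the caller decides when to write it back to config.json.
  return { ...config, dnsServers: "null" };
}

// Parse the text of a config.json, patch it, and serialise it again.
function patchConfigText(configJson: string): string {
  const parsed = JSON.parse(configJson) as DeviceConfig;
  return JSON.stringify(withDnsServersDisabled(parsed), null, 2);
}
```

All other keys in the file are carried over unchanged; only dnsServers is added or overwritten.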


We found that the latest balenaOS version for RaspberryPi3 that has the HQ camera fix AND connects correctly is v2.94.4.

For the RaspberryPi4 we are using v2.88.4+rev0, which both has the HQ fix AND connects correctly.


There might be a connection to https://github.com/balena-os/balena-supervisor/issues/1335

How do we get the v5 version of balenaOS connecting correctly?


FYI, also posted here: https://forums.balena.io/t/supervisor-fails-to-resolve-dns-on-v4-v5-in-offline-air-gapped-setup-using-open-balena/369796

compiaffe commented 7 months ago

Any insights here?

cywang117 commented 7 months ago

#1335 is quite old, so it's not clear whether this new issue has the same root cause, especially since the problem wasn't present in OS v2.80.3.

I wonder if trying v15.2.0 will make a difference. For reference, here is a gist I use occasionally to preload a different Supervisor version into an OS. It may be useful to you as I'm not sure if Supervisor upgrades are available in openBalena.

jmalves5 commented 4 months ago

We are still seeing this issue.

We have tried v15.2.0 (and later versions), with its mDNS fixes, but we are getting the same result.

The last supervisor version that correctly resolves DNS queries is v14.0.8. Anything after that (starting at v14.0.13) does not resolve DNS queries.

Looking at the diffs between v14.0.8 and v14.0.13 the only change that seems kinda relevant is the removal of avahi-daemon (and respective configs) from the supervisor container image, but I don't see exactly how that could be causing the issue.

Any help? Thanks in advance

alexgg commented 4 months ago

Could you please try the latest v16.3.6 version? This was probably fixed in https://github.com/balena-os/balena-supervisor/pull/2311/commits/6f02b17968d02c2e27b523e40a25ef4c4815d20a.

jmalves5 commented 4 months ago

Thanks for the tip @alexgg we will try it and report back

jmalves5 commented 4 months ago

Even in the latest version we see the same issue inside the container:

INFO: Found device /dev/mmcblk0p1 on current boot device mmcblk0, using as mount for '(resin|balena)-boot'.
INFO: Found device /dev/mmcblk0p5 on current boot device mmcblk0, using as mount for '(resin|balena)-state'.
INFO: Found device /dev/mmcblk0p6 on current boot device mmcblk0, using as mount for '(resin|balena)-data'.
[info]    Supervisor v16.3.5 starting up...
[info]    Setting host to discoverable
[debug]   Starting systemd unit: avahi-daemon.service
[debug]   Starting systemd unit: avahi-daemon.socket
[debug]   Starting logging infrastructure
[info]    Starting firewall
[warn]    Invalid firewall mode: . Reverting to state: off
[info]    Applying firewall mode: off
[success] Firewall mode applied
[debug]   Starting api binder
[debug]   Performing database cleanup for container log timestamps
[info]    Previous engine snapshot was not stored. Skipping cleanup.
[debug]   Handling of local mode switch is completed
(node:1) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
(Use node --trace-deprecation ... to show where the warning was created)
[info]    API Binder bound to: https://api.aivero.lan/v6/
[event]   Event: Supervisor start {}
[info]    Starting API server
[info]    Supervisor API successfully started on port 48484
[debug]   Ensuring device is provisioned
[debug]   Connectivity check enabled: true
[debug]   Starting periodic check for IP addresses
[info]    Waiting for connectivity...
[event]   Event: Device bootstrap {}
[info]    VPN connection is not active.
[info]    New device detected. Provisioning...
[success] Initialised splash image backend
[info]    Reporting initial state, supervisor version and API info
[info]    Attempting to load any preloaded applications
[event]   Event: Device bootstrap failed, retrying {"delay":30000,"error":{"message":"getaddrinfo EAI_AGAIN api.aivero.lan","stack":"Error: getaddrinfo EAI_AGAIN api.aivero.lan\n    at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:118:26)"}}
[event]   Event: Device bootstrap {}
[info]    New device detected. Provisioning...
[event]   Event: Device bootstrap failed, retrying {"delay":30000,"error":{"message":"getaddrinfo EAI_AGAIN api.aivero.lan","stack":"Error: getaddrinfo EAI_AGAIN api.aivero.lan\n    at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:118:26)"}}

But balenaOS resolves api.aivero.lan just fine.
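One way to narrow this down from inside the supervisor container is to compare Node's two resolution paths. This is a rough sketch under assumptions, not something from the supervisor codebase: all function names are hypothetical, and the DNS server address has to be supplied by whoever runs it (the router's IP in this setup). dns.lookup() uses getaddrinfo, the call failing in the logs, while a Resolver with setServers() queries the given server directly.

```typescript
import { promises as dns } from "node:dns";

// Hypothetical diagnostic, not part of balena-supervisor.
type ProbeResult =
  | { ok: true; addresses: string[] }
  | { ok: false; code: string };

type Verdict = "both-ok" | "os-resolver-fails" | "direct-query-fails" | "both-fail";

function errorCode(err: unknown): string {
  if (typeof err === "object" && err !== null && "code" in err) {
    return String((err as { code: unknown }).code);
  }
  return "UNKNOWN";
}

// Path 1: dns.lookup() goes through the OS resolver (getaddrinfo), the same
// call that reports EAI_AGAIN in the supervisor logs.
async function probeGetaddrinfo(hostname: string): Promise<ProbeResult> {
  try {
    const results = await dns.lookup(hostname, { all: true });
    return { ok: true, addresses: results.map((r) => r.address) };
  } catch (err) {
    return { ok: false, code: errorCode(err) };
  }
}

// Path 2: query one DNS server directly, bypassing getaddrinfo and the
// servers listed in the container's /etc/resolv.conf.
async function probeDirect(hostname: string, server: string): Promise<ProbeResult> {
  const resolver = new dns.Resolver();
  resolver.setServers([server]);
  try {
    return { ok: true, addresses: await resolver.resolve4(hostname) };
  } catch (err) {
    return { ok: false, code: errorCode(err) };
  }
}

// "os-resolver-fails": the server answers, so the container's resolver
//   configuration is the likely suspect.
// "both-fail": the server may not be reachable from the container at all.
function interpret(viaOs: ProbeResult, direct: ProbeResult): Verdict {
  if (viaOs.ok && direct.ok) return "both-ok";
  if (!viaOs.ok && direct.ok) return "os-resolver-fails";
  if (viaOs.ok && !direct.ok) return "direct-query-fails";
  return "both-fail";
}

async function diagnose(hostname: string, server: string): Promise<Verdict> {
  const [viaOs, direct] = await Promise.all([
    probeGetaddrinfo(hostname),
    probeDirect(hostname, server),
  ]);
  return interpret(viaOs, direct);
}
```

Calling diagnose("api.aivero.lan", "<router IP>") in the container and again on the host would show whether the two environments differ only in the getaddrinfo path. The verdicts are hints for where to look next, not a root-cause analysis.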

compiaffe commented 4 months ago

@alexgg any idea how to further debug this?

alexgg commented 4 months ago

Hey @compiaffe, bring the issue through our support channels, such as the forums, and provide us with a reproduction.

compiaffe commented 4 months ago

@alexgg

We already have: https://forums.balena.io/t/supervisor-fails-to-resolve-dns-on-v4-v5-in-offline-air-gapped-setup-using-open-balena/369796

Reproduction is outstanding.