Supervisor API doesn't start up and thus healthcheck never finishes

imrehg commented 5 years ago

In some cases (e.g. when running balenaOS in a container, see https://github.com/balena-os/resinos-in-container/issues/15) the kernel modules are not available on the host, so the supervisor may be unable to load the modules it needs when using ip6_tables.

When that happens, the API does not start up properly and the healthcheck never finishes, resulting in supervisor restart cycles:

Starting system message bus: dbus.
 * Starting Avahi mDNS/DNS-SD Daemon: avahi-daemon
   ...done.
modprobe: can't change directory to '4.15.0-1044-aws': No such file or directory
(node:1) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
Starting event tracker
Starting up api binder
Starting logging infrastructure
Event: Supervisor start {}
Performing database cleanup for container log timestamps
Connectivity check enabled: true
Starting periodic check for IP addresses
Reporting initial state, supervisor version and API info
VPN status path exists.
Waiting for connectivity...
Skipping preloading
Starting API server
Applying target state
Ensuring device is provisioned
Starting current state report
Starting target state poll
Error on switching supervisor API listening rules - stopping API.
   { Error: Command failed: ip6tables -A INPUT -p tcp --dport 48484 -j DROP
modprobe: can't change directory to '4.15.0-1044-aws': No such file or directory
ip6tables v1.6.2: can't initialize ip6tables table `filter': Table does not exist (do you need to insmod?)
Perhaps ip6tables or your kernel needs to be upgraded.
    at ChildProcess.exithandler (child_process.js:294:12)
    at ChildProcess.emit (events.js:189:13)
    at maybeClose (internal/child_process.js:970:16)
    at Socket.stream.socket.on (internal/child_process.js:389:11)
    at Socket.emit (events.js:189:13)
    at Pipe._handle.close (net.js:597:12)
  cause:
   { Error: Command failed: ip6tables -A INPUT -p tcp --dport 48484 -j DROP
   modprobe: can't change directory to '4.15.0-1044-aws': No such file or directory
   ip6tables v1.6.2: can't initialize ip6tables table `filter': Table does not exist (do you need to insmod?)
   Perhaps ip6tables or your kernel needs to be upgraded.

       at ChildProcess.exithandler (child_process.js:294:12)
       at ChildProcess.emit (events.js:189:13)
       at maybeClose (internal/child_process.js:970:16)
       at Socket.stream.socket.on (internal/child_process.js:389:11)
       at Socket.emit (events.js:189:13)
       at Pipe._handle.close (net.js:597:12)
     killed: false,
     code: 3,
     signal: null,
     cmd: 'ip6tables -A INPUT -p tcp --dport 48484 -j DROP' },
  isOperational: true,
  killed: false,
  code: 3,
  signal: null,
  cmd: 'ip6tables -A INPUT -p tcp --dport 48484 -j DROP' }
Event: Service start {"service":{"appId":1485973,"serviceId":294887,"serviceName":"ipfs","releaseId":1044939}}
Event: Service started {"service":{"appId":1485973,"serviceId":294887,"serviceName":"ipfs","releaseId":1044939}}
Finished applying target state
Apply success!
Applying target state
Finished applying target state
Apply success!
Internet Connectivity: OK
.... (regular logs, until supervisor restart)
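
The failure in the log above can be recognised from the ip6tables stderr text. Below is a minimal sketch of such a check. The function name `isMissingIp6tablesSupport` is hypothetical and not part of the supervisor, and the matched strings are taken only from the log lines shown here, so other ip6tables versions may word the error differently.

```typescript
// Hypothetical helper (not supervisor code): returns true when the stderr of a
// failed ip6tables command looks like the "filter table missing" failure above.
function isMissingIp6tablesSupport(stderr: string): boolean {
  // Both phrases appear in the ip6tables v1.6.2 output quoted in this issue.
  return (
    stderr.includes("can't initialize ip6tables table") ||
    stderr.includes('Table does not exist')
  );
}
```

A caller could use this to tell "the kernel has no ip6tables support" apart from other rule failures before deciding how to proceed.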

Inside the supervisor container I see:

root@b039b25:~# balena exec -ti resin_supervisor /bin/sh
/usr/src/app # ps
  PID USER       VSZ STAT COMMAND
    1 root      878m S    node /usr/src/app/dist/app.js
   17 messageb  3200 S    /usr/bin/dbus-daemon --system
   23 avahi     3616 S    avahi-daemon: running [b039b25.local]
   24 avahi     3484 S    avahi-daemon: chroot helper
   56 root      4100 S    /bin/sh -c wget -qO- http://127.0.0.1:${LISTEN_PORT:-48484}/v1/healthy || exit 1
   62 root      4100 S    wget -qO- http://127.0.0.1:48484/v1/healthy
   63 root      4144 S    /bin/sh
   69 root      4144 R    ps
/usr/src/app # wget -qO- http://127.0.0.1:${LISTEN_PORT:-48484}/v1/healthy 
^C
/usr/src/app # wget -O- http://127.0.0.1:${LISTEN_PORT:-48484}/v1/healthy 
Connecting to 127.0.0.1:48484 (127.0.0.1:48484)
(hanging there)

balena-ci commented 4 years ago

[thgreasi] This issue has attached support thread https://jel.ly.fish/#/support-thread~f5630c4b-f15e-4c93-be1f-8c404dccf6fd

balena-ci commented 4 years ago

[thgreasi] This issue has attached support thread https://jel.ly.fish/#/support-thread~5e10604e-e86a-4931-84c9-3e3b972857c5

thgreasi commented 4 years ago

Saw this happening twice on balenaos-on-docker on an EC2 host, while using the balenaos-in-container scripts. See: https://github.com/balena-os/balenaos-in-container/blob/master/balenaos-in-container.sh

CameronDiver commented 4 years ago

I'm not sure what to do here, because we shouldn't really be starting up the supervisor API if we can't lock down the device. It's a little annoying that the healthcheck relies on this; perhaps we should start up the healthcheck only?
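
A rough sketch of the "healthcheck only" idea, assuming a fallback listener that answers `/v1/healthy` and refuses everything else. The names `lockedDownStatus` and `createHealthOnlyServer` are hypothetical, the 200/503 status codes are an illustrative choice, and this is not how the supervisor is currently structured.

```typescript
import * as http from 'http';

// Hypothetical helper: picks the HTTP status for a request when the firewall
// rules could not be applied. Only the healthcheck path gets a success reply.
function lockedDownStatus(method: string, url: string): number {
  if (method === 'GET' && url === '/v1/healthy') {
    return 200;
  }
  // Everything else is refused, since the device could not be locked down.
  return 503;
}

// Hypothetical fallback server that exposes nothing but the healthcheck.
function createHealthOnlyServer(): http.Server {
  return http.createServer((req, res) => {
    res.statusCode = lockedDownStatus(req.method ?? '', req.url ?? '');
    res.end();
  });
}

// Usage (assumption: bind to loopback only, matching the wget healthcheck):
// createHealthOnlyServer().listen(48484, '127.0.0.1');
```

Binding to 127.0.0.1 would keep the port unreachable from outside the device, while the container healthcheck shown earlier (`wget -qO- http://127.0.0.1:48484/v1/healthy`) could still complete.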

balena-os / balena-supervisor

Supervisor API doesn't start up and thus healthcheck never finishes #1098