balena-os / balena-engine

Moby-based Container Engine for Embedded, IoT, and Edge uses
https://www.balena.io
Apache License 2.0

Error listening to events: Error: connect ECONNREFUSED /var/run/balena-engine.sock #227

Open jellyfish-bot opened 4 years ago

jellyfish-bot commented 4 years ago

[thgreasi] The supervisor of a RPi1 on 2.48.0+rev1 stopped working at some point. Running balena ps failed with:

Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running?

Running journalctl -fu balena -a -n 100 gave:

[error]   Error listening to events: Error: connect ECONNREFUSED /var/run/balena-engine.sock
[error]         at PipeConnectWrap.afterConnect [as oncomplete] (net.js:1097:14) Error: connect ECONNREFUSED /var/run/balena-engine.sock
[error]       at PipeConnectWrap.afterConnect [as oncomplete] (net.js:1097:14)

while journalctl -f -n 100 -u resin-supervisor gave:

systemd[1]: resin-supervisor.service: Start-pre operation timed out. Terminating.
systemd[1]: resin-supervisor.service: Control process exited, code=killed, status=15/TERM
resin-supervisor[4974]: deactivating
systemd[1]: resin-supervisor.service: Control process exited, code=exited, status=3/NOTIMPLEMENTED
systemd[1]: resin-supervisor.service: Failed with result 'timeout'.
systemd[1]: Failed to start Balena supervisor.

Confirmed with stat /var/run/balena-engine.sock that it's indeed a socket and not a directory.
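The stat check and the ECONNREFUSED error can be reproduced together with a small probe. The following is only a sketch, not part of the original report: the function name probe_unix_socket and its return labels are made up for illustration, and it assumes Python 3 is available wherever it is run (it may not be on the host OS itself).

```python
import os
import socket
import stat


def probe_unix_socket(path, timeout=2.0):
    """Classify the state of a Unix socket path (hypothetical helper).

    Returns "missing", "not-a-socket", "refused", "listening",
    or "error: <reason>" for any other failure.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    except OSError as exc:
        return f"error: {exc}"

    # Same check as `stat`: is the path a socket, or a file/directory?
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"

    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        try:
            s.connect(path)
        except ConnectionRefusedError:
            # The socket file exists but no process is accepting on it.
            return "refused"
        except OSError as exc:
            return f"error: {exc}"
    return "listening"
```

Calling probe_unix_socket("/var/run/balena-engine.sock") and getting "refused" would match the symptom in this issue: the socket file is present, but nothing is listening behind it.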

jellyfish-bot commented 4 years ago

[thgreasi] This issue has attached support thread https://jel.ly.fish/50a3f6f0-e0ae-49f7-ad89-34a517e26a8c

dt-rush commented 4 years ago

Updates from the thread's status hashtag about the possible cause:

systemd tried to shut down the balena engine (probably in response to a user command), but the stop operation timed out. systemd then entered a state in which it no longer considers the engine to be running, even though it is, and tries to start a new instance. The new instance fails to start, but it probably replaces the engine's socket anyway, which leaves the whole device in an uncontrollable state. Asked the user for more details, and communicated that a power cycle is the best way to break this stalemate.
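One way to check for the mismatch described above (an engine process that is still alive while systemd no longer tracks it) is to compare systemd's MainPID for the unit with the engine processes actually present. This is a rough sketch under assumptions that are not confirmed by the thread: the unit name balena.service is inferred from the journalctl command earlier, the process name balenad is a guess, and all three helper functions are hypothetical.

```python
import os
import subprocess


def systemd_main_pid(unit):
    """Return the PID systemd tracks as the unit's main process (0 if none)."""
    result = subprocess.run(
        ["systemctl", "show", "--property=MainPID", "--value", unit],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip() or 0)


def pids_by_name(name):
    """Scan /proc for processes whose command name equals `name` (Linux only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(pids)


def untracked_pids(main_pid, candidates):
    """Return candidate PIDs that are not the main PID systemd knows about."""
    return sorted(pid for pid in candidates if pid != main_pid)
```

For example, untracked_pids(systemd_main_pid("balena.service"), pids_by_name("balenad")) returning a non-empty list would be consistent with a stalled older engine instance. Child processes with the same name would also show up, so the result is a hint for further inspection rather than proof.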

At one point Florin got the device working by restarting the balena services and killing the older, stalled processes, but after a short while the device failed again. Permission was given to reboot. A reboot was initiated, and the device appears to be functioning normally so far.