balena-os / balena-engine

Moby-based Container Engine for Embedded, IoT, and Edge uses
https://www.balena.io
Apache License 2.0
694 stars 66 forks source link

connecting to shim: dial unix \x00/containerd-shim/moby/0a3da4c.../shim.sock: connect: connection refused #192

Open pdcastro opened 4 years ago

pdcastro commented 4 years ago

User shared a 2.44.0+rev1 Intel NUC device where balenaEngine was failing to start with errors including "dial unix \x00/containerd-shim/moby/0a3da4c.../shim.sock: connect: connection refused", among several others. See log excerpts below; full "journalctl -au balena" logs are attached to the support issue (JF link in a comment).

Restarting services was not sufficient to recover the device. The solution described in issue #186 was attempted in order to recover the device, but it did not help. (To be clear, balenaEngine was already failing to start before the solution described in #186 was attempted.)

It looks like DNS lookup was failing, and leading to the image pull failing:

Nov 11 22:31:28 121a4e5 5300528fca1c[766]: [event]   Event: Docker image download {"image":{"name":"registry2.balena-cloud.com/v2/6436a6fa63b4f0e3f9f4fe2d028047fd@sha256:efb59fb53c18a3ab7371eb94d09959c68068a1d06762b36023982a8385bcaa27","appId":1462350,"serviceId":259247,"serviceName":"db","imageId":1664701,"releaseId":1139876,"dependent":0,"dockerImageId":null}}
Nov 11 22:31:28 121a4e5 balenad[766]: time="2019-11-11T22:31:28.011603871Z" level=warning msg="Error getting v2 registry: Get https://registry2.balena-cloud.com/v2/: dial tcp: lookup registry2.balena-cloud.com on 127.0.0.2:53: server misbehaving"
Nov 11 22:31:28 121a4e5 balenad[766]: time="2019-11-11T22:31:28.011627348Z" level=info msg="Attempting next endpoint for pull after error: Get https://registry2.balena-cloud.com/v2/: dial tcp: lookup registry2.balena-cloud.com on 127.0.0.2:53: server misbehaving"
Nov 11 22:31:28 121a4e5 balenad[766]: time="2019-11-11T22:31:28.011660678Z" level=error msg="Handler for POST /images/create returned error: Get https://registry2.balena-cloud.com/v2/: dial tcp: lookup registry2.balena-cloud.com on 127.0.0.2:53: server misbehaving"

That's most likely why the application update wasn't working. What happens after that is summarised in these logs:

Nov 11 22:31:50 121a4e5 balenad[766]: time="2019-11-11T22:31:50.259685061Z" level=info msg="Container 5300528fca1ceb38bffb9da8fa15155692c9bf1e044be9cc6013f7cb03816f31 failed to exit within 10 seconds of signal 15 - using the force"
Nov 11 22:31:50 121a4e5 balenad[766]: time="2019-11-11T22:31:50.387192484Z" level=info msg="shim reaped" id=5300528fca1ceb38bffb9da8fa15155692c9bf1e044be9cc6013f7cb03816f31
Nov 11 22:31:50 121a4e5 balenad[766]: time="2019-11-11T22:31:50.396827238Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 11 22:31:50 121a4e5 balenad[766]: time="2019-11-11T22:31:50.460166426Z" level=info msg="Pull session cancelled"
Nov 11 22:31:50 121a4e5 balenad[766]: time="2019-11-11T22:31:50.460242399Z" level=error msg="Not continuing with pull after error: context canceled"
Nov 11 22:31:50 121a4e5 balenad[766]: time="2019-11-11T22:31:50.460450504Z" level=warning msg="failed to download layer: \"context canceled\", retrying to read again"
Nov 11 22:31:50 121a4e5 balenad[766]: time="2019-11-11T22:31:50.972726025Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/5300528fca1ceb38bffb9da8fa15155692c9bf1e044be9cc6013f7cb03816f31/shim.sock debug=false pid=2448
Nov 11 22:31:51 121a4e5 balenad[766]: time="2019-11-11T22:31:51.010573052Z" level=warning msg="failed to download layer: \"context canceled\", retrying to read again"
Nov 11 22:31:51 121a4e5 balenad[766]: time="2019-11-11T22:31:51.020826945Z" level=info msg="shim reaped" id=5300528fca1ceb38bffb9da8fa15155692c9bf1e044be9cc6013f7cb03816f31
Nov 11 22:31:51 121a4e5 balenad[766]: time="2019-11-11T22:31:51.030349836Z" level=error msg="stream copy error: reading from a closed fifo"
Nov 11 22:31:51 121a4e5 balenad[766]: time="2019-11-11T22:31:51.030366882Z" level=error msg="stream copy error: reading from a closed fifo"
Nov 11 22:31:51 121a4e5 balenad[766]: time="2019-11-11T22:31:51.069414626Z" level=error msg="5300528fca1ceb38bffb9da8fa15155692c9bf1e044be9cc6013f7cb03816f31 cleanup: failed to delete container from containerd: no such container"
Nov 11 22:31:51 121a4e5 balenad[766]: time="2019-11-11T22:31:51.069448505Z" level=error msg="Handler for POST /v1.39/containers/5300528fca1ceb38bffb9da8fa15155692c9bf1e044be9cc6013f7cb03816f31/start returned error: OCI runtime create failed: container_linux.go:345: starting container process caused \"process_linux.go:275: applying cgroup configuration for process caused \\\"Process with ID 2462 does not exist.\\\"\": unknown"

... which eventually leads to balenEngine being SIGABRT'd by the healthdog

Nov 11 22:37:26 121a4e5 systemd[1]: balena.service: Watchdog timeout (limit 6min)!
Nov 11 22:37:26 121a4e5 systemd[1]: balena.service: Killing process 766 (balenad) with signal SIGABRT.
Nov 11 22:37:26 121a4e5 systemd[1]: balena.service: Killing process 771 (exe) with signal SIGABRT.
Nov 11 22:37:26 121a4e5 systemd[1]: balena.service: Killing process 803 (balena-engine-c) with signal SIGABRT.
Nov 11 22:37:26 121a4e5 systemd[1]: balena.service: Killing process 1022 (balena-engine-c) with signal SIGABRT.
Nov 11 22:37:26 121a4e5 systemd[1]: balena.service: Killing process 1039 (balena-engine-c) with signal SIGABRT.
Nov 11 22:37:26 121a4e5 balenad[766]: SIGABRT: abort
Nov 11 22:37:26 121a4e5 balenad[766]: PC=0x460df1 m=0 sigcode=0
Nov 11 22:37:26 121a4e5 balenad[766]: goroutine 5 [syscall]:
Nov 11 22:37:26 121a4e5 balenad[766]: runtime.notetsleepg(0x269fa20SIGABRT: abort,
Nov 11 22:37:26 121a4e5 balenad[766]: 0xffffffffffffffffPC=, 0x460df10xffffffffffffffff m=0 sigcode=0)
Nov 11 22:37:26 121a4e5 balenad[766]:

After that, containerd/engine won't come back healthy:

Nov 11 22:37:27 121a4e5 balenad[6087]: time="2019-11-11T22:37:27.641423431Z" level=info msg="Removing stale sandbox 0bf33e9651acc38478dcc30b6350e2541054506585a258e49cbcc3b4a0e76716 (0a3da4cc75abe3b516343b754dbbfbbb7084916cc65b11d19f58c898e1f037f2)"
Nov 11 22:37:27 121a4e5 balenad[6087]: time="2019-11-11T22:37:27.645279724Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 3b35a9feb8c552f7c600a4062d06c1b42f78ecebd9ed706663da4820a5f83b7a c02a7a09485693b4025009558eea8b40d2fcf018f4f0a17d7196d685a8e692ac], retrying...."
Nov 11 22:37:27 121a4e5 balenad[6087]: time="2019-11-11T22:37:27.687019538Z" level=info msg="Removing stale sandbox 7e665417789347ce2a16b8107d83aa40978090dfbfb7dba3475fec775dc6d4b1 (62ca881a2b52e92d7ed285ee8c587a2a1a046b2b2a0284b64f89564b86fb029e)"
Nov 11 22:37:27 121a4e5 balenad[6087]: time="2019-11-11T22:37:27.689050881Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 3b35a9feb8c552f7c600a4062d06c1b42f78ecebd9ed706663da4820a5f83b7a 087aca77e19e35783738907cb353514751715f72bf5cd92ce021c6faafee0e31], retrying...."
Nov 11 22:37:27 121a4e5 balenad[6087]: time="2019-11-11T22:37:27.814958614Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/0a3da4cc75abe3b516343b754dbbfbbb7084916cc65b11d19f58c898e1f037f2/shim.sock debug=false pid=6331
Nov 11 22:37:27 121a4e5 balenad[6087]: time="2019-11-11T22:37:27.837139863Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/5300528fca1ceb38bffb9da8fa15155692c9bf1e044be9cc6013f7cb03816f31/shim.sock debug=false pid=6341
Nov 11 22:37:27 121a4e5 balenad[6087]: time="2019-11-11T22:37:27.842209447Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/62ca881a2b52e92d7ed285ee8c587a2a1a046b2b2a0284b64f89564b86fb029e/shim.sock debug=false pid=6342
Nov 11 22:38:56 121a4e5 systemd[1]: balena.service: Start operation timed out. Terminating.
Nov 11 22:38:56 121a4e5 balenad[6087]: time="2019-11-11T22:38:56.945500157Z" level=info msg="Processing signal 'terminated'"
Nov 11 22:38:56 121a4e5 balenad[6087]: time="2019-11-11T22:38:56.948757033Z" level=error msg="failed to kill shim" error="context canceled: unknown"
Nov 11 22:38:56 121a4e5 balenad[6087]: time="2019-11-11T22:38:56.949585214Z" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=plugins.moby
Nov 11 22:38:56 121a4e5 balenad[6087]: time="2019-11-11T22:38:56.949913739Z" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=plugins.moby
Nov 11 22:38:56 121a4e5 balenad[6
`… about 88% shown.`
[](https://www.balena-cloud.com?hidden=whisper&source=front&flow=inb_nla&thread=cnv_3ola32b&hmac=b9d309f767d5bbfad4b11fbb2468636f5eb8fc6a5e27796292b350982b4c6625)
balena-ci commented 4 years ago

[pdcastro] This issue has attached support thread https://jel.ly.fish/#/support-thread~f84739ea-4bb1-4387-baa5-b8459ce7d5d6

jellyfish-bot commented 4 years ago

[lucianbuzzo] This issue has attached support thread https://jel.ly.fish/2567905c-cd9d-42e4-bb8b-66e9eab11adb

LucianBuzzo commented 4 years ago

A user has reported the same issue on the following device: Device type: Raspberry Pi 3 OS version: balenaOS 2.47.0+rev1 Supervisor version: 10.6.27

Here is the log output https://gist.github.com/Roger/b82a219489a73552392bc6764868c99f

jellyfish-bot commented 4 years ago

[kaisoz] This issue has attached support thread https://jel.ly.fish/00eb50cb-124a-4d7d-a27e-d9e2e08d4d6a

kaisoz commented 4 years ago

I've seen this error happening also on this device:

Application url: https://dashboard.balena-cloud.com/apps/1053081 Device URL: https://dashboard.balena-cloud.com/devices/429d927a405d2142400cf2ccff9f8b63 Device type: Raspberry Pi 3 OS version: balenaOS 2.47.0+rev1 Supervisor version: 10.6.27