cywang117 opened this issue 3 years ago
[cywang117] This issue has attached support thread https://jel.ly.fish/72633746-3415-449a-9617-e123cba1e954
[cywang117] This issue has attached support thread https://jel.ly.fish/e7428359-c335-4d00-81db-dfb4293d1423
The fact that stopping the Supervisor, removing the containers, and starting the Supervisor fixes the issue seems to indicate that this is a Supervisor issue and not a balenaEngine issue. I'll move this to the Supervisor repo.
So it seems that just restarting the Supervisor without removing containers does not fix this issue; however, restarting balenaEngine does. Now I'm unclear whether this is Supervisor related or balenaEngine related. I'm leaning towards balenaEngine holding bad state for one of the containers on the device, since a Supervisor restart didn't do anything.
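For reference, the two restarts map to the following commands on balenaOS (unit names as on stock balenaOS images):

# Restarting only the Supervisor -- reportedly does NOT clear the error:
systemctl restart balena-supervisor

# Restarting balenaEngine -- reportedly clears the error, at the cost of
# restarting every container on the device:
systemctl restart balena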
[cywang117] This issue has attached support thread https://jel.ly.fish/661c8c96-8357-4bfc-9380-308a65fff910
[danthegoodman1] This issue has attached support thread https://jel.ly.fish/a4f6be4b-50dc-454d-9c5c-dbcf168119db
@lmbarros @robertgzr Drawing your attention to some edits I made to this GitHub issue:
NOTE: For users and support agents arriving here in the future: since it's not clear how we can reproduce this issue, please find out more information about various conditions on the device. Some good starting questions and things to check:
- Did this error appear after a release update?
- Are deltas enabled?
- Does the release build use intermediate containers? (If not sure, the Dockerfile(s) of the containers will tell you; look for multi-stage builds.)
- Any other questions which you think might be relevant.
If the user is okay with it, asking them to leave the device in this invalid state for engineers to investigate would also help.
Are there any other questions you think would be useful for investigating the causes behind this issue? Could this kind of problem be unavoidable given current implementation limitations in dependencies (Moby)?
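To make gathering the above easier, here is a minimal sketch of commands a support agent could run on an affected device (assuming balenaOS, with journalctl and the balena CLI available):

# Supervisor errors mentioning the missing sandbox:
journalctl -u balena-supervisor --no-pager | grep -i "no such container"

# Containers and images currently known to the engine:
balena ps -a
balena images

# Networks and the endpoints the engine believes are attached:
balena network ls
balena network inspect $(balena network ls -q)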
[pipex] This issue has attached support thread https://jel.ly.fish/dc8d2638-ebb4-4ba8-8ae6-edae48602850
[pipex] This issue has attached support thread https://jel.ly.fish/e82fe388-3955-4252-97c4-6c837151cce2
[pipex] This issue has attached support thread https://jel.ly.fish/b7fa70df-ad99-4deb-8f6a-2b78d2f47a44
Some extra information for this ticket: this has been reported to happen more often with containers that are not updated as frequently as others. So a container that has been renamed a few times while others have been recreated may sometimes get into this state.
For instance, on one particular device, the failing container's eth0 shows a low interface index (15, paired with host veth index 16):
root@4cd008d3ffa1:/opt# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
15: eth0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:05 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.5/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
Other veth interfaces on the host have much larger interface indices, confirming that this container's network endpoint is old:
root@c73b31f:~# ip a | grep veth
1291: veth6f7ff99@if1290: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86
16: veth367da35@if15: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86
1380: veth72261a3@if1379: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86
1180: vethe52f1a4@if1179: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86
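A quick way to compare endpoint ages across containers (a sketch, assuming each container's primary interface is eth0 and that the container image provides cat):

# Print each container's name, creation time, and the host-side veth
# index of its eth0; unusually low indices point at old endpoints.
for c in $(balena ps -q); do
  meta=$(balena inspect --format '{{.Name}} (created {{.Created}})' "$c")
  idx=$(balena exec "$c" cat /sys/class/net/eth0/iflink 2>/dev/null)
  echo "$meta -> host veth ifindex $idx"
done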
Could this issue be an unintended side effect of some cleanup process?
[gantonayde] This issue has attached support thread https://jel.ly.fish/1b57a2f7-e2b2-4658-94ef-0a35bef04f4b
[pipex] This issue has attached support thread https://jel.ly.fish/bf30fa84-cc92-4cf8-aefd-4c2f14c4a944
[nitish] This issue has attached support thread https://jel.ly.fish/9f4bc524-e6d5-4480-98a5-4d2cefba84f3
- Did this error appear after a release update? Yep
- Are deltas enabled? Yes
- Does the release build use intermediate containers? Indeed, 2 stages
Happened on a new device with just the second release I pushed to it, running a minimal server application (200 MB image, 2-stage build process). The error is below:
Jun 02 20:25:35 a01a838 balena-supervisor[2376]: [info] Applying target state
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error] Scheduling another update attempt in 1000ms due to failure: Error: Failed to appl>
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error] at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error] Device state apply error Error: Failed to apply state transition steps. (HTTP code>
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error] at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:37 a01a838 balena-supervisor[2376]: [info] Applying target state
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error] Scheduling another update attempt in 2000ms due to failure: Error: Failed to appl>
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error] at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error] Device state apply error Error: Failed to apply state transition steps. (HTTP code>
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error] at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:40 a01a838 balena-supervisor[2376]: [info] Applying target state
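As a side note, the trailing '>' in these lines is journalctl truncating long lines at the terminal width; the full error text can be captured with:

journalctl -u balena-supervisor --no-pager | grep -B1 -A2 "apply error"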
Attaching diagnostics file: a01a83846e174aa51dc2b33fbf0a17e7_diagnostics_2022.06.02_20.56.19+0000.txt
Adding outputs of the commands balena info and balena version:
root@a01a838:~# balena info
Client:
 Context: default
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: journald
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm:
  NodeID:
  Is Manager: false
  Node Address:
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version:
 runc version:
 init version: 949e6fa-dirty (expected: de40ad007797e)
 Kernel Version: 5.10.83-v8
 Operating System: balenaOS 2.94.4
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 960MiB
 Name: a01a838
 ID: V47H:PCFQ:GMDT:PV3S:OW2J:FRXS:MRZ7:V737:5HEQ:BFCP:GBUS:SJOJ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support
root@a01a838:~# balena version
Client:
 Version: 20.10.12
 API version: 1.41
 Go version: go1.16.2
 Git commit: 73c78258302d94f9652da995af6f65a621fac918
 Built: Wed Mar 2 10:28:01 2022
 OS/Arch: linux/arm64
 Context: default
 Experimental: true

Server:
 Engine:
  Version: 20.10.12
  API version: 1.41 (minimum version 1.12)
  Go version: go1.16.2
  Git commit: 73c78258302d94f9652da995af6f65a621fac918
  Built: Wed Mar 2 10:28:01 2022
  OS/Arch: linux/arm64
  Experimental: true
 containerd:
  Version: 1.4.0+unknown
  GitCommit:
 runc:
  Version: spec: 1.0.2-dev
  GitCommit:
 balena-engine-init:
  Version: 0.13.0
  GitCommit: 949e6fa-dirty
FD: https://www.flowdock.com/app/rulemotion/r-supervisor/threads/FQqETXXQaGFg1oLyWz7ccNbPgAx
[lmbarros] This issue has attached support thread https://jel.ly.fish/88b86997-9411-40b9-ae2f-8f3505febb93
[pipex] This issue has attached support thread https://jel.ly.fish/c09369f0-c870-4f93-9133-0ec8b995fda9
Description
balenaEngine daemon errors with (HTTP code 404) -- no such container: sandbox. However, there is no sandbox container on the device. This error is surfaced by the device Supervisor in the journal logs as:

Device state apply error Error: Failed to apply state transition steps. (HTTP code 404) no such container - sandbox 915c9f1f78712e9db8bb1edf3d94fd669a917c608270f4c95e3a8c72de142b15 not found Steps:["updateMetadata"]
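The Steps:["updateMetadata"] part suggests the failure is triggered while the Supervisor renames a container to update its metadata. If so, the same engine-side error can presumably be provoked manually (a hypothetical check; service_name is a placeholder for the failing container's name):

# A rename also updates the container's network endpoint, so it may fail
# with the same 404 if the engine's network state for this container is bad:
balena rename service_name service_name_renametest
balena rename service_name_renametest service_name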
Per https://github.com/balena-io/balena-io/issues/1684, this might be due to bad internal state for one of the containers on the device. The issue is fixed by restarting balenaEngine with:

systemctl restart balena

or with:

systemctl stop balena-supervisor && balena stop $(balena ps -a -q) && balena rm $(balena ps -a -q) && systemctl start balena-supervisor

However, neither is ideal, as the containers experience a few minutes of downtime. It's unclear how to reproduce this issue.
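Before applying either workaround, capturing what the engine thinks the sandbox is may help future debugging. A diagnostic sketch (the netns directory is an assumption; balenaEngine may use /var/run/balena-engine/netns instead of the Docker default):

# Compare each container's recorded sandbox against the network
# namespaces actually present on the host:
for c in $(balena ps -aq); do
  balena inspect --format '{{.Name}}: sandbox={{.NetworkSettings.SandboxID}} key={{.NetworkSettings.SandboxKey}}' "$c"
done
ls -l /var/run/balena-engine/netns 2>/dev/null || ls -l /var/run/docker/netns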
Additional information you deem important (e.g. issue happens only occasionally):
Issue happens when a new update is downloaded by the device. It has sometimes appeared in combination with #1579, making the cause unclear.
Additional environment details (device type, OS, etc.):
Device Type: Raspberry Pi 4 64bit, 2GB RAM
OS: balenaOS 2.80.3+rev1.prod