balena-os / balena-engine

Moby-based Container Engine for Embedded, IoT, and Edge uses
https://www.balena.io
Apache License 2.0

Daemon errors with `(HTTP code 404) -- no such container: sandbox` #261

Open · cywang117 opened this issue 3 years ago

cywang117 commented 3 years ago

NOTE: For users and support agents arriving here in the future: since it is not clear how to reproduce this issue, please gather more information about the conditions on the device. Some good starting questions and things to check:

  • Did this error appear after a release update?
  • Are deltas enabled?
  • Does the release build use intermediate containers? (If unsure, the Dockerfile(s) of the containers will tell you.)
  • Any other questions you think might be relevant.

Asking the user whether they would mind leaving the device in this invalid state for engineers to investigate would also help, if the user is okay with that.
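For concreteness, a possible starting set of host OS commands for gathering this information (a sketch; the unit and CLI names are the ones that appear elsewhere in this thread, and `<failing-container-id>` is a placeholder):

balena ps -a                                        # containers the engine knows about
balena inspect <failing-container-id>               # the engine's view of the failing container
journalctl -u balena-supervisor -n 200 --no-pager   # recent Supervisor logs
journalctl -u balena -n 200 --no-pager              # recent engine logs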

Description

balenaEngine daemon errors with `(HTTP code 404) -- no such container: sandbox`. However, there is no sandbox container on the device. The device Supervisor surfaces this error in the journal logs as:

Device state apply error Error: Failed to apply state transition steps. (HTTP code 404) no such container - sandbox 915c9f1f78712e9db8bb1edf3d94fd669a917c608270f4c95e3a8c72de142b15 not found Steps:["updateMetadata"]

Per https://github.com/balena-io/balena-io/issues/1684, this might be due to bad internal state for one of the containers on the device. The issue is fixed by restarting balenaEngine with `systemctl restart balena`, or with `systemctl stop balena-supervisor && balena stop $(balena ps -a -q) && balena rm $(balena ps -a -q) && systemctl start balena-supervisor`; however, neither is ideal, as the containers experience a few minutes of downtime.
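Spelled out one command per line, the two workarounds described above are (host OS shell):

# Option 1: restart the engine
systemctl restart balena

# Option 2: stop the Supervisor, remove all containers, then restart the Supervisor
systemctl stop balena-supervisor
balena stop $(balena ps -a -q)
balena rm $(balena ps -a -q)
systemctl start balena-supervisor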

It's unclear how to reproduce this issue.

Additional information you deem important (e.g. issue happens only occasionally):

The issue happens when a new update is downloaded by the device. It has sometimes appeared in combination with #1579, making the cause unclear.

Additional environment details (device type, OS, etc.):

Device Type: Raspberry Pi 4 64bit, 2GB RAM
OS: balenaOS 2.80.3+rev1.prod

jellyfish-bot commented 3 years ago

[cywang117] This issue has attached support thread https://jel.ly.fish/72633746-3415-449a-9617-e123cba1e954

jellyfish-bot commented 3 years ago

[cywang117] This issue has attached support thread https://jel.ly.fish/e7428359-c335-4d00-81db-dfb4293d1423

cywang117 commented 3 years ago

The fact that stopping the Supervisor, removing the containers, and starting the Supervisor fixes the issue seems to indicate that this is a Supervisor issue and not a balenaEngine issue. I'll move this to the Supervisor repo.

cywang117 commented 3 years ago

So it seems that just restarting the Supervisor, without removing containers, does not fix this issue, while restarting balenaEngine does. Now I'm unclear whether this is Supervisor related or balenaEngine related. I'm leaning towards balenaEngine holding bad state for one of the containers on the device, since a Supervisor restart didn't do anything.
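To make the distinction concrete (host OS shell, unit names as used elsewhere in this thread):

systemctl restart balena-supervisor   # restarts only the Supervisor; does NOT clear the error
systemctl restart balena              # restarts balenaEngine; clears the error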

jellyfish-bot commented 3 years ago

[cywang117] This issue has attached support thread https://jel.ly.fish/661c8c96-8357-4bfc-9380-308a65fff910

jellyfish-bot commented 3 years ago

[danthegoodman1] This issue has attached support thread https://jel.ly.fish/a4f6be4b-50dc-454d-9c5c-dbcf168119db

cywang117 commented 3 years ago

@lmbarros @robertgzr Drawing your attention to some edits I made to this GitHub issue:

NOTE: For users and support agents arriving here in the future: since it is not clear how to reproduce this issue, please gather more information about the conditions on the device. Some good starting questions and things to check:

  • Did this error appear after a release update?
  • Are deltas enabled?
  • Does the release build use intermediate containers? (If unsure, the Dockerfile(s) of the containers will tell you.)
  • Any other questions you think might be relevant.

Asking the user whether they would mind leaving the device in this invalid state for engineers to investigate would also help, if the user is okay with that.

Are there any other questions you think would be useful in investigating the causes behind this issue? Could this kind of problem be unavoidable, given current implementation limitations in dependencies (Moby)?

jellyfish-bot commented 3 years ago

[pipex] This issue has attached support thread https://jel.ly.fish/dc8d2638-ebb4-4ba8-8ae6-edae48602850

jellyfish-bot commented 3 years ago

[pipex] This issue has attached support thread https://jel.ly.fish/e82fe388-3955-4252-97c4-6c837151cce2

jellyfish-bot commented 2 years ago

[pipex] This issue has attached support thread https://jel.ly.fish/b7fa70df-ad99-4deb-8f6a-2b78d2f47a44

pipex commented 2 years ago

Some extra information for this ticket: this has been reported to happen more often with containers that are not updated as frequently as others. A container that has been renamed a few times while other containers have been recreated may sometimes get into this state.

For instance, on one particular device, the failing container's eth0 has a low interface index (15, with peer if16):

root@4cd008d3ffa1:/opt# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
15: eth0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:ac:11:00:05 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.5/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

The veth interfaces belonging to other containers have much larger indices, confirming that this is an old network attachment:

root@c73b31f:~# ip a | grep veth
1291: veth6f7ff99@if1290: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86 
16: veth367da35@if15: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86 
1380: veth72261a3@if1379: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86 
1180: vethe52f1a4@if1179: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86

Could this issue be an unintended side effect of some cleanup process?
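For anyone who wants to check a device for the same pattern, a rough sketch (it assumes the container images ship iproute2, which not all images do):

# Print the eth0 interface index for every running container; an index far
# below the current veth indices on the host suggests an old network attachment.
for c in $(balena ps -q); do
  echo "== container $c"
  balena exec "$c" ip -o link show eth0
done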

jellyfish-bot commented 2 years ago

[gantonayde] This issue has attached support thread https://jel.ly.fish/1b57a2f7-e2b2-4658-94ef-0a35bef04f4b

jellyfish-bot commented 2 years ago

[pipex] This issue has attached support thread https://jel.ly.fish/bf30fa84-cc92-4cf8-aefd-4c2f14c4a944

jellyfish-bot commented 2 years ago

[nitish] This issue has attached support thread https://jel.ly.fish/9f4bc524-e6d5-4480-98a5-4d2cefba84f3

vipulgupta2048 commented 2 years ago

Did this error appear after a release update? Yep
Are deltas enabled? Yes
Does the release build use intermediate containers? Indeed, 2 stages

Happened on a new device with just the second release I pushed to it, running a minimal server application (200 MB image, 2-stage build process). The error is below:

Jun 02 20:25:35 a01a838 balena-supervisor[2376]: [info]    Applying target state
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error]   Scheduling another update attempt in 1000ms due to failure:  Error: Failed to appl>
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error]         at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error]   Device state apply error Error: Failed to apply state transition steps. (HTTP code>
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error]         at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:37 a01a838 balena-supervisor[2376]: [info]    Applying target state
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error]   Scheduling another update attempt in 2000ms due to failure:  Error: Failed to appl>
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error]         at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error]   Device state apply error Error: Failed to apply state transition steps. (HTTP code>
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error]         at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:40 a01a838 balena-supervisor[2376]: [info]    Applying target state

Attaching diagnostics file: a01a83846e174aa51dc2b33fbf0a17e7_diagnostics_2022.06.02_20.56.19+0000.txt

Adding the outputs of `balena info` and `balena version`:

root@a01a838:~# balena info
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: journald
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm: 
  NodeID: 
  Is Manager: false
  Node Address: 
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version: 
 runc version: 
 init version: 949e6fa-dirty (expected: de40ad007797e)
 Kernel Version: 5.10.83-v8
 Operating System: balenaOS 2.94.4
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 960MiB
 Name: a01a838
 ID: V47H:PCFQ:GMDT:PV3S:OW2J:FRXS:MRZ7:V737:5HEQ:BFCP:GBUS:SJOJ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support
root@a01a838:~# balena version
Client:
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.16.2
 Git commit:        73c78258302d94f9652da995af6f65a621fac918
 Built:             Wed Mar  2 10:28:01 2022
 OS/Arch:           linux/arm64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.2
  Git commit:       73c78258302d94f9652da995af6f65a621fac918
  Built:            Wed Mar  2 10:28:01 2022
  OS/Arch:          linux/arm64
  Experimental:     true
 containerd:
  Version:          1.4.0+unknown
  GitCommit:        
 runc:
  Version:          spec: 1.0.2-dev
  GitCommit:        
 balena-engine-init:
  Version:          0.13.0
  GitCommit:        949e6fa-dirty

FD: https://www.flowdock.com/app/rulemotion/r-supervisor/threads/FQqETXXQaGFg1oLyWz7ccNbPgAx

jellyfish-bot commented 2 years ago

[lmbarros] This issue has attached support thread https://jel.ly.fish/88b86997-9411-40b9-ae2f-8f3505febb93

jellyfish-bot commented 1 year ago

[pipex] This issue has attached support thread https://jel.ly.fish/c09369f0-c870-4f93-9133-0ec8b995fda9