balena-os / balena-engine

Moby-based Container Engine for Embedded, IoT, and Edge uses
https://www.balena.io
Apache License 2.0

Balena engine crashes when applying deltas under high CPU load #263

Closed alexgg closed 2 years ago

alexgg commented 3 years ago

Description

When applying an application update on an RPI0 under heavy load (CPU utilization between 85% and 100%), the engine crashes with:

Aug 11 13:04:00 d3adeb5 balenad[22182]: panic: runtime error: slice bounds out of range
Aug 11 13:04:00 d3adeb5 balenad[22182]: goroutine 562 [running]:
Aug 11 13:04:00 d3adeb5 balenad[22182]: github.com/docker/docker/pkg/ioutils.(*concatReadSeekCloser).Read(0x46951d0, 0x510ba00, 0x200, 0x200, 0x11af2e0, 0x1, 0x510ba00)
Aug 11 13:04:00 d3adeb5 balenad[22182]:         /yocto/resin-board/build/tmp/work/arm1176jzfshf-vfp-poky-linux-gnueabi/balena/19.03.18+git840aacc77b6c600b3b929fe9e4d9356a322b9e5b-r0/git/src/import/.gopath/src/github.com/docker/docker/pkg/ioutils/concat.go:68 +0x328
Aug 11 13:04:00 d3adeb5 balenad[22182]: io.(*LimitedReader).Read(0x376d100, 0x510ba00, 0x200, 0x200, 0x0, 0x50e6700, 0x1b1bc)
Aug 11 13:04:00 d3adeb5 balenad[22182]:         /usr/lib/go/src/io/io.go:448 +0xc4
Aug 11 13:04:00 d3adeb5 balenad[22182]: io.copyBuffer(0x16bb1a8, 0x3b4a7b0, 0x16bb160, 0x376d100, 0x510ba00, 0x200, 0x200, 0x11ff2a0, 0x127e220, 0x0, ...)
Aug 11 13:04:00 d3adeb5 balenad[22182]:         /usr/lib/go/src/io/io.go:402 +0xd8
Aug 11 13:04:00 d3adeb5 balenad[22182]: io.Copy(...)
Aug 11 13:04:00 d3adeb5 balenad[22182]:         /usr/lib/go/src/io/io.go:364
Aug 11 13:04:00 d3adeb5 balenad[22182]: io.CopyN(0x16bb1a8, 0x3b4a7b0, 0xa52257c0, 0x46951d0, 0x200, 0x0, 0x0, 0x0, 0x0, 0x0)
Aug 11 13:04:00 d3adeb5 balenad[22182]:         /usr/lib/go/src/io/io.go:340 +0x8c
Aug 11 13:04:00 d3adeb5 balenad[22182]: github.com/docker/docker/vendor/github.com/balena-os/librsync-go.Patch(0xa5220c20, 0x46951d0, 0x16b8d18, 0x30ce480, 0x16bb1a8, 0x3b4a7b0, 0x0, 0x0)
Aug 11 13:04:00 d3adeb5 balenad[22182]:         /yocto/resin-board/build/tmp/work/arm1176jzfshf-vfp-poky-linux-gnueabi/balena/19.03.18+git840aacc77b6c600b3b929fe9e4d9356a322b9e5b-r0/git/src/import/.gopath/src/github.com/docker/docker/vendor/github.com/balena-os/librsync-go/patch.go:83 +0x1c0
Aug 11 13:04:00 d3adeb5 balenad[22182]: github.com/docker/docker/distribution/xfer.(*LayerDownloadManager).makeDownloadFunc.func1.1.2(0x16c07a0, 0x36b8d50, 0x39204e0, 0x3b4a7b0, 0xa5220c20, 0x46951d0)
Aug 11 13:04:00 d3adeb5 balenad[22182]:         /yocto/resin-board/build/tmp/work/arm1176jzfshf-vfp-poky-linux-gnueabi/balena/19.03.18+git840aacc77b6c600b3b929fe9e4d9356a322b9e5b-r0/git/src/import/.gopath/src/github.com/docker/docker/distribution/xfer/download.go:368 +0x12c
Aug 11 13:04:00 d3adeb5 balenad[22182]: created by github.com/docker/docker/distribution/xfer.(*LayerDownloadManager).makeDownloadFunc.func1.1
Aug 11 13:04:00 d3adeb5 balenad[22182]:         /yocto/resin-board/build/tmp/work/arm1176jzfshf-vfp-poky-linux-gnueabi/balena/19.03.18+git840aacc77b6c600b3b929fe9e4d9356a322b9e5b-r0/git/src/import/.gopath/src/github.com/docker/docker/distribution/xfer/download.go:358 +0x1024
Aug 11 13:04:00 d3adeb5 systemd[1]: balena.service: Main process exited, code=exited,

Steps to reproduce the issue:

  1. Use an RPI0 and run a stress test application that generates a high CPU load
  2. Perform an application update via balena cloud
  3. The update typically takes a long time to complete, but occasionally the engine fails with the panic above

Describe the results you received: The engine restarts and the update has to start from scratch, so it either takes a very long time to complete or never does.

Describe the results you expected: There is no panic and the update succeeds.

Additional information you deem important (e.g. issue happens only occasionally): See https://jel.ly.fish/c7b7fa69-6338-4458-ab56-685e4e0feb5d

Output of balena-engine version:

https://jel.ly.fish/c7b7fa69-6338-4458-ab56-685e4e0feb5d

Output of balena-engine info:

Client:
 Debug Mode: false

Server:
 Containers: 3
  Running: 2
  Paused: 0
  Stopped: 1
 Images: 5
 Server Version: 19.03.18
 Storage Driver: aufs
  Root Dir: /var/lib/docker/aufs
  Backing Filesystem: extfs
  Dirs: 73
  Dirperm1 Supported: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm: 
  NodeID: 
  Is Manager: false
  Node Address: 
 Runtimes: bare runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version: 
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: 949e6fa-dirty (expected: fec3683b971d9)
 Kernel Version: 5.4.83
 Operating System: balenaOS 2.80.8+rev2
 OSType: linux
 Architecture: armv6l
 CPUs: 1
 Total Memory: 478MiB
 Name: d3adeb5
 ID: IYIH:T6OJ:GKER:7QFE:7FJH:B35Z:2JCL:EQKY:NG6V:UF4K:N2T7:BZZX
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: API is accessible on http://0.0.0.0:2375 without encryption.
         Access to the remote API is equivalent to root access on the host. Refer
         to the 'Docker daemon attack surface' section in the documentation for
         more information: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
WARNING: No swap limit support
WARNING: No cpuset support
WARNING: the aufs storage-driver is deprecated, and will be removed in a future release.

Additional environment details (device type, OS, etc.):

ID="balena-os"
NAME="balenaOS"
VERSION="2.80.8+rev2"
VERSION_ID="2.80.8+rev2"
PRETTY_NAME="balenaOS 2.80.8+rev2"
MACHINE="raspberrypi"
VARIANT="Development"
VARIANT_ID=dev
META_BALENA_VERSION="2.80.8"
BALENA_BOARD_REV="418629c"
META_BALENA_REV="0315fc1f"
SLUG="raspberry-pi"

jellyfish-bot commented 3 years ago

[alexgg] This issue has attached support thread https://jel.ly.fish/c7b7fa69-6338-4458-ab56-685e4e0feb5d

Hades32 commented 3 years ago

@robertgzr that crashing line looks strange to begin with: https://github.com/balena-os/balena-engine/blame/47b47a653a44d228d70baa99469b0c36c547889c/pkg/ioutils/concat.go#L68

jellyfish-bot commented 3 years ago

[nghiant2710] This issue has attached support thread https://jel.ly.fish/72ba67af-9091-4074-a489-861b8735230d

alexgg commented 3 years ago

Detected twice, on 2.80.8+rev2 (raspberry-pi) and 2.51.1+rev1 (fincm3). So even though this is an engine crash, it does not seem to be a regression but maybe a peculiarity of the delta data. Also noteworthy: both are 32-bit ARM RPI devices.

jellyfish-bot commented 3 years ago

[alexgg] This issue has attached support thread https://jel.ly.fish/002624d2-3d0b-4ea1-969a-6657519351e9

jellyfish-bot commented 2 years ago

[klutchell] This issue has attached support thread https://jel.ly.fish/88aadc7a-408c-44f7-8564-f18de52129a5

jellyfish-bot commented 2 years ago

[lmbarros] This issue has attached support thread https://jel.ly.fish/842063d6-c080-43f6-b53d-afef4c8d1626

lmbarros commented 2 years ago

I think the problem is here. On a 32-bit platform with a large enough delta, (self.aSize - self.off) will not fit into a (32-bit) int. (The fix might not be that straightforward, because Go slices are indexed with ints and limited in size to whatever fits in an int on the particular platform.)
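
To illustrate the suspected overflow, here is a minimal sketch. It is not the actual code from concat.go: the function names clampUnsafe and clampSafe are hypothetical, int32 stands in for int on a 32-bit platform, and the 3 GiB remaining size is an assumed value chosen only to exceed the int32 range. The 512-byte buffer matches the 0x200 length seen in the trace.

```go
package main

import "fmt"

// clampUnsafe mimics the suspected pattern: the remaining byte count is
// converted to the platform int first and compared afterwards.
// int32 stands in for int on 32-bit ARM (assumption for illustration).
func clampUnsafe(bufLen int32, remaining int64) int32 {
	n := int32(remaining) // silently wraps when remaining > math.MaxInt32
	if bufLen < n {
		return bufLen
	}
	return n
}

// clampSafe compares in the int64 domain and only then narrows, so the
// result always lies in [0, bufLen] as long as remaining is non-negative.
func clampSafe(bufLen int32, remaining int64) int32 {
	if remaining < int64(bufLen) {
		return int32(remaining)
	}
	return bufLen
}

func main() {
	remaining := int64(3) << 30 // 3 GiB still to read (hypothetical size)
	var bufLen int32 = 512      // 0x200, as in the stack trace

	fmt.Println(clampUnsafe(bufLen, remaining)) // -1073741824
	fmt.Println(clampSafe(bufLen, remaining))   // 512
}
```

Using a negative value such as the first result as a slice bound, for example p[:n], triggers a "slice bounds out of range" runtime panic in Go, which would be consistent with the trace above. Clamping before narrowing keeps the bound within the buffer length, although it does not address the wider limitation that slices themselves cannot exceed the platform int size.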