balena-os / balenahup

BALENA Host os UPdater
https://balena.io/

balenahup can fail to pull the image, but still run hooks, soft-bricking Jetson devices #365

Open jakogut opened 3 years ago

jakogut commented 3 years ago

During a host OS update (HUP), a hostapp image is pulled, then the hostapp-update hooks are run, one of which writes a new partition table on Jetson boards. In some cases, a network outage can cause the pull to fail without the HUP exiting, so the hooks still run. This can leave the device unbootable.

================upgrade-2.x.sh HEADER START====================
Thu Oct  7 18:58:48 UTC 2021
[000000001][LOG]Loading info from config.json
[000000002][LOG]Target version supports hostapps, no device type support check required.
[000000002][LOG]Target OS version "2.73.1+rev4" OK.
[000000002][LOG]OS variant: 2.45.1
[000000002][LOG]Host OS version "2.45.1+rev3" OK.
[000000002][LOG]Attempting host OS update using deltas
[000000007][LOG]Found delta image: registry2.balena-cloud.com/v2/2fba5b35247c586f0df32a485f6f9d23:delta-ca81a0ca3ed65a84, size: 295 MB
[000000007][LOG]No resin-device-progress fix is required...
[000000007][LOG]No supervisor updater fix is required...
[000000007][LOG]hostapp-update command exists, use that for update
[000000010][LOG]Running pre-update fixes for jetson-tx2
[000000010][LOG]Caching current extlinux.conf for jetson-tx2 fix
[000000010][LOG]Stopping supervisor to prevent reboots during extlinux.conf updating
[000000010][LOG]Stopping supervisor and related services...
[000000021][LOG]Starting hostapp-update
delta-ca81a0ca3ed65a84: Pulling from v2/2fba5b35247c586f0df32a485f6f9d23
18f1e34412a5: Pulling fs layer
18f1e34412a5: Ready to download
failed to register layer: Error processing tar file(exit status 1): unexpected EOF
[000000375][LOG]Image type delta, location 'registry2.balena-cloud.com/v2/2fba5b35247c586f0df32a485f6f9d23:delta-ca81a0ca3ed65a84' failed or not found, trying another source
[000000375][LOG]Running pre-update fixes for jetson-tx2
[000000375][LOG]Caching current extlinux.conf for jetson-tx2 fix
[000000375][LOG]Stopping supervisor to prevent reboots during extlinux.conf updating
[000000375][LOG]Stopping supervisor and related services...
[000000375][LOG]Starting hostapp-update
Error response from daemon: Get https://registry2.balena-cloud.com/v2/: dial tcp: lookup registry2.balena-cloud.com on 127.0.0.2:53: server misbehaving
[000000376][LOG]Image type balena_registry, location 'registry2.balena-cloud.com/v2/2fba5b35247c586f0df32a485f6f9d23@sha256:0a8583474a34f363bec1b8a2046c3b10a2516770627f840a0a7e237f9609b4a2' failed or not found, trying another source
[000000376][ERROR]all hostapp-update attempts have failed...
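The behaviour this issue asks for can be illustrated with a minimal sketch: the hooks run only after some image source has been pulled successfully, and if every source fails (as in the log above), the update stops before anything touches the partition table. `try_pull`, `run_hooks` and `update_hostos` are hypothetical stand-ins, not functions from upgrade-2.x.sh or hostapp-update, whose internals are not shown in this issue.

```bash
#!/usr/bin/env bash
# Sketch only: try_pull, run_hooks and update_hostos are hypothetical
# stand-ins, not functions from upgrade-2.x.sh or hostapp-update.

# Hypothetical pull: succeeds only for sources listed in PULL_OK.
try_pull() {
    local source="$1"
    case " ${PULL_OK:-} " in
        *" $source "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical hook runner. On a Jetson board this is where the new
# partition table would be written, so it must never run without a new image.
run_hooks() {
    echo "running hooks for $1"
}

# Try each source in order; run hooks only after a successful pull.
update_hostos() {
    local source
    for source in "$@"; do
        if try_pull "$source"; then
            run_hooks "$source"
            return 0
        fi
        echo "[LOG] source '$source' failed or not found, trying another source" >&2
    done
    echo "[ERROR] all hostapp-update attempts have failed, not running hooks" >&2
    return 1
}
```

For example, `PULL_OK="" update_hostos delta balena_registry` logs both failures and returns 1 without running any hooks, while `PULL_OK="balena_registry" update_hostos delta balena_registry` falls back from the delta and runs the hooks once.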
jellyfish-bot commented 3 years ago

[jakogut] This issue has attached support thread https://jel.ly.fish/41dd3d5f-1d03-41c3-ac3f-f6249b107778

acostach commented 3 years ago

@jakogut, from the attached log, no hooks appear to have run. For soft-bricking to happen, the new OS hooks must first run to completion and then the old OS hooks must run, and neither step is visible in this log.

This issue is addressed by https://github.com/balena-os/balenahup/pull/363

acostach commented 3 years ago

Hi @jakogut, I ran 3000 attempts on two boards and could not reproduce the issue where hooks run and soft-brick the device. I think we should move this one to balenahup to address retries when the image download fails. What do you think?
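Retrying the image download before giving up could look like the following minimal sketch. `with_retries` is a hypothetical helper, and the default attempt count and delays (`MAX_ATTEMPTS`, `RETRY_DELAY`) are illustrative assumptions, not values taken from balenahup.

```bash
#!/usr/bin/env bash
# Sketch only: with_retries is a hypothetical helper, not part of balenahup.
# Runs a command up to MAX_ATTEMPTS times (default 5), doubling the delay
# after each failure, and returns the last exit status if every attempt fails.
with_retries() {
    local max_attempts="${MAX_ATTEMPTS:-5}"
    local delay="${RETRY_DELAY:-2}"
    local attempt=1 status
    while true; do
        "$@" && return 0
        status=$?
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "[ERROR] '$*' failed after $attempt attempts" >&2
            return "$status"
        fi
        echo "[LOG] attempt $attempt of '$*' failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done
}
```

Wrapping the pull step this way would let a transient DNS failure, such as the `server misbehaving` error in the log, recover on a later attempt. The caller still has to check the final exit status and abort before running any hooks if every retry fails.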

jakogut commented 2 years ago

@acostach Sounds conclusive to me, let's move it.