balena-os / balena-supervisor

Balena Supervisor: balena's agent on devices.
https://balena.io

Incorrect Delta Update. #1513

Open splitice opened 3 years ago

splitice commented 3 years ago

We have been seeing this issue for quite a while now, but it's so rare (approx 1/15-20 builds) that it's proven hard to debug.

The issue: The delta update for some releases can be missing files.

The directory on a hub that received a full (non-delta) update:

total 2945
   1 drwxr-xr-x 2 halley halley    1024 Nov  8 23:49 .
   1 drwxr-xr-x 3 halley halley    1024 Nov  8 23:49 ..
   1 -rw-r--r-- 1 halley halley      14 Nov  8 23:49 gateway.version
1158 -rw-r--r-- 1 halley halley 1185289 Nov  8 10:07 gateway.zip
   1 -rw-r--r-- 1 halley halley      14 Nov  8 23:49 nightlight.version
 499 -rw-r--r-- 1 halley halley  510136 Nov  8 09:49 nightlight.zip
   1 -rw-r--r-- 1 halley halley      14 Nov  8 23:49 process_manager.version
 396 -rw-r--r-- 1 halley halley  404662 Nov  8 09:48 process_manager.zip
   1 -rw-r--r-- 1 halley halley      14 Nov  8 23:49 processor.version
 260 -rw-r--r-- 1 halley halley  265350 Nov  8 09:48 processor.zip
   1 -rw-r--r-- 1 halley halley      14 Nov  8 23:49 wellness.version
 625 -rw-r--r-- 1 halley halley  639960 Nov  8 10:41 wellness.zip

The directory on a hub which received a delta update:

total 1823
   2 drwxr-xr-x 1 halley halley    1024 Nov  9 00:27 .
   2 drwxr-xr-x 1 halley halley    1024 Nov  8 23:49 ..
   1 -rw-r--r-- 1 halley halley      14 Nov  8 23:49 gateway.version
1158 -rw-r--r-- 1 halley halley 1185289 Nov  8 10:07 gateway.zip
   1 -rw-r--r-- 1 halley halley      14 Nov  8 23:49 nightlight.version
   1 -rw-r--r-- 1 halley halley      14 Nov  8 23:49 process_manager.version
 396 -rw-r--r-- 1 halley halley  404662 Nov  8 09:48 process_manager.zip
   1 -rw-r--r-- 1 halley halley      14 Nov  8 23:49 processor.version
 260 -rw-r--r-- 1 halley halley  265350 Nov  8 09:48 processor.zip
   1 -rw-r--r-- 1 halley halley      14 Nov  8 23:49 wellness.version

Before the update, wellness.zip on the delta-updated hub matched the file that would have been in the container. All of the zip files in that directory changed with this release.
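For anyone wanting to reproduce the comparison, the directories can be diffed between a full-update hub and a delta-updated hub with something like the following minimal sketch (the hostnames and the data path are placeholders for our actual setup):

# Checksum the payload directory on both hubs and diff the results.
# "hub-full", "hub-delta" and the /data/app path are hypothetical.
ssh halley@hub-full  'cd /data/app && sha256sum *' | sort > full.sums
ssh halley@hub-delta 'cd /data/app && sha256sum *' | sort > delta.sums
diff full.sums delta.sums    # files missing after the delta show up only in full.sums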

We use the balena ARMv7/v8 builders via balena push.

We have confirmed this on supervisor versions 11.5.1 and 10.6.23.

I am not certain if this is a supervisor issue.

splitice commented 3 years ago

Updating back to the previous version (delta) restored the missing files. Updating (again) to the new, previously faulty version (delta) then worked.

So it appears to be part of the update process, not the build process.

splitice commented 3 years ago

Confirmed it's not always the same files. From a different hub:

gateway.version  gateway.zip  nightlight.version  nightlight.zip  process_manager.version  process_manager.zip  processor.version  wellness.version

This one got nightlight.zip but no wellness.zip.

splitice commented 2 years ago

Had another occurrence of this today.

In this case the VPN service was interrupted during the upgrade (it appears to have been an IP routing error, no route to AWS IPs).

We ended up missing a couple of files on disk (files that existed in both the previous version and the new one).

I assume the VPN interruption was the cause (and not an effect), as it continued for a while afterwards.

20k-ultra commented 2 years ago

Hey, you're saying the device (hub) gets all the files when not using deltas, but in rare occurrences the same release downloaded via delta is missing some data (zip files).

Is this correct? I have never heard of this happening and it seems extremely unlikely, but if you're saying it's happening I will be sure to look into it.

splitice commented 2 years ago

We have seen it enough times now that we can be certain of a bug.

We have easily done upwards of 10k delta updates over our time (and seen this only 3 times that I've noticed / can recall), so it's rare for sure.

This is the first time I've observed something like a VPN failure during the update (that may be the cause).

dfunckt commented 2 years ago

@splitice when this happens does it happen across your devices or just some of them?

splitice commented 2 years ago

Just one (at random)

dfunckt commented 2 years ago

If this happens again can you please reach out to support to investigate the specific device? We'd be interested in its usage patterns, update frequency, its environment, etc. It sounds less like a Supervisor/deltas issue and more like some kind of file system blip.

splitice commented 1 year ago

We had this happen again.

I was able to confirm it's on the client/delta side, as a rollback to the previous version and then a re-upgrade to the target version resolved the issue.

Strangely, the files that are missing are again some of the largest on disk.

Since last time we saw this we have done probably 1K or so delta updates.

The balena-side logs are free of any error messages:

Downloading delta for image 'registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a'
Downloaded image 'registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a'
Killing service 'main sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08'

Supervisor also looks good:

[debug]   delta([main] registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a): Delta applied successfully
[event]   Event: Image downloaded {"image":{"name":"registry2.balena-cloud.com/v2/944d017e1c2f579f87bbce877dfde060@sha256:189a5926cc2c3020617e584e519663de6184bfebb2215c925d242f5d17cc0d10","appId":738415,"appUuid":"00fa17057f6c46ec987cabdf90fbe8ef","serviceId":2092,"serviceName":"main","imageId":5847659,"releaseId":2410051,"commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","dependent":0}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[event]   Event: Service kill {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"437832610aa0ae4a9904a0be595cdb11","releaseId":2400727}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[info]    Reported current state to the cloud
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[event]   Event: Service exit {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"437832610aa0ae4a9904a0be595cdb11","releaseId":2400727}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[event]   Event: Service stop {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"437832610aa0ae4a9904a0be595cdb11","releaseId":2400727}}
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[event]   Event: Image removal {"image":{"name":"registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a","appId":738415,"serviceId":2092,"serviceName":"main","imageId":5817405,"releaseId":2400727,"dependent":0,"appUuid":"00fa17057f6c46ec987cabdf90fbe8ef","commit":"437832610aa0ae4a9904a0be595cdb11"}}
[event]   Event: Image removed {"image":{"name":"registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a","appId":738415,"serviceId":2092,"serviceName":"main","imageId":5817405,"releaseId":2400727,"dependent":0,"appUuid":"00fa17057f6c46ec987cabdf90fbe8ef","commit":"437832610aa0ae4a9904a0be595cdb11"}}
[info]    Reported current state to the cloud
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[event]   Event: Service install {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}
[event]   Event: Service installed {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}
[event]   Event: Service start {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}
[event]   Event: Service started {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}

Nothing else of relevance that I can see in journalctl.

splitice commented 1 year ago

We can also confirm that the files are in the image, as we have an ls -las in the build script to verify it (these specific files have gone missing from delta updates before).
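For reference, the build-side check is nothing more than a listing; a minimal sketch of that step (the path is hypothetical, and the explicit existence check is an optional addition) would be:

# Last step of the build script: log the payload directory so the build output
# records which files (and sizes) made it into the image, and fail the build
# if any expected file is missing. Paths are placeholders.
ls -las /opt/app/packages
for f in gateway.zip nightlight.zip processor.zip process_manager.zip wellness.zip; do
  test -s "/opt/app/packages/$f" || { echo "missing $f"; exit 1; }
done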

cywang117 commented 1 year ago

@splitice Thanks for continuing to update this. Do you have a reproduction that we could use to debug?

splitice commented 1 year ago

@cywang117 This has been a long-term troubleshooting issue for us. It's so infrequent that it's quite difficult to test for.

Today we had a fresh installation (first install) that was missing a valid /etc/passwd: the file contained what I expect is the base layer's default (missing all the users we add in our layer). This was after a device power cycle on the first install.

This got me wondering whether the issue might be more related to balena-engine.

So what I did was take away the device's internet access. From that point on I used serial, so nothing could be re-pulled from the balena container registry. Then I deleted the container with balena rm.

Balena supervisor re-created the container and it had all the correct files. So the image is correct! It's just the container that is wrong.
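For the record, the test over serial on the host OS was roughly the following (the container names are whatever balena ps reports for the service, and the path is a placeholder):

# 1. Note the container the supervisor is currently running for the service
balena ps --format '{{.ID}} {{.Names}} {{.Image}}'

# 2. The broken view from inside the running container (files missing)
balena exec <old_container> ls -las /opt/app/packages

# 3. Remove the container. With networking down nothing can be re-pulled,
#    so the supervisor must re-create it from the local image store.
balena rm -f <old_container>

# 4. Once the supervisor has re-created the container, the same listing
#    shows all the expected files again.
balena exec <new_container> ls -las /opt/app/packages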

My thoughts for potential causes:

  1. Some sort of sync bug. Perhaps a filesystem sync is needed at the end of the image extraction?
  2. Perhaps there is an issue with the image extraction itself? Is this different for delta images than for regular images?

If you agree with my finger-pointing towards balena-engine, I'm happy to move this ticket there.

splitice commented 1 year ago

Still actively trying to catch this issue when it occurs (it happened again today). It's incredibly rare (between January and today we have done hundreds of container updates, mostly delta).

This time the file that was incorrect was /etc/passwd; specifically, the /etc/passwd from a parent layer was found in place of the /etc/passwd present in the image. This is a hint for us, as in the past the affected file was always missing entirely (and it would not have been present in the parent layers either).

I was able to verify the image was correct: I exported the image from the host OS using balena image save. When inspected, the layer tar (the only one in this image) contained the correct /etc/passwd. So where did this other /etc/passwd (which looks like the default for a Debian image) come from? It's not in the image produced by the balena builder (we use a build stage plus a scratch image to create a single-layer image).
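The check itself was along these lines (the image reference is a placeholder; the image in question has a single layer):

# Export the image from the host OS and unpack the archive
balena image save registry2.balena-cloud.com/v2/<hash>@sha256:<digest> -o /tmp/img.tar
mkdir -p /tmp/img && tar -xf /tmp/img.tar -C /tmp/img

# Each layer is a tarball; print /etc/passwd straight out of it.
# If the member path differs, tar -tf "$layer" | grep passwd will locate it.
for layer in /tmp/img/*/layer.tar; do
  echo "== $layer =="
  tar -xOf "$layer" etc/passwd
done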

I'm starting to think that this is an issue with delta creation rather than on the device side due to the seeming impossibility of this file being introduced any other way.

Happy to do something to authorize balena to take a look.

Release: e2a4743cfcf45f32154d0a080746e457 (id: 2672092) and the balena fleet device was upgraded using the delta from 3b28561d13a260dd935ce13364a96637

What I think is that, in very rare circumstances, the build-time delta is wrong. If I had more time today I would take a look at the balena-supervisor delta download API and see if I could get hold of the delta file (and check its contents).

dfunckt commented 1 year ago

I'm starting to think that this is an issue with delta creation rather than on the device side due to the seeming impossibility of this file being introduced any other way.

If this does not happen consistently across all devices that update between the same releases, it cannot be an issue with delta generation.


What Supervisor version(s) did you hit this with most recently? Can you share the fleet ID and enable support access so I can take a look myself?

dfunckt commented 1 year ago

I was able to verify the image was correct, I exported the image from the hostos using balena image save. [...]

Which image did you export? The one from the new release after the delta has been applied?

splitice commented 1 year ago

@dfunckt I can't reliably attest to the state of other devices (I didn't check, and I'm not sure there are others on the same releases; most within the team pick and choose their updates). In the past, though, it seemed to only be a single device failing.

The Supervisor version is a bit on the old side (14.0.8), however this issue has been fairly consistent for years. We keep our BSP within a year of current when possible; we are currently behind on this and intend to rectify it soon.

I exported the image the container was running from. The device in question (and the output below) has since been rolled forward to a new image (which cleared the issue for the device), but I exported the image ID indicated by both the delta-* and latest tags (same ID):

registry2.balena-cloud.com/v2/7dc307cd16e6f24ebfd08424056304df   delta-1f059135aaa2521e   24399faeb9b2        28 minutes ago      354MB
registry2.balena-cloud.com/v2/7dc307cd16e6f24ebfd08424056304df   latest                   24399faeb9b2        28 minutes ago      354MB

Fleet ID is 738415 and support access is enabled.

Other than erroneous or incomplete delta generation, I'm struggling to think of how an /etc/passwd without the users added within the Dockerfile (e.g. postgresql and halley) could make it through to a balena device. Not only does the image contain these users, there isn't likely any reason for the device itself to change the contents of this file (and only a limited set of services have permission to modify it anyway).

I guess it could be corruption between different images (say balena-supervisor, the only other image in our engine). Perhaps I should have saved the contents of the incorrect file for comparison (sorry, I did not think of that).

dfunckt commented 1 year ago

Supervisor version is a bit on the old side (14.0.8) however this issue has been fairly consistent for years. [...]

I'm only interested in the Supervisor version to infer whether the device is using balena-engine deltas or the older rsync-based implementation. The Supervisor is doing nothing more than just pulling an image to apply deltas -- everything happens within the engine.

I exported the image id as indicated by both the delta-* and latest tags (same id)

You're saying that the exact same image, when exported, has all the files as expected, but a container started from it is missing some of them. This suggests to me some kind of engine store weirdness. That's very strange. cc @lmbarros

Other than erroneous or incomplete delta generation, I'm struggling to think of how an /etc/passwd without the users added within the Dockerfile (e.g. postgresql and halley) could make it through to a balena device. Not only does the image contain these users, there isn't likely any reason for the device itself to change the contents of this file (and only a limited set of services have permission to modify it anyway).

If it's delta corruption then we should be able to reproduce it consistently just by pulling the release A image locally, applying the delta to get the release B image, and starting a container -- we should always get a container that can't see some of the files. But this isn't the case, as I understand it.
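As a way of checking a suspect device, the container's view of the filesystem can be compared against its image, roughly like this (container and image names are placeholders; expect a little noise from whiteout and runtime-only entries):

# Flattened filesystem as the container actually sees it
balena export <container> | tar -t | sort > container-files.txt

# Filesystem as recorded in the image's layers
balena image save <image> -o /tmp/imgcheck.tar
mkdir -p /tmp/imgcheck && tar -xf /tmp/imgcheck.tar -C /tmp/imgcheck
for layer in /tmp/imgcheck/*/layer.tar; do tar -tf "$layer"; done | sort -u > image-files.txt

# Paths present in the image layers but not visible inside the container
comm -13 container-files.txt image-files.txt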

Fleet ID is 738415 and support access is enabled.

Thanks, I'm taking a quick look.

splitice commented 1 year ago

I'm exhausted after a long day, but I'll see if I can rig something up to do updates in a loop (say, back and forth between these two releases). Just my device, so as not to interrupt the team.

This issue has been observed for too long, and its difficulty has led to it evading our efforts.

splitice commented 1 year ago

@dfunckt It took me a bit longer than expected to rig it up due to some issues with the builders today. I wanted clean builds, with limited changes, to alternate between.

Got a device cycling between commit "0b3cdb96ab5642bdb1ae79c5561d3efe" with build ID "2673622" and commit "cffbb44d30062e7b56ab5f5c73fd20f1" with build ID "2674797" every 15 minutes (30 minutes for the complete cycle), with a quick basic health check in between.

If /etc/passwd is invalid like it was, the health check (an HTTP check) will fail and the loop will exit. Other filesystem issues won't be detected, though, as checking for those would be more difficult.

If this passes after a couple of days, I'll do a commit that changes /etc/passwd in a minor but valid way (perhaps the issue only occurs if the file is modified), say by adding an extra user, and test alternating between those releases.
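The loop itself is nothing fancy; conceptually it is along these lines (pin_device_to_release is a hypothetical helper standing in for whatever dashboard/API/CLI pinning mechanism is used, and the health-check URL is a placeholder):

# Alternate one device between two pinned releases, stopping as soon as the
# in-container HTTP health check fails.
RELEASES="0b3cdb96ab5642bdb1ae79c5561d3efe cffbb44d30062e7b56ab5f5c73fd20f1"
while true; do
  for commit in $RELEASES; do
    pin_device_to_release "$commit"    # hypothetical helper: pin via dashboard/API
    sleep 900                          # give the delta update time to apply
    if ! curl -fsS --max-time 10 "http://<device-ip>/healthz" > /dev/null; then
      echo "health check failed on $commit"
      exit 1
    fi
  done
done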

splitice commented 1 year ago

A quick look at the numbers involved.

We have done 4,327 builds since we started using balena; not every one has been successful, but still a lot (wow). Fleet size has averaged around 5 in development (the production fleet gets only limited stable updates), and we have likely done in excess of 20,000 delta updates. We have only seen 3-5 filesystem failures (and only once has it been /etc/passwd), possibly more if it can go unnoticed. It could be quite difficult to replicate with my proposed method.

Thinking about the different (observed) instances of the issue that I can remember:

From what I can see there are the following possibilities:

I wonder if it's something like the file (within the image) being held open and exclusively locked, preventing the application of the delta?
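If it is an open-handle/locking problem, one way to check for it around update time (without assuming lsof is installed) would be to scan /proc for open handles on the affected files, e.g.:

# List every process that currently holds one of the payload files open.
# The path is a placeholder for the files that have gone missing before.
for fd in /proc/[0-9]*/fd/*; do
  target=$(readlink "$fd" 2>/dev/null) || continue
  case "$target" in
    */opt/app/packages/*) echo "${fd%/fd/*}: $target" ;;
  esac
done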

Or perhaps it's something related to the container's working layer and removed files.

I really don't know enough about the delta upgrade process to say.

At potentially 1/4,000 updates this is a very rare condition.