karaxuna opened this issue 3 years ago
[karaxuna] This issue has attached support thread https://jel.ly.fish/b0d1ee09-ed9c-4574-8b55-990e0e29d714
[karaxuna] This issue has attached support thread https://jel.ly.fish/6f6ca533-fefe-4971-a78e-bc2863209c07
this sounds like something went wrong on the delta generation side of things...
the reason seems to be that the contents of the io.resin.delta.config label on the delta didn't match the delta contents.
thinking about this some more... maybe https://github.com/balena-os/balena-engine/pull/231 is related
although I would expect generation to fail if layers are removed from under us
[nazrhom] This issue has attached support thread https://jel.ly.fish/28041686-44b7-43a6-836b-3bbaccf890ee
[anujdeshpande] This issue has attached support thread https://jel.ly.fish/c0c9841a-9ebb-4987-b3b4-295850757dfc
[gelbal] This issue has attached support thread https://jel.ly.fish/6e01c3c7-b272-4bfe-8d86-53c51a335fda
When this error occurs on a device running Supervisor < v12.9.6, check for outdated entries in the image table of the Supervisor database. Open a Node REPL inside the Supervisor container:

```
balena exec -it resin_supervisor node
```

Then inspect the table:

```js
sqlite3 = require('sqlite3')
db = new sqlite3.Database('/data/database.sqlite')
db.all('select * from image', console.log)
```

If some entries have a different releaseId than the others and the Supervisor is erroring on delta updates, the error message is likely appearing because the Supervisor is pulling from an incorrect delta source due to ambiguous or duplicate values for the image dockerId. In that case, the message is not caused by any Engine issue. You may clean up the database's image table with

```js
db.all('delete from image where releaseId != <TARGET_RELEASE_ID>', console.log)
```

and see if the message disappears. This was fixed in Supervisor v12.9.6, so upgrading to that version or later is recommended. See: https://github.com/balena-os/balena-supervisor/pull/1749
EDIT: Note that this will remove the inconsistencies in the Supervisor's image table, eliminating one variable that may be influencing the device negatively; however, it is not necessarily the root cause of this issue.
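As a side note, before deleting anything it can help to see how the rows in the image table are distributed across releases, so stale entries stand out. This is a hypothetical helper (not part of the Supervisor or its docs), run in the same Node REPL as above:

```js
// Hypothetical helper, not part of the Supervisor: count image rows per
// releaseId so stale entries stand out before running the delete above.
sqlite3 = require('sqlite3')
db = new sqlite3.Database('/data/database.sqlite')
db.all('select releaseId, count(*) as images from image group by releaseId', console.log)
```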
[rhampt] This issue has attached support thread https://jel.ly.fish/9dcee238-ad96-40e0-a850-0e3c1ac15db6
@robertgzr this is unlikely to be related to deltas after all.
Here are engine logs from the delta server while pulling either of the following two images:
registry2.balena-cloud.com/v2/d76730821fe314dd54b7144e197a94e6@sha256:63c5a57a1c541b7d436fe12053215c3e9ebf79d86f0e2c73b9c844d569afc2b1
or
registry2.balena-cloud.com/v2/12d6fc754a166233afbd395f54e938b5@sha256:43164eb17d96b919c2b2cf706bc16f1ba7911fff526c1a4520791bbf59254b8d
Not sure which one yet.
time="2022-02-02T13:45:30.375069034Z" level=info msg="Attempting next endpoint for pull after error: layers from manifest don't match image configuration"
time="2022-02-02T13:45:30.376065920Z" level=info msg="Layer sha256:a8ef03afc67322ed4175d210833862e9dbf9584e9be8c4c7c81b4c47ae01b5dc cleaned up"
time="2022-02-02T13:45:30.381348734Z" level=info msg="Layer sha256:1aedb6eaddf61f9592521a9b6a6e4b4dc228ffc4f5894befb6b98b8d62539015 cleaned up"
time="2022-02-02T13:45:30.381370527Z" level=info msg="Layer sha256:54a7233bd24c6d472753bec3745a031452d02ed85c4515222e4ee9782abb8712 cleaned up"
time="2022-02-02T13:45:30.555155296Z" level=info msg="Layer sha256:3d47ecad9fb9871644467ba27321d80c1d40a3991ee2e23a0ea3e82c5201135f cleaned up"
time="2022-02-02T13:45:30.555194589Z" level=info msg="Layer sha256:85e3abd8921714b9a2b959c457a2719df2c393a5e162f663258cfb0ab24b7ca6 cleaned up"
time="2022-02-02T13:45:30.555205731Z" level=info msg="Layer sha256:eb40245bb869b00307b3ee72c81070438f8cea095f71e33f0b94536b6e4cc7a5 cleaned up"
time="2022-02-02T13:45:30.555215603Z" level=info msg="Layer sha256:4e5c8cf3a8836044946c4f319a801e93fe107965a36ea2d9397cd53da984d86e cleaned up"
time="2022-02-02T13:45:30.555225733Z" level=info msg="Layer sha256:66222f16384e0c7d40aa8e9dd4a05d031957ee5bad3481644c86e83cd936d1d3 cleaned up"
time="2022-02-02T13:45:30.555234980Z" level=info msg="Layer sha256:f95049f369be6903a76374b9e2ee603fea6e94ddbff869cd3c672602b134bbc8 cleaned up"
Might not even be related to the engine at all -- digging a little deeper, in the last 30 days (where the server logs cut off) the first occurrence of this error is today, 3 hours ago (02 Feb 2022 13:45:30.408 UTC).
This thread (also found by Akis) seems relevant: https://github.com/distribution/distribution/issues/1439
[phil-d-wilson] This issue has attached support thread https://jel.ly.fish/e8dbc65a-6af1-424a-aaa0-ede4168861b2
[cywang117] This issue has attached support thread https://jel.ly.fish/c07b4332-0580-438d-9625-54b54ec3531a
In the above Jellyfish link, the user has a handful of devices that fail to pull a delta with this error, while the majority of the rest of their fleet moves to the target release without issue (i.e. some devices can pull the delta successfully). Therefore the issue may originate somewhere other than at image push time. The user is running Engine 19.03.29, however, so no additional debug logs from 20.10.16 are available.
EDIT: After purging the Docker directory, the delta pull succeeded, so perhaps some Engine data was corrupt. I wonder if there's a less extreme fix available.
[pipex] This issue has attached support thread https://jel.ly.fish/c892bceb-7c31-4f96-9d37-c97e9c2d9845
[anujdeshpande] This issue has attached support thread https://jel.ly.fish/fecea7eb-d1f8-4085-80f4-3a988b543a33
[thgreasi] This issue has attached support thread https://jel.ly.fish/97ff6771-fdc0-43b2-9bb4-53b3aaa0ff8a
[cywang117] This issue has attached support thread https://jel.ly.fish/c31720c9-11b4-488b-bee6-8e8140c5d3bc
Moving here some debugging notes I was keeping elsewhere.
Findings after working on ticket (1)
1) We can confirm this is not a problem with the delta image itself. Everything makes me believe this is some corruption (caused by a bug, not by a bad SD card or anything) of the Engine's persistent state.
2) The error is being triggered here.
3) In this case, the only mismatch was in the first layer (see the sketch after this list):
configRootFS.DiffIDs[0] == sha256:aaaaaaa (this is coming from the io.resin.delta.config label)
downloadedRootFS.DiffIDs[0] == sha256:bbbbbbb (which seems to be the ID of the delta layer itself, before it is applied)
4) How we got into this state is not clear (and is the key to understanding the issue).
5) On the device, I found this file: /var/lib/docker/image/overlay2/distribution/v2metadata-by-diffid/sha256/bbbbbbb, which I wasn't expecting to be there -- I don't see it when downloading the base and then the delta on another device. (Again, this is pointing to the delta layer itself.) Not sure how (or if) this is relevant, but it may be part of the corruption we are looking for.
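To make finding 3) concrete, here is a minimal sketch (in JavaScript, not the Engine's actual Go code) of the kind of consistency check that appears to be failing: the rootfs diff IDs recorded in the config taken from the io.resin.delta.config label must match, position by position, the diff IDs of the layers the Engine actually downloaded and registered. The object shapes and names (configRootFS, downloadedRootFS) simply mirror the findings above.

```js
// Illustrative only: a simplified model of the check that fails with
// "layers from manifest don't match image configuration". The Engine's
// real implementation is in Go; the shapes here mirror the names above.
function rootFSMatches(configRootFS, downloadedRootFS) {
  const expected = configRootFS.DiffIDs;   // from the io.resin.delta.config label
  const actual = downloadedRootFS.DiffIDs; // computed from the downloaded layers
  if (expected.length !== actual.length) {
    return false;
  }
  // A single positional mismatch is enough to abort the pull.
  return expected.every((diffID, i) => diffID === actual[i]);
}

// In this case only index 0 differed:
//   configRootFS.DiffIDs[0]     === 'sha256:aaaaaaa'  (target layer)
//   downloadedRootFS.DiffIDs[0] === 'sha256:bbbbbbb'  (the delta layer itself?)
```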
Trying to form a hypothesis from the findings above

Maybe this is happening: the io.resin.delta.config label points to the target layer, but we are finding the delta layer instead.

Findings after working on ticket (2)
Largely similar to ticket (1): failing at the same point, with a mismatch in only one of the layers. Some major differences, though:
[pipex] This issue has attached support thread https://jel.ly.fish/63acd8fd-ca46-4b32-a045-7fc44e53f63c
[cywang117] This issue has attached support thread https://jel.ly.fish/3a6c2646-fcbc-4039-a6dc-8e40c99114d9
[klutchell] This issue has attached support thread https://jel.ly.fish/fc0b2a4b-aec6-48fa-be54-495787ed90f4
[lmbarros] This issue has attached support thread https://jel.ly.fish/4635b048-35c4-4588-a140-95a316194880
[lmbarros] This issue has attached support thread https://jel.ly.fish/f67c5c56-000b-4924-aa5c-960a83195725
I have seen another case today. In this one, the device was short of space on the data partition. I wasn't able to check the device while the error was happening, nor were the logs available, but I noticed that there wasn't enough space to pull the whole image (and the docker-compose.yml was not overriding the default update policy).
Given that manual pulls (without deleting the current image first) have worked around the issue in the past, it seems safe to say that a "disk full" condition is not the root cause of the issue, but perhaps it can act as a trigger.
Looking further into the case from my last comment, now that we could get the logs. First impressions: it looks like the same thing I documented earlier for the previous cases, with the error being raised in the same spot. In this case it was a 75-layer image, with one mismatch at i=64 (zero-based indexing, so the 65th layer).
Another data point from the case above. Of those 75 layers (6.6GB), only about 20 or 30 (202MB) were new; all the other layers were shared with the old version of the image that was already present on the device. For the previous cases we didn't take note of whether there were shared layers, and I am not implying that this is related to the issue. It's just another detail to keep an eye on in future occurrences, to see if a pattern emerges.
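For future occurrences, one way to capture the shared-layer detail mentioned above could be to compare the rootfs diff IDs of the old and new images through the Engine API. This is a hypothetical diagnostic sketch, not an established procedure: it assumes the Engine socket is at /var/run/balena-engine.sock and that both image references are still present on the device.

```js
// Hypothetical diagnostic (not an established procedure): count how many
// rootfs layers two images share, using the Engine's /images/{name}/json
// endpoint over its unix socket.
const http = require('http');

function inspectImage(imageRef) {
  return new Promise((resolve, reject) => {
    const req = http.request(
      {
        socketPath: '/var/run/balena-engine.sock', // assumed socket path
        path: `/images/${imageRef}/json`,
        method: 'GET',
      },
      (res) => {
        let body = '';
        res.on('data', (chunk) => (body += chunk));
        res.on('end', () => resolve(JSON.parse(body)));
      }
    );
    req.on('error', reject);
    req.end();
  });
}

async function sharedLayers(oldRef, newRef) {
  const [oldImg, newImg] = await Promise.all([
    inspectImage(oldRef),
    inspectImage(newRef),
  ]);
  const oldLayers = new Set(oldImg.RootFS.Layers); // diff IDs of the old image
  const shared = newImg.RootFS.Layers.filter((id) => oldLayers.has(id));
  console.log(`${shared.length} of ${newImg.RootFS.Layers.length} layers are shared`);
}

// e.g. sharedLayers('<old image reference>', '<new image reference>')
```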
Oh, and by the way, the short-of-disk-space condition I speculated about earlier is probably bogus: because there were so many shared layers, the space available on disk was large enough to easily keep both versions.
[pipex] This issue has attached support thread https://jel.ly.fish/b0a5afdd-1cf8-4d74-86fb-e6b8dce63684
I encountered a device on support where this error message was "manifest"ing (excuse the pun) while the device was updating between releases A and B. I tried an experiment which happened to resolve it, and which did not involve rebooting or restarting anything. Below are the steps I took; hopefully they'll aid future investigations of this issue:
The user pinned to 4 different target releases in a short time frame -- could it be that the base image manifest got corrupted by multiple Engine attempts to interact with it (a race condition), due to the 4 pins in quick succession?
@cywang117 The solution you provided did resolve the issue for the customer, reporting as requested.
Thodoris noticed a case involving two images with the same contents (i.e., different tags for the same content hash). We haven't tested this hypothesis further yet, but could two simultaneous pulls of the same image (a delta, in this case) somehow trigger the issue?
Description
Image download fails with error: layers from manifest don't match image configuration

Output of balena-engine version:

Output of balena-engine info:

Additional environment details (device type, OS, etc.):
Device type: Raspberry Pi 3
OS version: balenaOS 2.58.3+rev1