Error: layers from manifest don't match image configuration

karaxuna commented 3 years ago

Description

Image download fails with error:

Failed to download image 'registry2.balena-cloud.com/v2/9bc400dcc9a75bc3be299a20d38e9c76@sha256:1709f3810eaecf4817bd419c2228b8218d4538fa1d208c4915cc0cde49282cc9' due to 'layers from manifest don't match image configuration'

Output of balena-engine version:

Client:
 Version:           19.03.13-dev
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        074a481789174b4b6fd2d706086e8ffceb72e924
 Built:             Sun Aug 16 10:33:45 2020
 OS/Arch:           linux/arm
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-dev
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       074a481789174b4b6fd2d706086e8ffceb72e924
  Built:            Sun Aug 16 10:33:45 2020
  OS/Arch:          linux/arm
  Experimental:     true
 containerd:
  Version:          1.2.0+unknown
  GitCommit:        
 runc:
  Version:          
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 balena-engine-init:
  Version:          0.13.0
  GitCommit:        949e6fa-dirty

Output of balena-engine info:

Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 19.03.13-dev
 Storage Driver: aufs
  Root Dir: /var/lib/docker/aufs
  Backing Filesystem: extfs
  Dirs: 37
  Dirperm1 Supported: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm: 
  NodeID: 
  Is Manager: false
  Node Address: 
 Runtimes: bare runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version: 
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: 949e6fa-dirty (expected: fec3683b971d9)
 Kernel Version: 4.19.118
 Operating System: balenaOS 2.58.3+rev1
 OSType: linux
 Architecture: armv7l
 CPUs: 4
 Total Memory: 745.6MiB
 Name: 81e8074
 ID: X4GL:EF4Y:PES6:2IP6:K2HI:T67S:6XJX:7XP3:FEYW:GTJZ:NUGM:3XCU
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: the aufs storage-driver is deprecated, and will be removed in a future release.

Additional environment details (device type, OS, etc.): Device type: Raspberry Pi 3 OS version: balenaOS 2.58.3+rev1

jellyfish-bot commented 3 years ago

[karaxuna] This issue has attached support thread https://jel.ly.fish/b0d1ee09-ed9c-4574-8b55-990e0e29d714

jellyfish-bot commented 3 years ago

[karaxuna] This issue has attached support thread https://jel.ly.fish/6f6ca533-fefe-4971-a78e-bc2863209c07

robertgzr commented 3 years ago

this sounds like something went wrong on the delta generation side of things... the reason seems to be that the contents of the io.resin.delta.config label on the delta didn't match with the delta contents.

robertgzr commented 3 years ago

thinking about this some more... maybe https://github.com/balena-os/balena-engine/pull/231 is related

robertgzr commented 3 years ago

although I would expect generation to fail if layers are removed from under us

jellyfish-bot commented 3 years ago

[nazrhom] This issue has attached support thread https://jel.ly.fish/28041686-44b7-43a6-836b-3bbaccf890ee

jellyfish-bot commented 3 years ago

[anujdeshpande] This issue has attached support thread https://jel.ly.fish/c0c9841a-9ebb-4987-b3b4-295850757dfc

jellyfish-bot commented 3 years ago

[gelbal] This issue has attached support thread https://jel.ly.fish/6e01c3c7-b272-4bfe-8d86-53c51a335fda

cywang117 commented 3 years ago

When this error occurs on a device running Supervisor < v12.9.6, check for outdated entries in the image table of the Supervisor database using the following steps:

balena exec -it resin_supervisor node
`sqlite3 = require('sqlite3')
db = new sqlite3.Database('/data/database.sqlite')
db.all('select * from image',console.log)

If there are any entries which have different releaseId fields than other entries, and the Supervisor is erroring with delta updates, it's likely that the error message is appearing because the Supervisor is pulling from an incorrect delta source due to ambiguous or duplicate values for image dockerId. In that case, the message is not because of any engine issue. You may clean up the database's image table with db.all('delete from image where releaseId != <TARGET_RELEASE_ID>',console.log) to see if the message disappears. This was fixed with Supervisor v12.9.6 and an upgrade to a version >= that is recommended. See: https://github.com/balena-os/balena-supervisor/pull/1749

EDIT: Note that this will remove the inconsistencies in the Supervisor's database's image table, removing one variable which may influence the device negatively, however it is not necessarily the root cause of this issue.

jellyfish-bot commented 2 years ago

[rhampt] This issue has attached support thread https://jel.ly.fish/9dcee238-ad96-40e0-a850-0e3c1ac15db6

dfunckt commented 2 years ago

@robertgzr this is unlikely to be related to deltas after all.

Here are engine logs from the delta server while pulling either of the following two images:

registry2.balena-cloud.com/v2/d76730821fe314dd54b7144e197a94e6@sha256:63c5a57a1c541b7d436fe12053215c3e9ebf79d86f0e2c73b9c844d569afc2b1 or
registry2.balena-cloud.com/v2/12d6fc754a166233afbd395f54e938b5@sha256:43164eb17d96b919c2b2cf706bc16f1ba7911fff526c1a4520791bbf59254b8d

Not sure which one yet.

time="2022-02-02T13:45:30.375069034Z" level=info msg="Attempting next endpoint for pull after error: layers from manifest don't match image configuration"
time="2022-02-02T13:45:30.376065920Z" level=info msg="Layer sha256:a8ef03afc67322ed4175d210833862e9dbf9584e9be8c4c7c81b4c47ae01b5dc cleaned up"
time="2022-02-02T13:45:30.381348734Z" level=info msg="Layer sha256:1aedb6eaddf61f9592521a9b6a6e4b4dc228ffc4f5894befb6b98b8d62539015 cleaned up"
time="2022-02-02T13:45:30.381370527Z" level=info msg="Layer sha256:54a7233bd24c6d472753bec3745a031452d02ed85c4515222e4ee9782abb8712 cleaned up"
time="2022-02-02T13:45:30.555155296Z" level=info msg="Layer sha256:3d47ecad9fb9871644467ba27321d80c1d40a3991ee2e23a0ea3e82c5201135f cleaned up"
time="2022-02-02T13:45:30.555194589Z" level=info msg="Layer sha256:85e3abd8921714b9a2b959c457a2719df2c393a5e162f663258cfb0ab24b7ca6 cleaned up"
time="2022-02-02T13:45:30.555205731Z" level=info msg="Layer sha256:eb40245bb869b00307b3ee72c81070438f8cea095f71e33f0b94536b6e4cc7a5 cleaned up"
time="2022-02-02T13:45:30.555215603Z" level=info msg="Layer sha256:4e5c8cf3a8836044946c4f319a801e93fe107965a36ea2d9397cd53da984d86e cleaned up"
time="2022-02-02T13:45:30.555225733Z" level=info msg="Layer sha256:66222f16384e0c7d40aa8e9dd4a05d031957ee5bad3481644c86e83cd936d1d3 cleaned up"
time="2022-02-02T13:45:30.555234980Z" level=info msg="Layer sha256:f95049f369be6903a76374b9e2ee603fea6e94ddbff869cd3c672602b134bbc8 cleaned up"

dfunckt commented 2 years ago

Might even not be related to the engine at all -- digging a little deeper, in the last 30 days (where server logs cut off) the first occurence of this error is today 3 hours ago (02 Feb 2022 13:45:30.408 UTC).

lmbarros commented 2 years ago

This thread (also found by Akis) seems relevant: https://github.com/distribution/distribution/issues/1439

jellyfish-bot commented 2 years ago

[phil-d-wilson] This issue has attached support thread https://jel.ly.fish/e8dbc65a-6af1-424a-aaa0-ede4168861b2

jellyfish-bot commented 2 years ago

[cywang117] This issue has attached support thread https://jel.ly.fish/c07b4332-0580-438d-9625-54b54ec3531a

cywang117 commented 2 years ago

In the above JF link, the user is getting a handful of devices that fail to pull a delta with this error, with the majority of the rest of their fleet moving to the target release without issue (i.e. some devices can pull the delta successfully). Therefore the issue may originate from somewhere other than when the image is pushed. The user is running Engine 19.03.29 however, so no additional debug logs from 20.10.16 are available.

EDIT: After purging the Docker directory, the delta pull succeeded, so perhaps some engine data is corrupt. I wonder if there's a less extreme fix available..

jellyfish-bot commented 2 years ago

[pipex] This has attached https://jel.ly.fish/c892bceb-7c31-4f96-9d37-c97e9c2d9845

jellyfish-bot commented 2 years ago

[anujdeshpande] This has attached https://jel.ly.fish/fecea7eb-d1f8-4085-80f4-3a988b543a33

jellyfish-bot commented 2 years ago

[thgreasi] This has attached https://jel.ly.fish/97ff6771-fdc0-43b2-9bb4-53b3aaa0ff8a

jellyfish-bot commented 2 years ago

[cywang117] This has attached https://jel.ly.fish/c31720c9-11b4-488b-bee6-8e8140c5d3bc

lmbarros commented 1 year ago

Moving into here some debugging notes I was keeping elsewhere.

Findings after working on ticket (1)

1) We can confirm this is not a problem with the delta image itself. Everything makes me believe this is some corruption (caused by a bug, not by bad SD card or anything) on the Engine persistent state. 2) The error is being triggered here. 3) In this case, the only mismatch was in the first layer:

configRootFS.DiffIDs[0] == sha256:aaaaaaa (this is coming from the io.resin.delta.config label)
downloadedRootFS.DiffIDs[0] == sha256:bbbbbbb (which seems to be ID of the delta layer itself, before it is applied). 4) How we got into this state is not clear (and is the key to understand the issue). 5) On the device, I found this file: /var/lib/docker/image/overlay2/distribution/v2metadata-by-diffid/sha256/bbbbbbb, which I wasn't expecting to be there -- I don't see it when downloading the base and then the delta in another device. (Again, this is pointing to the delta layer itself.) Not sure how (or if) this is relevant, but it may be part of the corruption we are looking for.

Trying to form an hypothesis from the findings above

Maybe this is happening:

We download a delta layer as part of the app update.
- Normally, we'd apply the delta immediately, that's the expected Engine behavior (pull delta == apply delta).
- But, for some reason, applying the delta somehow fails.
- We end up with the delta layer on disk. On a normal delta pull, we'd not have the delta layer on disk, but on one affected device the layer was found on disk.
This would match the logs in the case I analyzed above, AFAIU: the real image metadata (from the io.resin.delta.config label points to the target layer, but we are finding the delta layer instead.
Any subsequent attempts to repeat the update would keep failing with the same error.
- Because if the delta layer is already found on disk, we'll not try to pull it again. And if we don't pull it again, the delta never gets applied.
This would also explain why only some devices of a fleet are affected.

Findings after working on ticket (2)

Largely similar to ticket (1): failing at the same point, with a mismatch in only one of the layers. Some major differences, though:

The image causing error was not a delta.
- This invalidates my previous hypothesis.
Manually pulling the image still solved the issue.
- Which begs the question: wasn't this exactly the same thing the Supervisor was trying to do? Why did the manual pull work and the Supervisor pull didn't? (And I think this question is relevant also for the delta case.)

jellyfish-bot commented 1 year ago

[pipex] This has attached https://jel.ly.fish/63acd8fd-ca46-4b32-a045-7fc44e53f63c

jellyfish-bot commented 1 year ago

[cywang117] This has attached https://jel.ly.fish/3a6c2646-fcbc-4039-a6dc-8e40c99114d9

jellyfish-bot commented 1 year ago

[klutchell] This has attached https://jel.ly.fish/fc0b2a4b-aec6-48fa-be54-495787ed90f4

jellyfish-bot commented 1 year ago

[lmbarros] This has attached https://jel.ly.fish/4635b048-35c4-4588-a140-95a316194880

jellyfish-bot commented 1 year ago

[lmbarros] This has attached https://jel.ly.fish/f67c5c56-000b-4924-aa5c-960a83195725

lmbarros commented 1 year ago

I have seen another case today. In this one, the device was short of space on the data partition. I wasn't able to check the device while the error was happening and neither the logs were available, but I noticed that there wasn't enough space to pull the whole image (and the docker-compose.yml was not overriding the default update policy).

Given that manual pulls (without deleting the current image first) have worked around the issue in the past, it seems safe to say that a "disk full" condition is not the trigger for the issue, but perhaps it can be a trigger.

lmbarros commented 1 year ago

Looking further on the case from my last comment, now that we could get the logs. First impressions, looks like the same thing as I documented earlier on for previous cases. Error being raised on the same spot. In this case it was a 75-layer image, with one mismatch when i=64 (zero-based indexing, so 65th layer).

lmbarros commented 1 year ago

Another data point from the case above. From those 75 layers (6.6GB), only about 20 or 30 of them (202MB) were new. All the other layers were shared with the old version of the image that was already present on the device. For the previous cases we didn't take note if there were shared layers or not, and I am not implying that this is related with the issue. Just another detail to keep our eye on in future occurrences, and see if a pattern emerges.

lmbarros commented 1 year ago

Oh, and by the way, the short-of-disk-space condition I speculated about earlier is probably bogus: because there were so many shared layers, the space available on disk was large enough to easily keep both versions.

jellyfish-bot commented 1 year ago

[pipex] This has attached https://jel.ly.fish/b0a5afdd-1cf8-4d74-86fb-e6b8dce63684

cywang117 commented 9 months ago

I encountered a device on support where this error message was "manifest"ing (excuse the pun) while a device was updating between releases A & B. I tried an experiment which happened to resolve, which did not involve rebooting or restarting anything. Below is the steps I took; hopefully it'll aid in future investigations of this issue:

Stop service & remove base image from release A (i.e. source image, not target image) of service that was erroring when Supervisor was pulling delta
Manually pull the base image directly from balena registry
Restart Supervisor, wait for SV to attempt delta pull to release B again
Observe that delta pull was successful this time without manifesting the error

The user pinned to 4 different target releases in a short time frame -- could it be that the base image manifest got corrupt from multiple attempts from the Engine to interact with it (race condition), due to the 4 pins in quick succession?

vipulgupta2048 commented 8 months ago

@cywang117 The solution you provided did resolve the issue for the customer, reporting as requested.

lmbarros commented 6 months ago

Thodoris has noticed a case which involved two images with the same contents (i.e., different tags for the same content hash). We didn't test this hypothesis further yet, but what if two simultaneous pulls of the same image (a delta, in this case), could somehow trigger the issue?

balena-os / balena-engine

Error: layers from manifest don't match image configuration #244