docker / cli

The Docker CLI

docker changes the image shasum while saving it #5515

Open nx2804 opened 1 month ago

nx2804 commented 1 month ago

Description

Scenario: every time we save the same image with docker save -o tarfilename, Docker produces a tar file with a different shasum value. The SHA values should instead be identical.

Reproduce

1. Run docker save -o tarfilename imagename:tagname
2. Save the same image again, this time to tarfilename1
3. Run shasum tarfilename and shasum tarfilename1

The SHA values will be different.
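The comparison in these steps can also be scripted. The following is a minimal Python sketch, assuming the two tar files have already been written by docker save; `sha1_of` and `same_digest` are hypothetical helper names, and SHA-1 is used because that is what shasum computes by default.

```python
import hashlib


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large image tars are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_digest(path_a, path_b):
    """True when both files hash to the same SHA-1 value."""
    return sha1_of(path_a) == sha1_of(path_b)
```

Calling `same_digest("tarfilename", "tarfilename1")` on the files from the steps above would return False whenever the two saves differ in any byte.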

Expected behavior

No response

docker version

docker version 1.24.6

docker info

docker version 1.24.6

Additional Info

No response

thaJeztah commented 1 month ago

I think this may be a limitation of the "graphdriver" image store in the docker engine. The graphdriver store was designed to be optimized for local disk consumption. As part of that, images pulled from a registry are extracted after they are pulled, after which the compressed layers are discarded, and only the extracted layers, as well as information about the pulled layers, are preserved.

When saving an image, or pushing it to the same registry, these layers, as well as the related "image manifests", are reconstructed, but this part is not reproducible (due to both compression artifacts as well as timestamps included in image manifest metadata).
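To illustrate why recompressing the same data does not necessarily give the same bytes, here is a small Python sketch using the standard gzip module. It is not Docker's code and makes no claim about which compressor settings the engine uses; `compress_layer` and the sample data are hypothetical. It only shows that the gzip header timestamp and the compression level both change the output while the decompressed content stays identical.

```python
import gzip


def compress_layer(data: bytes, mtime: int, level: int = 9) -> bytes:
    """Gzip-compress data with an explicit header timestamp and level."""
    return gzip.compress(data, compresslevel=level, mtime=mtime)


# Stand-in for an extracted layer (hypothetical content).
layer = b"example layer contents\n" * 1000

same_a = compress_layer(layer, mtime=0)
same_b = compress_layer(layer, mtime=0)
other_time = compress_layer(layer, mtime=1)
other_level = compress_layer(layer, mtime=0, level=1)

print(same_a == same_b)        # identical inputs and settings
print(same_a == other_time)    # only the header timestamp differs
print(same_a == other_level)   # only the compression level differs
```

All four results decompress to the same bytes, so a digest taken over the compressed form can change even though the layer content has not.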

That said, I tried to see what the differences between the saved files are and, honestly, couldn't immediately find any. A possible reason could be the order in which files are included in the tar archive, but they seem to be identical in every other way:

docker pull alpine
Using default tag: latest
latest: Pulling from library/alpine
Digest: sha256:beefdbd8a1da6d2915566fde36db9db0b524eb737fc57cd1367effd16dc0d06d
Status: Downloaded newer image for alpine:latest
docker.io/library/alpine:latest

docker image save -o one.tar alpine:latest
docker image save -o two.tar alpine:latest

shasum one.tar
05ee3ff4ae600438a025ab12339395bdc94dfa85  one.tar

shasum two.tar
1b74f13ee5f67bc8345d0d4cd1e70119c3990feb  two.tar

tar --xattrs -tvf one.tar
drwxr-xr-x  0/0               0 2024-09-06 22:20 blobs/
drwxr-xr-x  0/0               0 2024-10-08 11:48 blobs/sha256/
-rw-r--r--  0/0             401 1970-01-01 00:00 blobs/sha256/309ff318b44b4f2af442a37a269a93ce6907d277d2c168d3160f36cc802f8838
-rw-r--r--  0/0         8081920 2024-09-06 22:20 blobs/sha256/63ca1fbb43ae5034640e5e6cb3e083e05c290072c5366fcaa9d62435a4cced85
-rw-r--r--  0/0            1143 2024-09-06 22:20 blobs/sha256/6ad8fd5c38430e1ab05f033c689994934a216c1a7481aeb44de1239d7ca82f77
-rw-r--r--  0/0            1471 2024-09-06 22:20 blobs/sha256/91ef0af61f39ece4d6710e465df5ed6ca12112358344fd51ae6a3b886634148b
-rw-r--r--  0/0             362 2024-10-08 11:48 index.json
-rw-r--r--  0/0             457 1970-01-01 00:00 manifest.json
-rw-r--r--  0/0              31 1970-01-01 00:00 oci-layout
-rw-r--r--  0/0              89 1970-01-01 00:00 repositories

tar --xattrs -tvf two.tar
drwxr-xr-x  0/0               0 2024-09-06 22:20 blobs/
drwxr-xr-x  0/0               0 2024-10-08 11:48 blobs/sha256/
-rw-r--r--  0/0             401 1970-01-01 00:00 blobs/sha256/309ff318b44b4f2af442a37a269a93ce6907d277d2c168d3160f36cc802f8838
-rw-r--r--  0/0         8081920 2024-09-06 22:20 blobs/sha256/63ca1fbb43ae5034640e5e6cb3e083e05c290072c5366fcaa9d62435a4cced85
-rw-r--r--  0/0            1143 2024-09-06 22:20 blobs/sha256/6ad8fd5c38430e1ab05f033c689994934a216c1a7481aeb44de1239d7ca82f77
-rw-r--r--  0/0            1471 2024-09-06 22:20 blobs/sha256/91ef0af61f39ece4d6710e465df5ed6ca12112358344fd51ae6a3b886634148b
-rw-r--r--  0/0             362 2024-10-08 11:48 index.json
-rw-r--r--  0/0             457 1970-01-01 00:00 manifest.json
-rw-r--r--  0/0              31 1970-01-01 00:00 oci-layout
-rw-r--r--  0/0              89 1970-01-01 00:00 repositories

I think switching to the containerd image store may help here. When using the containerd image store ("snapshotters"), pulled images, including their compressed layers, are kept, and the exported tar looks to be fully reproducible:

docker pull alpine
Using default tag: latest
latest: Pulling from library/alpine
Digest: sha256:beefdbd8a1da6d2915566fde36db9db0b524eb737fc57cd1367effd16dc0d06d
Status: Downloaded newer image for alpine:latest
docker.io/library/alpine:latest

docker save -o c8d-one.tar alpine:latest
docker save -o c8d-two.tar alpine:latest

shasum c8d-one.tar
b4d8c4f578be934ad2c0a82f7efd184cf027d27f  c8d-one.tar

shasum c8d-two.tar
b4d8c4f578be934ad2c0a82f7efd184cf027d27f  c8d-two.tar

thaJeztah commented 1 month ago

If you have an environment to test on, it's worth switching to the containerd image store (which also provides support for storing multi-arch images);

Be aware though that switching the store switches to a different location for storing images and containers; your existing images won't be deleted, but won't be accessible (but still consume space). If possible, my recommendation is to remove content (containers, images) before switching.

nx2804 commented 1 month ago

Thanks for your response. What is the default storage driver used in Docker?

nx2804 commented 1 month ago

Can I switch the containerd configuration to use the same storage driver used by Docker?

thaJeztah commented 1 month ago

Docker (without the containerd image store) selects the default storage driver based on the underlying filesystem. In most cases that is overlay2.

When using the containerd image store, no detection is done currently, but the default will be the overlayfs snapshotter (storage driver), which is the equivalent of overlay2 (both use the kernel's "OverlayFS").

stevvooe commented 1 month ago

I've reproduced the issue:

❯ docker save -o one.tar.gz debian:latest
❯ docker save -o two.tar.gz debian:latest
❯ wc -c one.tar.gz two.tar.gz
 143606272 one.tar.gz
 143606272 two.tar.gz
 287212544 total
❯ shasum one.tar.gz two.tar.gz
d068d04161345aa5693859dbfc6015913fdd8af7  one.tar.gz
930e62e8f0cd9a24af709107f0b199ff87e570be  two.tar.gz

On first pass, the metadata looks the same:

❯ shasum <(tar tvf one.tar.gz) <(tar tvf two.tar.gz)
cf795d491009e668091a1a13d83b949d00a80073  /dev/fd/14
cf795d491009e668091a1a13d83b949d00a80073  /dev/fd/15

However, if we look at the binary records, there is a clear difference:

❯ diff -ru <(hexdump -C one.tar.gz) <(hexdump -C two.tar.gz)
--- /dev/fd/14  2024-10-08 12:55:38
+++ /dev/fd/15  2024-10-08 12:55:56
@@ -20,7 +20,7 @@
 00000260  00 00 00 00 30 30 30 30  37 35 35 00 30 30 30 30  |....0000755.0000|
 00000270  30 30 30 00 30 30 30 30  30 30 30 00 30 30 30 30  |000.0000000.0000|
 00000280  30 30 30 30 30 30 30 00  31 34 37 30 31 33 30 36  |0000000.14701306|
-00000290  30 31 36 00 30 31 31 33  35 33 00 20 35 00 00 00  |016.011353. 5...|
+00000290  30 32 34 00 30 31 31 33  35 32 00 20 35 00 00 00  |024.011352. 5...|
 000002a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
 *
 00000300  00 75 73 74 61 72 00 30  30 00 00 00 00 00 00 00  |.ustar.00.......|
@@ -6857398,7 +6857398,7 @@
 088f2e60  00 00 00 00 30 30 30 30  36 34 34 00 30 30 30 30  |....0000644.0000|
 088f2e70  30 30 30 00 30 30 30 30  30 30 30 00 30 30 30 30  |000.0000000.0000|
 088f2e80  30 30 30 30 35 35 32 00  31 34 37 30 31 33 30 36  |0000552.14701306|
-088f2e90  30 31 36 00 30 31 31 32  34 36 00 20 30 00 00 00  |016.011246. 0...|
+088f2e90  30 32 34 00 30 31 31 32  34 35 00 20 30 00 00 00  |024.011245. 0...|
 088f2ea0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
 *
 088f2f00  00 75 73 74 61 72 00 30  30 00 00 00 00 00 00 00  |.ustar.00.......|

The differences seem to map to the timestamp field and the checksum that follows it. In both cases the value changes from 14701306016 to 14701306024. These look weird as timestamps: read as decimal, they would be way in the future. However, because tar is a fun format, they are stored in octal ASCII. Decoding them, they map to an Oct 8th date. Let's have a look at the tar listing:
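The octal decoding can be checked with a few lines of Python. This is a minimal sketch; `decode_tar_mtime` is a hypothetical helper, and the two field values are copied from the hexdump above (eleven octal digits followed by a NUL byte).

```python
from datetime import datetime, timezone


def decode_tar_mtime(field: bytes) -> datetime:
    """Decode a tar header mtime field (octal ASCII, NUL/space terminated)."""
    seconds = int(field.rstrip(b"\x00 ").decode("ascii"), 8)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


first = decode_tar_mtime(b"14701306016\x00")
second = decode_tar_mtime(b"14701306024\x00")

print(first.isoformat())                  # 2024-10-08T19:46:22+00:00
print(second.isoformat())                 # 2024-10-08T19:46:28+00:00
print((second - first).total_seconds())   # 6.0
```

The two values are six seconds apart, which fits two consecutive docker save runs each stamping the current time. The times are printed in UTC, so they will not match the local-time hour shown in a tar listing unless the machine's timezone is taken into account.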

❯ tar tvf one.tar.gz
drwxr-xr-x  0 0      0           0 Jul  1 17:39 blobs/
drwxr-xr-x  0 0      0           0 Oct  8 12:46 blobs/sha256/
-rw-r--r--  0 0      0         403 Dec 31  1969 blobs/sha256/c89edf5050f4db4a7ac20a64bdb77f7ddca76dfc2c87a39fddca419084dca080
-rw-r--r--  0 0      0   143594496 Jul  1 17:39 blobs/sha256/d1660adccd2b42ad0160cba9a291ef75a87223577240a585a7f1cb90676ec3b8
-rw-r--r--  0 0      0        1152 Jul  1 17:39 blobs/sha256/d5156a0989b7b62fd13b9f28e7e1864554ae6b47657a2efc503b097818653cad
-rw-r--r--  0 0      0        1477 Jul  1 17:39 blobs/sha256/f753e4d18c7075845e84d759f49d57529f268aa7a262b517fd9f3d62749890eb
-rw-r--r--  0 0      0         362 Oct  8 12:46 index.json
-rw-r--r--  0 0      0         459 Dec 31  1969 manifest.json
-rw-r--r--  0 0      0          31 Dec 31  1969 oci-layout
-rw-r--r--  0 0      0          89 Dec 31  1969 repositories

From here, we can see that index.json and blobs/sha256 are generated with the current time as the timestamp on these tar header records. There can be a few causes of that but we should be able to track it down.
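Because tar tvf only prints timestamps to the minute, two saves made within the same minute produce identical listings. One way to pinpoint the differing entries is to compare the header fields at full resolution. The following is a minimal sketch using Python's standard tarfile module; `diff_tar_headers` and the field selection are assumptions made for illustration, not part of any Docker tooling.

```python
import tarfile

# Header fields to compare for each archive member.
FIELDS = ("name", "type", "mode", "uid", "gid", "size", "mtime")


def diff_tar_headers(path_a, path_b):
    """Return (member name, field, value_a, value_b) for every differing field.

    Members are compared pairwise in archive order.
    """
    with tarfile.open(path_a) as a, tarfile.open(path_b) as b:
        members_a = a.getmembers()
        members_b = b.getmembers()

    diffs = []
    if len(members_a) != len(members_b):
        diffs.append(("<archive>", "member count", len(members_a), len(members_b)))
    for ma, mb in zip(members_a, members_b):
        for field in FIELDS:
            va, vb = getattr(ma, field), getattr(mb, field)
            if va != vb:
                diffs.append((ma.name, field, va, vb))
    return diffs


if __name__ == "__main__":
    import sys

    for name, field, va, vb in diff_tar_headers(sys.argv[1], sys.argv[2]):
        print(f"{name}: {field} {va} != {vb}")
```

Run against the two archives from the reproduction, a script like this would be expected to report only mtime differences for the entries stamped with the current time.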

Here's some of my info:

Server: Docker Desktop 4.33.0 (159291)
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:02:44 2024
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.1.13
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Note that I do not have the containerd snapshotter enabled (I should though ;) ), so this is just from the overlay2 graphdriver. I don't remember if this is in graphdriver or not but it likely is.

As a matter of course, this really isn't a cli bug but we can likely fix it in moby. We should declare whether or not the docker save command is hash stable.

stevvooe commented 1 month ago

Ok, breaking this down to make the fix easier. We have two bugs:

  1. The first happens when we write index.json in https://github.com/moby/moby/blob/master/image/tarexport/save.go#L385. This needs to have a system.Chtimes call that follows it. Fairly straightforward fix.
  2. The second problem happens when the blob directory is created using os.MkdirAll: https://github.com/moby/moby/blob/master/image/tarexport/save.go#L263. That will naively create intermediary path components with the creation time of the local machine. We need to walk back up to the index root and set those timestamps correctly.
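The shape of both fixes can be sketched outside of moby. The following is a Python analogue only, not the actual Go change: it uses os.utime where the comment above refers to system.Chtimes, and `pin_mtime`, `pin_tree_mtimes`, `write_layout`, and the fixed timestamp argument are all hypothetical.

```python
import os
from pathlib import Path


def pin_mtime(path: Path, mtime: float) -> None:
    """Set both access and modification time of path to a fixed value."""
    os.utime(path, (mtime, mtime))


def pin_tree_mtimes(root: Path, rel: Path, mtime: float) -> None:
    """Pin the mtime of root/rel and of every parent directory below root."""
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError("rel must be a relative path inside root")
    current = root / rel
    # Walk back up towards root, covering the intermediate components
    # that a recursive mkdir created implicitly.
    while current != root:
        pin_mtime(current, mtime)
        current = current.parent


def write_layout(root: Path, mtime: float) -> None:
    """Create a blobs/sha256 directory and index.json with stable timestamps."""
    blob_dir = Path("blobs") / "sha256"
    (root / blob_dir).mkdir(parents=True, exist_ok=True)

    index = root / "index.json"
    index.write_text("{}")
    pin_mtime(index, mtime)  # bug 1: the file written last keeps "now" otherwise

    # bug 2: pin the directories after all writes, since adding entries
    # to a directory updates its mtime again.
    pin_tree_mtimes(root, blob_dir, mtime)
```

The ordering is the main design point: directory timestamps are set last, because creating a file inside a directory bumps that directory's modification time.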

nx2804 commented 1 month ago

Let me know once this feature is merged and available.