docker-library / official-images

Primary source of truth for the Docker "Official Images" program
https://hub.docker.com/u/library
Apache License 2.0

Reproducible builds #16044

Open AkihiroSuda opened 10 months ago

AkihiroSuda commented 10 months ago

Dockerfile PRs:

No PR is needed for the following repos:

meta-script PR:

I also talked about this at DockerCon 2023: https://medium.com/nttlabs/dockercon-2023-reproducible-builds-with-buildkit-for-software-supply-chain-security-0e5aedd1aaa7

Scope of reproduction

The digests of the image manifest blobs (and config blobs and layer blobs) should be reproducible. https://github.com/reproducible-containers/diffoci can be used for testing reproducibility.
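As a rough illustration of that scope (not a substitute for diffoci), the sketch below compares two already-parsed OCI image manifests and reports which descriptors differ. The function names are hypothetical, and the manifests are assumed to carry the usual `config` and `layers` descriptors with a `digest` field each.

```python
import hashlib


def blob_digest(blob: bytes) -> str:
    """Return the OCI-style digest string for a blob."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def differing_blobs(manifest_a: dict, manifest_b: dict) -> list:
    """List the descriptors (config, layers[i]) whose digests differ.

    Hypothetical helper: it only compares digests and does not explain
    why the blobs differ, which is what diffoci is for.
    """
    diffs = []
    if manifest_a["config"]["digest"] != manifest_b["config"]["digest"]:
        diffs.append("config")
    layers_a = manifest_a["layers"]
    layers_b = manifest_b["layers"]
    if len(layers_a) != len(layers_b):
        diffs.append("layer-count")
    for index, (layer_a, layer_b) in enumerate(zip(layers_a, layers_b)):
        if layer_a["digest"] != layer_b["digest"]:
            diffs.append(f"layers[{index}]")
    return diffs
```

An empty result means the config and every layer match, which in turn means the manifest itself can hash to the same digest.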

SLSA provenance manifest blobs (enabled by default in recent buildx) are not reproducible by design, so the image index digest will not be reproducible.

Reproducing base images

Base images have to be pinned by the sha256 digest for reproduction.

A digest can be embedded in a FROM instruction of a Dockerfile. However, image maintainers might not want to update Dockerfiles frequently just to ensure that the latest base image is picked up during the upstream build.

In that case, we can just leave FROM instructions unpinned, and let reproducers use the CONVERT action of source policies to dynamically replace the image identifier:

https://github.com/moby/buildkit/blob/v0.13.0-beta1/docs/build-repro.md

{
  "action": "CONVERT",
  "selector": {
    "identifier": "docker-image://docker.io/library/alpine:latest"
  },
  "updates": {
    "identifier": "docker-image://docker.io/library/alpine:latest@sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454"
  }
}

The digest to be used for the CONVERT action is recorded in the SLSA provenance. We need to update the buildx CLI to automatically generate CONVERT actions from SLSA provenances. (See the buildx CLI section below.)
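Until buildx can do this itself, generating the rules by hand is straightforward. The sketch below is only an illustration: it assumes the base image digests have already been extracted from the provenance into a plain mapping (the extraction step is not shown), the helper name `convert_rules` is hypothetical, and the top-level `rules` wrapper is my reading of the linked build-repro.md rather than something stated in this thread.

```python
import json


def convert_rules(pinned: dict) -> dict:
    """Build a source policy from {image reference: sha256 digest}.

    Hypothetical helper; the input mapping is assumed to come from the
    SLSA provenance of the image being reproduced.
    """
    rules = []
    for reference, digest in sorted(pinned.items()):
        rules.append(
            {
                "action": "CONVERT",
                "selector": {"identifier": f"docker-image://{reference}"},
                "updates": {"identifier": f"docker-image://{reference}@{digest}"},
            }
        )
    # Assumption: the policy file wraps the rules in a "rules" array.
    return {"rules": rules}


policy = convert_rules(
    {
        "docker.io/library/alpine:latest": "sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454"
    }
)
print(json.dumps(policy, indent=2))
```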

Reproducing package versions

Debian and Ubuntu

Debian and Ubuntu keep old packages on http://snapshot.debian.org and http://snapshot.ubuntu.com. I wrote a script to rewrite /etc/apt/sources.list to use those snapshot servers: https://github.com/reproducible-containers/repro-sources-list.sh/blob/master/repro-sources-list.sh The snapshot timestamp can be supplied via $SOURCE_DATE_EPOCH.

These snapshot servers are quite slow and sometimes flaky (especially for Debian), so they probably shouldn't be used for the upstream builds.

See https://github.com/moby/buildkit/pull/4669 for how to rewrite /etc/apt/sources.list in downstream builds.
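To make the timestamp mapping concrete, here is a minimal sketch of how a SOURCE_DATE_EPOCH value could be turned into a snapshot.debian.org sources.list entry. It is not the logic of repro-sources-list.sh itself; the function names, the default suite and the `check-valid-until=no` option are assumptions made for illustration.

```python
from datetime import datetime, timezone


def snapshot_timestamp(source_date_epoch: int) -> str:
    """Format an epoch as the UTC timestamp used in snapshot URLs."""
    moment = datetime.fromtimestamp(source_date_epoch, tz=timezone.utc)
    return moment.strftime("%Y%m%dT%H%M%SZ")


def debian_snapshot_source(source_date_epoch: int, suite: str = "bookworm") -> str:
    """Return one sources.list line pointing at the snapshot archive.

    Assumption: check-valid-until=no is set because old snapshots carry
    expired Release files.
    """
    stamp = snapshot_timestamp(source_date_epoch)
    return (
        "deb [check-valid-until=no] "
        f"http://snapshot.debian.org/archive/debian/{stamp}/ {suite} main"
    )


print(debian_snapshot_source(1690467916))
```

The epoch 1690467916 used here is the value that appears in the buildx output later in this issue; it corresponds to 2023-07-27 14:25:16 UTC.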

Alpine

Reproducing apk packages is still challenging, as Alpine does not have snapshot servers.

The long-term plan is to capture apk packages at build time and attach them to the image as artifacts:

Reproducing file timestamps

BuildKit v0.13 supports rewriting the timestamps of the files inside image layers to use $SOURCE_DATE_EPOCH.

--output type=image,name=docker.io/username/image,push=true,rewrite-timestamp=true

https://github.com/moby/buildkit/blob/v0.13.0-beta1/docs/build-repro.md#source_date_epoch
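Why this matters for layer digests can be shown with a small stand-alone sketch. It builds an uncompressed tar layer in memory and clamps each file's mtime to SOURCE_DATE_EPOCH, so that two builds run at different times yield the same bytes. This is not BuildKit's implementation; the helper names are hypothetical, and whether BuildKit clamps or overwrites unconditionally is not stated in this thread (the sketch clamps).

```python
import hashlib
import io
import tarfile


def build_layer(files: dict, source_date_epoch: int) -> bytes:
    """Create a tar layer from {path: (content, original mtime)}.

    Entries are added in sorted order and mtimes newer than
    source_date_epoch are clamped, so the output is deterministic.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for path in sorted(files):
            content, mtime = files[path]
            info = tarfile.TarInfo(name=path)
            info.size = len(content)
            info.mtime = min(mtime, source_date_epoch)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def layer_digest(blob: bytes) -> str:
    """Return the sha256 digest string of a layer blob."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


EPOCH = 1690467916
first = build_layer({"usr/local/bin/app": (b"binary", EPOCH + 100)}, EPOCH)
second = build_layer({"usr/local/bin/app": (b"binary", EPOCH + 999)}, EPOCH)
print(layer_digest(first) == layer_digest(second))
```

Compression is left out on purpose: gzip adds its own header timestamp, which would need separate handling.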

Removal of logs, etc.

The entries in /var/cache/ldconfig/aux-cache are organised as an associative array, with the keys including file attributes like device number, inode number and inode change time. This means it is not only unreproducible, but completely useless at boot time since the device and inode numbers of libraries will be different.

https://linux.debian.kernel.narkive.com/7wfNAf7A/bug-845034-initramfs-tools-please-ensure-initrd-images-are-reproducible#post3
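A cleanup step along these lines could be sketched as follows. The helper name and the path list are hypothetical: only /var/cache/ldconfig/aux-cache comes from the quoted text, and treating everything under var/log as disposable is an assumption. In a real Dockerfile the removal would have to happen in the same RUN instruction that creates the files.

```python
from pathlib import Path

# aux-cache is taken from the quoted bug report; var/log is an assumption.
UNREPRODUCIBLE_PATHS = ["var/cache/ldconfig/aux-cache", "var/log"]


def scrub(rootfs: str, relative_paths=UNREPRODUCIBLE_PATHS) -> list:
    """Delete unreproducible files under rootfs and return what was removed.

    Directories are kept (only the files inside them are deleted) so that
    the directory layout of the image does not change.
    """
    root = Path(rootfs)
    removed = []
    for relative in relative_paths:
        target = root / relative
        if target.is_file():
            target.unlink()
            removed.append(relative)
        elif target.is_dir():
            for child in sorted(target.rglob("*")):
                if child.is_file():
                    child.unlink()
                    removed.append(str(child.relative_to(root)))
    return removed
```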

Reproducing file contents

Some Dockerfiles will need extra work to reproduce file contents, e.g., sorting arrays, removing randomized mktemp paths, ...

e.g., in case of gcc:

diff -ur --no-dereference a/usr/local/lib64/libgo.la b/usr/local/lib64/libgo.la
--- a/usr/local/lib64/libgo.la  2024-01-12 18:14:56.000000000 +0900
+++ b/usr/local/lib64/libgo.la  2024-01-12 18:21:45.000000000 +0900
@@ -17,7 +17,7 @@
 inherited_linker_flags=' -pthread'

 # Libraries that this one depends upon.
-dependency_libs=' -L/tmp/tmp.LWUIKDJ22E/x86_64-linux-gnu/libatomic/.libs -lpthread -lm'
+dependency_libs=' -L/tmp/tmp.yeTnsy0FEm/x86_64-linux-gnu/libatomic/.libs -lpthread -lm'

 # Names of additional weak libraries provided by this library
 weak_library_names=''
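When triaging diffs like this one, it helps to know whether a randomized mktemp directory is the only difference. The sketch below is a hypothetical diagnostic helper, not a fix: it assumes the default `/tmp/tmp.XXXXXXXXXX` naming with ten alphanumeric characters, as seen in the diff above. The actual fix would be to build in a fixed directory.

```python
import re

# Assumption: mktemp's default template, /tmp/tmp. plus 10 alphanumerics.
MKTEMP_DIR = re.compile(r"/tmp/tmp\.[A-Za-z0-9]{10}")


def normalize(text: str) -> str:
    """Replace randomized mktemp directories with a fixed placeholder."""
    return MKTEMP_DIR.sub("/tmp/tmp.XXXXXXXXXX", text)


def differs_only_by_mktemp(left: str, right: str) -> bool:
    """True if the texts differ, but only in mktemp directory names."""
    return left != right and normalize(left) == normalize(right)


old = "dependency_libs=' -L/tmp/tmp.LWUIKDJ22E/x86_64-linux-gnu/libatomic/.libs -lpthread -lm'"
new = "dependency_libs=' -L/tmp/tmp.yeTnsy0FEm/x86_64-linux-gnu/libatomic/.libs -lpthread -lm'"
print(differs_only_by_mktemp(old, new))
```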

Buildx CLI

Buildx CLI should be updated to allow attesting reproducibility with a few commands. Notably, buildx build should have a flag like --repro from=gcc@sha256:... to import build args and base image digests from an SLSA provenance:

$ # "none://" is a filler for the build context arg
$ docker buildx build \
  --load \
  -t gcc:local \
  --repro from=gcc@sha256:f97e2719cd5138c932a814ca43f3ca7b33fde866e182e7d76d8391ec0b05091f \
  none://
...
[amd64] Using SLSA provenance sha256:7ecde97c24ea34e1409caf6e91123690fa62d1465ad08f638ebbd75dd381f08f
[amd64] Importing Dockerfile blob embedded in the provenance
[amd64] Importing build context https://github.com/docker-library/gcc.git#af458ec8254ef7ca3344f12631e2356b20b4a7f1:13
[amd64] Importing build-arg SOURCE_DATE_EPOCH=1690467916
[amd64] Importing buildpack-deps:bookworm from docker-image://buildpack-deps:bookworm@sha256:bccdd9ebd8dbbb95d41bb5d9de3f654f8cd03b57d65d090ac330d106c87d7ed
...

$ diffoci diff gcc@sha256:f97e2719cd5138c932a814ca43f3ca7b33fde866e182e7d76d8391ec0b05091f gcc:local
...

CI

We will also need to have a CI to periodically attest reproducibility with the proposed CLI above.

sudo-bmitch commented 10 months ago

SLSA provenance manifest blobs (enabled by default in recent buildx) are not reproducible by design, so the image index digest will not be reproducible.

The index could be reproducible if the artifacts were associated using the upcoming OCI subject/referrers API rather than injecting it directly in the index.

A digest can be embedded in a FROM instruction of a Dockerfile. However, image maintainers might not want to update Dockerfiles frequently just to ensure that the latest base image is picked up during the upstream build.

In that case, we can just leave FROM instructions unpinned, and let reproducers use the CONVERT action of source policies to dynamically replace the image identifier

The push from groups like Scorecard is to embed the digest directly in the Dockerfile and maintain the pin with a tool like renovate or dependabot. This gives reproducibility with controlled updates.

We will also need to have a CI to periodically attest reproducibility with the proposed CLI above.

I'd love to see work done on a user tool that can take any built image, or git commit, and verify the output. We can assist this work by annotating the images with details like the git commit used to build them and the SOURCE_DATE_EPOCH value. This could allow independent third-party verification of images, with signatures or attestations pushed to the verifier's own repository as detached artifacts (artifacts referencing an image that is not in that repository). Users could then add a policy on their side to require a third-party attestation from a list of trusted rebuilders.
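The user-side policy could be as simple as the following sketch. Everything in it is hypothetical (the attestation record shape, the rebuilder names, the `required` threshold); it only illustrates the idea of accepting an image once enough trusted rebuilders have attested to the same digest, and it leaves signature verification out entirely.

```python
def is_reproduced(image_digest, attestations, trusted_rebuilders, required=1):
    """Check whether enough trusted rebuilders attested to image_digest.

    attestations is a list of {"rebuilder": name, "digest": digest} records
    (a made-up shape for illustration). Each rebuilder counts once.
    """
    confirming = {
        record["rebuilder"]
        for record in attestations
        if record["digest"] == image_digest
        and record["rebuilder"] in trusted_rebuilders
    }
    return len(confirming) >= required


attestations = [
    {"rebuilder": "rebuilder-a.example", "digest": "sha256:aaaa"},
    {"rebuilder": "rebuilder-b.example", "digest": "sha256:aaaa"},
    {"rebuilder": "unknown.example", "digest": "sha256:aaaa"},
]
trusted = {"rebuilder-a.example", "rebuilder-b.example"}
print(is_reproduced("sha256:aaaa", attestations, trusted, required=2))
```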

AkihiroSuda commented 10 months ago

The index could be reproducible if the artifacts were associated using the upcoming OCI subject/referrers API rather than injecting it directly in the index.

Yes, but I guess adoption of the spec v1.1 in DOI is likely to take years, as users' site-local mirrors are mostly not ready for v1.1 yet.

The push from groups like Scorecard is to embed the digest directly in the Dockerfile and maintain the pin with a tool like renovate or dependabot. This gives reproducibility with controlled updates.

Yes, but maintainers might not want to see tons of bump-up PRs every day.

sudo-bmitch commented 10 months ago

The index could be reproducible if the artifacts were associated using the upcoming OCI subject/referrers API rather than injecting it directly in the index.

Yes, but I guess adoption of the spec v1.1 in DOI is likely to take years, as users' site-local mirrors are mostly not ready for v1.1 yet.

OCI 1.0 conformant mirrors already support it by use of a fallback tag. It's similar to how sigstore implements their attestations today.
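For reference, the fallback tag is derived purely from the subject digest, which is why a plain OCI 1.0 registry can serve it. A minimal sketch, based on my reading of the referrers tag schema in the OCI distribution spec v1.1 (`<alg>-<ref>`, with length limits on both parts); the function name is hypothetical and the truncation limits should be checked against the spec before relying on them:

```python
def referrers_fallback_tag(subject_digest: str) -> str:
    """Derive the fallback tag for a subject digest such as sha256:<hex>.

    Assumption: the algorithm is limited to 32 characters and the encoded
    part to 64, as I understand the OCI distribution spec v1.1.
    """
    algorithm, _, encoded = subject_digest.partition(":")
    return f"{algorithm[:32]}-{encoded[:64]}"


print(
    referrers_fallback_tag(
        "sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454"
    )
)
```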

The push from groups like Scorecard is to embed the digest directly in the Dockerfile and maintain the pin with a tool like renovate or dependabot. This gives reproducibility with controlled updates.

Yes, but maintainers might not want to see tons of bump-up PRs every day.

There's a constant bump-up already. One is invisible and hurts reproducibility, while the other is documented with a Git commit.

AkihiroSuda commented 10 months ago

The first PR:

tianon commented 10 months ago

I don't have the bandwidth to respond to all of this right now, so I'll focus on the bit that concerns me the most: the service provided at snapshot.debian.org is not currently well-maintained or well-staffed (hence the speed issues, but it's a much deeper problem), so I would be very uncomfortable intentionally adding to them the load of all our image builds/rebuilds.

At a high level, I think the best place to start here (especially as it would be the least disruptive) is making useful/interesting layers reproducible. For example, https://github.com/docker-library/golang/commit/46f40bd7c5706e873f5177719159849baa95b275 is an old PoC that @yosifkit worked on which provides that for the Go-providing layer of the Go images that I've been planning to revisit in the near future.

AkihiroSuda commented 10 months ago

I don't have the bandwidth to respond to all of this right now, so I'll focus on the bit that concerns me the most: the service provided at snapshot.debian.org is not currently well-maintained or well-staffed (hence the speed issues, but it's a much deeper problem), so I would be very uncomfortable intentionally adding to them the load of all our image builds/rebuilds.

This proposal does NOT enable snapshot.debian.org for the upstream builds. Third-party reproducers may opt-in to use snapshot.debian.org by providing --secret id=enable-repro-sources-list,source=/dev/null to repro the packages that were on the regular debian.org at the time of the SOURCE_DATE_EPOCH. (This secret is not really "secret" but made as a secret to avoid affecting the history object in OCI Image Config blob)

Could you take a look at the PR again?

At a high level, I think the best place to start here (especially as it would be the least disruptive) is making useful/interesting layers reproducible. For example, docker-library/golang@46f40bd is an old PoC that @yosifkit worked on which provides that for the Go-providing layer of the Go images that I've been planning to revisit in the near future.

In the case of httpd image, the useful/interesting layer is the layer of the binaries such as /usr/local/bin/httpd. To reproduce the binaries we have to reproduce at least the compiler packages.

AkihiroSuda commented 10 months ago

BTW, snapshot.ubuntu.com, which is hosted on Azure, seems fast enough to adopt as the default. https://ubuntu.com/blog/ubuntu-snapshots-on-azure-ensuring-predictability-and-consistency-in-cloud-deployments

Would it be acceptable to add Ubuntu variants to DOI? This would also be beneficial for Ubuntu users who are not interested in reproducible builds.

AkihiroSuda commented 10 months ago

@tianon Could you take a look? 🙏

codethief commented 10 months ago

I'm only an interested bystander but I've long been looking for reproducible images, so that, using them as base images, I can create reproducible images for my projects, too.

@AkihiroSuda Would your proposal also include publishing those images on the daily, tagged by date (i.e. corresponding to the snapshot package sources), like @tianon currently does for Debian on a ~monthly basis?

This would be incredibly useful downstream as one could then simply bump the date tag of the base image on the daily to upgrade dependencies & protect against vulnerabilities while still preserving reproducibility.

AkihiroSuda commented 10 months ago

@AkihiroSuda Would your proposal also include publishing those images on the daily, tagged by date (i.e. corresponding to the snapshot package sources), like @tianon currently does for Debian on a ~monthly basis?

This is an orthogonal topic.

jan-kiszka commented 9 months ago

FWIW, I went through the not-yet-fully-smooth process of making the kas build containers for Yocto and Isar reproducible (https://github.com/siemens/kas/commits/next). Our containers are based on Debian, and it would be great to see our dependencies reproducible as well.

BTW, I had some fun with understanding differences due to cached layers that are apparently not rewritten timestamp-wise. Is that a known issue of BuildKit? I'm now refraining from caching layers persistently on GH and from using the cache completely (unfortunately) when validating locally.

AkihiroSuda commented 9 months ago

BTW, I had some fun with understanding differences due to cached layers that are apparently not rewritten timestamp-wise.

Could you open an issue in https://github.com/moby/buildkit/issues ?

jan-kiszka commented 8 months ago

Done: https://github.com/moby/buildkit/issues/4748

codethief commented 7 months ago

@tianon

the service provided at snapshot.debian.org is not currently well-maintained or well-staffed (hence the speed issues, but it's a much deeper problem), so I would be very uncomfortable intentionally adding to them the load of all our image builds/rebuilds.

I just came across http://snapshot-cloudflare.debian.org . Given that @AkihiroSuda's repro-sources-list.sh uses it, I take it that mirror can be used more safely for the purposes of image (re-)builds?

tianon commented 7 months ago

Yes, I do not think that causes much (if any) less load on the upstream infrastructure given that I do not believe it caches the heavy queries, but cannot be certain as it does not seem to be officially documented or even discussed anywhere public.

There have been recent efforts within the project to reinvigorate support for the snapshot service (see https://lists.debian.org/debian-snapshot/2024/03/msg00000.html for the most recent posted notes) but they are going to take time.

AkihiroSuda commented 7 months ago

http://snapshot-cloudflare.debian.org

This seems still slow.

My current suggestion is to just build the upstream DOI images without pinning, and let downstream reproducers hook the Dockerfile to use snapshot[-cloudflare].debian.org or whatever they like.

AkihiroSuda commented 6 months ago

Here is the first batch of the PRs:

Is there anything left I have to do to get these PRs merged?

I saw a comment about ARG SOURCE_DATE_EPOCH, but didn't get what I have to do:

tianon commented 6 months ago

I'm sorry, this is still on my TODO list, but it is admittedly not a very high priority at the current time.

AkihiroSuda commented 6 months ago

I'm sorry, this is still on my TODO list, but it is admittedly not a very high priority at the current time.

@tianon I appreciate your hard work, and I know you have been very busy, but could you spare one minute to help me understand your comment about ARG SOURCE_DATE_EPOCH? https://github.com/tianon/docker-bash/pull/38#discussion_r1586880059 🙏

I understand that reviewing and merging PRs may take longer, but I want to make sure that we are heading in the same direction.

AkihiroSuda commented 6 months ago

Is there anything I can do to keep this actionable? 🙏

AkihiroSuda commented 5 months ago

Updated the PRs to take SOURCE_DATE_EPOCH from the source material (as in golang):

SOURCE_DATE_EPOCH="$(find /usr/src/bash -type f -exec stat -c '%Y' {} + | sort -nr | head -n1)"
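For readers less fluent in find/stat pipelines, the same idea (use the newest modification time among the regular files of the source tree) can be sketched in Python. This is only an equivalent illustration, not what the PRs use; the function name is hypothetical, and returning 0 for an empty tree is a choice made here.

```python
import os
import stat


def newest_mtime(root: str) -> int:
    """Return the newest mtime (whole seconds) among regular files under root.

    Symlinks are not followed and are skipped, mirroring `find -type f`.
    """
    newest = 0
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            info = os.lstat(os.path.join(directory, name))
            if stat.S_ISREG(info.st_mode):
                newest = max(newest, int(info.st_mtime))
    return newest
```

Deriving the value from the source material means it only changes when the sources change, rather than on every rebuild.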

Let me know if I'm still missing something to get these PRs merged 🙏

AkihiroSuda commented 5 months ago

Also submitted a merge request to add a guide to https://reproducible-builds.org/ :

fmoessbauer commented 4 months ago

Unfortunately, all Debian images since 20240531T083821Z are no longer reproducible, because the upstream mirror snapshot-cloudflare.d.o has stopped receiving updates. This is especially problematic because the logic that auto-redirects to the latest snapshot hides the problem.

Xref: https://github.com/reproducible-containers/repro-sources-list.sh/issues/17

AkihiroSuda commented 4 months ago

Unfortunately, all Debian images since 20240531T083821Z are no longer reproducible, because the upstream mirror snapshot-cloudflare.d.o has stopped receiving updates. This is especially problematic because the logic that auto-redirects to the latest snapshot hides the problem.

Xref: reproducible-containers/repro-sources-list.sh#17

Looks like we should just use https://snapshot.debian.org/ now

fmoessbauer commented 4 months ago

Looks like we should just use https://snapshot.debian.org/ now

The ML thread suggests using https://snapshot-mlm-01.debian.org/, but I don't trust that to be long-term stable either. On snapshot.d.o we are rate-limited, making it more or less impractical for real-world use cases.

Anyways, all images since 20240531T083821Z are lost.

AkihiroSuda commented 4 months ago

On snapshot.d.o we are rate-limited, making it more or less impractical for real-world use cases.

dpkgs can also be cached to an OCI registry or whatever (without altering the Dockerfile), if this PR can be merged:

Still, I don't know what the blocker is for getting this one (and the other PRs) merged.

tianon commented 4 months ago

As far as I have seen, the folks maintaining snapshot do not officially maintain (nor recommend for long-term usage) any hostname/URL other than the canonical snapshot.debian.org, which as of a few weeks ago (see https://lists.debian.org/debian-snapshot/2024/07/msg00000.html for example), includes the new snapshot-mlm-01.debian.org server in the official rotation:

$ dig snapshot-mlm-01.debian.org +short
185.213.153.170
$ dig snapshot.debian.org +short
185.17.185.185
185.213.153.170

As I noted in https://github.com/docker-library/official-images/issues/16044#issuecomment-2043589921 and subsequently confirmed via IRC with the folks currently maintaining the snapshot service, I don't know what that snapshot-cloudflare URL is/was, but my best guess is a partially implemented PoC that was never completed (and definitely never officially advertised/recommended anywhere I'm aware of or can find, even searching old mailing list archives).

AkihiroSuda commented 4 months ago

Right, only the canonical snapshot.debian.org should be used for reproductions.

Anyway, this only matters for third parties who want to verify the reproducibility of upstream images (with a Dockerfile hook proposed in https://github.com/moby/buildkit/pull/4669).

The PRs for the upstream DOI builds do not need to use snapshot.debian.org or something similar, so they should be ready to merge as-is:

AkihiroSuda commented 3 months ago

Opened a PR to set rewrite-timestamp=true:

AkihiroSuda commented 1 month ago

What's the current blocker for the PRs (linked in the OP)?

codethief commented 2 weeks ago

Looks like snapshot.debian.org recently(?) introduced relatively strict rate limiting. Or at least our pipelines suddenly started failing this morning and our organization is fairly small (a few dozen Docker builds per day and most builds are cached, anyway).

AkihiroSuda commented 2 weeks ago

snapshot.debian.org

Yes, this server is quite slow and unstable. It should only be used in reproducers' builds, not in the upstream builds.

fmoessbauer commented 2 weeks ago

Yes, this server is quite slow and unstable. It should only be used in reproducers' builds, not in the upstream builds.

Well... If your products should use a stable baseline as well, you also have to use it there. I'm currently working together with the people behind snapshot.d.o to improve the situation. For details, see: