docker-library / official-images

Primary source of truth for the Docker "Official Images" program
https://hub.docker.com/u/library
Apache License 2.0

Reproducible builds #16044

Open AkihiroSuda opened 5 months ago

AkihiroSuda commented 5 months ago

PRs:

I also talked about this at DockerCon 2023: https://medium.com/nttlabs/dockercon-2023-reproducible-builds-with-buildkit-for-software-supply-chain-security-0e5aedd1aaa7

Scope of reproduction

The digests of the image manifest blobs (and config blobs and layer blobs) should be reproducible. https://github.com/reproducible-containers/diffoci can be used for testing reproducibility.
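For illustration, each digest in question is the SHA-256 of the blob's bytes, so two builds match only when every blob is byte-for-byte identical. A minimal sketch, assuming GNU coreutils `sha256sum`; `blob_digest` and `same_blob` are hypothetical helper names and are not part of diffoci or BuildKit:

```shell
# blob_digest: print the OCI-style digest ("sha256:<hex>") of a file.
# Hypothetical helper for illustration only.
blob_digest() {
  printf 'sha256:%s\n' "$(sha256sum < "$1" | cut -d' ' -f1)"
}

# same_blob: succeed only if the two files have identical digests.
same_blob() {
  [ "$(blob_digest "$1")" = "$(blob_digest "$2")" ]
}
```

A tool such as diffoci goes further and reports which files or metadata fields differ, which a bare digest comparison cannot do.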

SLSA provenance manifest blobs (enabled by default in recent buildx) are not reproducible by design, so the image index digest will not be reproducible.

Reproducing base images

Base images have to be pinned by the sha256 digest for reproduction.

A digest can be embedded in a FROM instruction of a Dockerfile. However, image maintainers might not want to update Dockerfiles frequently just to ensure that the latest base image is picked up during the upstream build.

In that case, we can just leave FROM instructions unpinned, and let reproducers use the CONVERT action of source policies to dynamically replace the image identifier:

https://github.com/moby/buildkit/blob/v0.13.0-beta1/docs/build-repro.md

{
  "action": "CONVERT",
  "selector": {
    "identifier": "docker-image://docker.io/library/alpine:latest"
  },
  "updates": {
    "identifier": "docker-image://docker.io/library/alpine:latest@sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454"
  }
}

The digest to be used for the CONVERT action is recorded in the SLSA provenance. We need to update the buildx CLI to automatically generate CONVERT actions from SLSA provenances. (See the buildx CLI section below)
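Until such tooling exists, a reproducer could write the policy by hand. The sketch below only generates the JSON. It assumes that the rule is wrapped in a top-level "rules" array and that the file is handed to the build through the experimental EXPERIMENTAL_BUILDKIT_SOURCE_POLICY environment variable, as I understand the linked build-repro.md to describe; `write_convert_policy` is a hypothetical helper name:

```shell
# write_convert_policy: emit a source policy that pins one image to a digest.
# Arguments: <image ref> <digest>, e.g. docker.io/library/alpine:latest sha256:...
# Hypothetical helper; the "rules" wrapper is an assumption (see lead-in).
write_convert_policy() {
  ref="$1"
  digest="$2"
  cat <<EOF
{
  "rules": [
    {
      "action": "CONVERT",
      "selector": {
        "identifier": "docker-image://${ref}"
      },
      "updates": {
        "identifier": "docker-image://${ref}@${digest}"
      }
    }
  ]
}
EOF
}

# Possible usage (assumed, not verified against every buildx version):
#   write_convert_policy docker.io/library/alpine:latest \
#     sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454 > policy.json
#   EXPERIMENTAL_BUILDKIT_SOURCE_POLICY=policy.json docker buildx build .
```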

Reproducing package versions

Debian and Ubuntu

Debian and Ubuntu keep old packages on http://snapshot.debian.org and http://snapshot.ubuntu.com. I wrote a script to rewrite /etc/apt/sources.list to use those snapshot servers: https://github.com/reproducible-containers/repro-sources-list.sh/blob/master/repro-sources-list.sh The snapshot timestamp can be supplied via $SOURCE_DATE_EPOCH.
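To show how $SOURCE_DATE_EPOCH maps to a snapshot, here is a minimal sketch that formats an epoch as the UTC timestamp used in snapshot URLs. It assumes GNU `date`; the `archive/debian` path and the helper names are assumptions for illustration and this is not an excerpt from repro-sources-list.sh:

```shell
# snapshot_timestamp: format a Unix epoch as YYYYMMDDTHHMMSSZ (UTC).
# Relies on GNU date's "-d @<epoch>" syntax.
snapshot_timestamp() {
  date -u -d "@$1" +%Y%m%dT%H%M%SZ
}

# snapshot_url: build a snapshot.debian.org archive URL for that instant.
# Hypothetical helper; the "archive/debian" path is an assumption.
snapshot_url() {
  printf 'http://snapshot.debian.org/archive/debian/%s/\n' "$(snapshot_timestamp "$1")"
}
```

For example, `snapshot_url 1690467916` yields a URL containing `20230727T142516Z`.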

These snapshot servers are quite slow and sometimes flaky (especially for Debian), so they probably should not be used for the upstream builds.

See https://github.com/moby/buildkit/pull/4669 for how to rewrite /etc/apt/sources.list in downstream builds.

Alpine

Reproducing apk packages is still challenging, as Alpine does not have snapshot servers.

The long-term plan is to capture apk packages at build time and attach them to the image as artifacts:

Reproducing file timestamps

BuildKit v0.13 supports rewriting the timestamps of the files inside image layers to use $SOURCE_DATE_EPOCH.

--output type=image,name=docker.io/username/image,push=true,rewrite-timestamp=true

https://github.com/moby/buildkit/blob/v0.13.0-beta1/docs/build-repro.md#source_date_epoch
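As a local analogue of what timestamp rewriting achieves, the sketch below clamps the mtime of every entry in a directory tree to a given epoch. This is not BuildKit's implementation, and whether BuildKit clamps or unconditionally overwrites is not asserted here. It assumes GNU `find`, `stat` and `touch`; `clamp_mtimes` is a hypothetical helper name:

```shell
# clamp_mtimes: set the mtime of every entry under <dir> that is newer than
# <epoch> back to <epoch>. Entries that are already older are left alone.
# Hypothetical helper; an illustration only, not BuildKit's code.
clamp_mtimes() {
  find "$1" -exec sh -c '
    epoch="$1"; shift
    for f in "$@"; do
      # stat -c %Y prints the mtime as seconds since the epoch (GNU stat).
      if [ "$(stat -c %Y "$f")" -gt "$epoch" ]; then
        # -h changes a symlink itself rather than its target.
        touch -h -d "@$epoch" "$f"
      fi
    done
  ' sh "$2" {} +
}
```

Running `clamp_mtimes ./rootfs "$SOURCE_DATE_EPOCH"` before archiving a tree removes one common source of differing layer digests.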

Removal of logs, etc.

The entries in /var/cache/ldconfig/aux-cache are organised as an associative array, with the keys including file attributes like device number, inode number and inode change time. This means it is not only unreproducible, but completely useless at boot time since the device and inode numbers of libraries will be different.

https://linux.debian.kernel.narkive.com/7wfNAf7A/bug-845034-initramfs-tools-please-ensure-initrd-images-are-reproducible#post3
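A minimal sketch of such a cleanup step, assuming the files to drop are the ldconfig aux-cache and anything under /var/log. The list is illustrative rather than exhaustive, and `scrub_rootfs` is a hypothetical helper name:

```shell
# scrub_rootfs: delete files under a root filesystem that embed
# host-specific state. Hypothetical helper; the list is not exhaustive.
scrub_rootfs() {
  root="${1%/}"
  # Keys in this cache include device and inode numbers.
  rm -f "$root/var/cache/ldconfig/aux-cache"
  # Remove log files but keep the /var/log directory tree itself.
  if [ -d "$root/var/log" ]; then
    find "$root/var/log" -type f -delete
  fi
}
```

In a Dockerfile this would typically run at the end of the same RUN step that installs packages, for example `scrub_rootfs /`, so the files never reach the layer.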

Reproducing file contents

Some Dockerfiles will need extra work to reproduce file contents, e.g., sorting arrays, removing randomized mktemp paths, ...

e.g., in case of gcc:

diff -ur --no-dereference a/usr/local/lib64/libgo.la b/usr/local/lib64/libgo.la
--- a/usr/local/lib64/libgo.la  2024-01-12 18:14:56.000000000 +0900
+++ b/usr/local/lib64/libgo.la  2024-01-12 18:21:45.000000000 +0900
@@ -17,7 +17,7 @@
 inherited_linker_flags=' -pthread'

 # Libraries that this one depends upon.
-dependency_libs=' -L/tmp/tmp.LWUIKDJ22E/x86_64-linux-gnu/libatomic/.libs -lpthread -lm'
+dependency_libs=' -L/tmp/tmp.yeTnsy0FEm/x86_64-linux-gnu/libatomic/.libs -lpthread -lm'

 # Names of additional weak libraries provided by this library
 weak_library_names=''
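When triaging a diff like this one, it can help to confirm that the random mktemp directory is the only difference. The sketch below is a diagnostic aid, assuming the GNU mktemp default of a `tmp.` prefix followed by 10 alphanumeric characters; `normalize_mktemp` is a hypothetical helper name. The actual fix would be for the Dockerfile to build in a fixed directory:

```shell
# normalize_mktemp: read text on stdin and replace mktemp-style
# /tmp/tmp.XXXXXXXXXX path components with a fixed placeholder.
# Hypothetical helper for comparing two build outputs.
normalize_mktemp() {
  sed -E 's#/tmp/tmp\.[A-Za-z0-9]{10}#/tmp/tmp.XXXXXXXXXX#g'
}
```

If `normalize_mktemp < a/usr/local/lib64/libgo.la` and the same command on the `b/` copy produce identical output, the temporary path is the only source of the mismatch in that file.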

Buildx CLI

The buildx CLI should be updated to allow attesting reproducibility with a few commands. Notably, buildx build should have a flag like --repro from=gcc@sha256:... to import build args and base image digests from a SLSA provenance:

$ # "none://" is a filler for the build context arg
$ docker buildx build \
  --load \
  -t gcc:local \
  --repro from=gcc@sha256:f97e2719cd5138c932a814ca43f3ca7b33fde866e182e7d76d8391ec0b05091f \
  none://
...
[amd64] Using SLSA provenance sha256:7ecde97c24ea34e1409caf6e91123690fa62d1465ad08f638ebbd75dd381f08f
[amd64] Importing Dockerfile blob embedded in the provenance
[amd64] Importing build context https://github.com/docker-library/gcc.git#af458ec8254ef7ca3344f12631e2356b20b4a7f1:13
[amd64] Importing build-arg SOURCE_DATE_EPOCH=1690467916
[amd64] Importing buildpack-deps:bookworm from docker-image://buildpack-deps:bookworm@sha256:bccdd9ebd8dbbb95d41bb5d9de3f654f8cd03b57d65d090ac330d106c87d7ed
...

$ diffoci diff gcc@sha256:f97e2719cd5138c932a814ca43f3ca7b33fde866e182e7d76d8391ec0b05091f gcc:local
...

CI

We will also need a CI job to periodically attest reproducibility with the CLI proposed above.

sudo-bmitch commented 5 months ago

SLSA provenance manifest blobs (enabled by default in recent buildx) are not reproducible by design, so the image index digest will not be reproducible.

The index could be reproducible if the artifacts were associated using the upcoming OCI subject/referrers API rather than injecting it directly in the index.

A digest can be embedded in a FROM instruction of a Dockerfile. However, image maintainers might not want to update Dockerfiles frequently just to ensure that the latest base image is picked up during the upstream build.

In that case, we can just leave FROM instructions unpinned, and let reproducers use the CONVERT action of source policies to dynamically replace the image identifier

The push from groups like Scorecard is to embed the digest directly in the Dockerfile and maintain the pin with a tool like renovate or dependabot. This gives reproducibility with controlled updates.

We will also need a CI job to periodically attest reproducibility with the CLI proposed above.

I'd love to see work done on a user tool that can take any built image, or git commit, and verify the output. We can assist this work by annotating the images with details like the git commit used to build them and the SOURCE_DATE_EPOCH value. This could allow independent third-party verification of images, with signatures or attestations pushed to their repository as detached artifacts (artifacts referencing an image that is not in that repository). Users could then add a policy on their side to require a third-party attestation from a list of trusted rebuilders.

AkihiroSuda commented 5 months ago

The index could be reproducible if the artifacts were associated using the upcoming OCI subject/referrers API rather than injecting it directly in the index.

Yes, but I guess adoption of the spec v1.1 in DOI is likely to take years, as users' site-local mirrors are mostly not ready for v1.1 yet.

The push from groups like Scorecard is to embed the digest directly in the Dockerfile and maintain the pin with a tool like renovate or dependabot. This gives reproducibility with controlled updates.

Yes, but maintainers might not want to see tons of version-bump PRs every day.

sudo-bmitch commented 5 months ago

The index could be reproducible if the artifacts were associated using the upcoming OCI subject/referrers API rather than injecting it directly in the index.

Yes, but I guess adoption of the spec v1.1 in DOI is likely to take years, as users' site-local mirrors are mostly not ready for v1.1 yet.

OCI 1.0 conformant mirrors already support it by use of a fallback tag. It's similar to how sigstore implements their attestations today.

The push from groups like Scorecard is to embed the digest directly in the Dockerfile and maintain the pin with a tool like renovate or dependabot. This gives reproducibility with controlled updates.

Yes, but maintainers might not want to see tons of version-bump PRs every day.

There's a constant bump-up already. One is invisible and hurts reproducibility, while the other is documented with a Git commit.

AkihiroSuda commented 5 months ago

The first PR:

tianon commented 5 months ago

I don't have the bandwidth to respond to all of this right now, so I'll focus on the bit that concerns me the most: the service provided at snapshot.debian.org is not currently well-maintained or well-staffed (hence the speed issues, but it's a much deeper problem), so I would be very uncomfortable intentionally adding to them the load of all our image builds/rebuilds.

At a high level, I think the best place to start here (especially as it would be the least disruptive) is making useful/interesting layers reproducible. For example, https://github.com/docker-library/golang/commit/46f40bd7c5706e873f5177719159849baa95b275 is an old PoC that @yosifkit worked on which provides that for the Go-providing layer of the Go images that I've been planning to revisit in the near future.

AkihiroSuda commented 5 months ago

I don't have the bandwidth to respond to all of this right now, so I'll focus on the bit that concerns me the most: the service provided at snapshot.debian.org is not currently well-maintained or well-staffed (hence the speed issues, but it's a much deeper problem), so I would be very uncomfortable intentionally adding to them the load of all our image builds/rebuilds.

This proposal does NOT enable snapshot.debian.org for the upstream builds. Third-party reproducers may opt in to using snapshot.debian.org by providing --secret id=enable-repro-sources-list,source=/dev/null to reproduce the packages that were on the regular debian.org at the time of the SOURCE_DATE_EPOCH. (This secret is not really "secret"; it is passed as a secret to avoid affecting the history object in the OCI Image Config blob.)
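The opt-in check could look roughly like the sketch below. It assumes the Dockerfile uses `RUN --mount=type=secret,id=enable-repro-sources-list`, which mounts the secret at /run/secrets/<id> by default; the helper names and the path parameter are hypothetical and exist only so the logic can be exercised outside a build:

```shell
# repro_enabled: succeed if the opt-in marker file exists.
# The optional argument overrides the default secret mount path.
repro_enabled() {
  [ -e "${1:-/run/secrets/enable-repro-sources-list}" ]
}

# choose_apt_source: print which package source a build step would use.
# Hypothetical helper; "snapshot" and "default" are illustrative labels.
choose_apt_source() {
  if repro_enabled "$1"; then
    echo snapshot
  else
    echo default
  fi
}
```

Because the marker arrives as a secret mount rather than a build arg, the upstream build and a downstream reproduction can share the same Dockerfile instructions.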

Could you take a look at the PR again?

At a high level, I think the best place to start here (especially as it would be the least disruptive) is making useful/interesting layers reproducible. For example, docker-library/golang@46f40bd is an old PoC that @yosifkit worked on which provides that for the Go-providing layer of the Go images that I've been planning to revisit in the near future.

In the case of the httpd image, the useful/interesting layer is the layer containing binaries such as /usr/local/bin/httpd. To reproduce the binaries, we have to reproduce at least the compiler packages.

AkihiroSuda commented 5 months ago

BTW snapshot.ubuntu.com, hosted on Azure, seems fast enough to adopt as the default. https://ubuntu.com/blog/ubuntu-snapshots-on-azure-ensuring-predictability-and-consistency-in-cloud-deployments

Would it be acceptable to add Ubuntu variants to DOI? This would also benefit Ubuntu users who are not interested in reproducible builds.

AkihiroSuda commented 5 months ago

@tianon Could you take a look? 🙏

codethief commented 5 months ago

I'm only an interested bystander but I've long been looking for reproducible images, so that, using them as base images, I can create reproducible images for my projects, too.

@AkihiroSuda Would your proposal also include publishing those images on the daily, tagged by date (i.e. corresponding to the snapshot package sources), like @tianon currently does for Debian on a ~monthly basis?

This would be incredibly useful downstream as one could then simply bump the date tag of the base image on the daily to upgrade dependencies & protect against vulnerabilities while still preserving reproducibility.

AkihiroSuda commented 5 months ago

@AkihiroSuda Would your proposal also include publishing those images on the daily, tagged by date (i.e. corresponding to the snapshot package sources), like @tianon currently does for Debian on a ~monthly basis?

This is an orthogonal topic.

jan-kiszka commented 3 months ago

FWIW, I've gone through the not-yet-fully-smooth process of making the kas build containers for Yocto and Isar reproducible (https://github.com/siemens/kas/commits/next). Our containers are based on Debian, and it would be great to see our dependencies reproducible as well.

BTW, I had some fun with understanding differences due to cached layers that are apparently not rewritten timestamp-wise. Is that a known issue of BuildKit? I'm now refraining from caching layers persistently on GH and from using the cache completely (unfortunately) when validating locally.

AkihiroSuda commented 3 months ago

BTW, I had some fun with understanding differences due to cached layers that are apparently not rewritten timestamp-wise.

Could you open an issue in https://github.com/moby/buildkit/issues ?

jan-kiszka commented 3 months ago

Done: https://github.com/moby/buildkit/issues/4748

codethief commented 2 months ago

@tianon

the service provided at snapshot.debian.org is not currently well-maintained or well-staffed (hence the speed issues, but it's a much deeper problem), so I would be very uncomfortable intentionally adding to them the load of all our image builds/rebuilds.

I just came across http://snapshot-cloudflare.debian.org . Given that @AkihiroSuda's repro-sources-list.sh uses it, I take it that mirror can be used more safely for the purposes of image (re-)builds?

tianon commented 2 months ago

Yes, I do not think that causes much (if any) less load on the upstream infrastructure given that I do not believe it caches the heavy queries, but cannot be certain as it does not seem to be officially documented or even discussed anywhere public.

There have been recent efforts within the project to reinvigorate support for the snapshot service (see https://lists.debian.org/debian-snapshot/2024/03/msg00000.html for the most recent posted notes) but they are going to take time.

AkihiroSuda commented 2 months ago

http://snapshot-cloudflare.debian.org

This still seems slow.

My current suggestion is to just build the upstream DOI images without pinning, and let downstream reproducers hook the Dockerfile to use snapshot[-cloudflare].debian.org or whatever they like.

AkihiroSuda commented 1 month ago

Here is the first batch of the PRs:

Is there anything left I have to do to get these PRs merged?

I saw a comment about ARG SOURCE_DATE_EPOCH, but didn't get what I have to do:

tianon commented 1 month ago

I'm sorry, this is still on my TODO list, but it is admittedly not a very high priority at the current time.

AkihiroSuda commented 1 month ago

I'm sorry, this is still on my TODO list, but it is admittedly not a very high priority at the current time.

@tianon I appreciate your hard work, and I know you have been very busy, but could you allocate one minute to help me understand your comment about ARG SOURCE_DATE_EPOCH? https://github.com/tianon/docker-bash/pull/38#discussion_r1586880059 🙏

I understand that reviewing and merging PRs may take longer, but I want to make sure that we are heading in the same direction.

AkihiroSuda commented 1 month ago

Is there anything I can do to keep this actionable? 🙏

AkihiroSuda commented 3 weeks ago

Updated the PRs to take SOURCE_DATE_EPOCH from the source material (as in golang):

SOURCE_DATE_EPOCH="$(find /usr/src/bash -type f -exec stat -c '%Y' {} + | sort -nr | head -n1)"

Let me know if I'm still missing something to get these PRs merged 🙏

AkihiroSuda commented 1 week ago

Also submitted a merge request to add a guide to https://reproducible-builds.org/ :