aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECR] [request]: support cache manifest #876

Closed lifeofguenter closed 8 months ago

lifeofguenter commented 4 years ago

Would be great if ECR could support cache-manifest (see: https://medium.com/titansoft-engineering/docker-build-cache-sharing-on-multi-hosts-with-buildkit-and-buildx-eb8f7005918e)

NOTE FROM AWS: We shipped this on BuildKit 0.12, see here for details - https://aws.amazon.com/blogs/containers/announcing-remote-cache-support-in-amazon-ecr-for-buildkit-clients/. We are keeping this issue open for the time being to allow the community to discuss and gather further feedback.
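
For reference, a minimal sketch of using the shipped support with a BuildKit 0.12+ builder (for example, buildx with the docker-container driver) might look like the following; the account, region, and repository names are placeholders, and the image-manifest/oci-mediatypes cache options follow the linked announcement:

# Assumes docker login to ECR has already happened and the builder runs BuildKit >= 0.12
docker buildx build . \
  -t <account>.dkr.ecr.<region>.amazonaws.com/my-app:latest \
  --cache-to type=registry,ref=<account>.dkr.ecr.<region>.amazonaws.com/my-app:cache,mode=max,image-manifest=true,oci-mediatypes=true \
  --cache-from type=registry,ref=<account>.dkr.ecr.<region>.amazonaws.com/my-app:cache \
  --push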

TBBle commented 3 years ago

BuildKit 0.8 will default to using an OCI media type for its caches (see https://github.com/moby/buildkit/pull/1746) which I assume should make this work, but I haven't tested it myself.

aleks-fofanov commented 3 years ago

It still doesn't work with the recently released BuildKit 0.8.0. It can write the layers and config, but it is unable to upload the manifest to ECR:

=> ERROR exporting cache                                                                                                     5.4s
 => => preparing build cache for export                                                                                       0.2s
 => => writing layer sha256:0d48cc65d93fe2ee9877959ff98ebc98b95fe4b2fc467ff50f27103c1c5d6973                                  0.3s
 => => writing layer sha256:2ade286d53f2e045413601ca0e3790de3792ea34abd3d025cd2cd9c3cb5231de                                  0.3s
 => => writing layer sha256:64befcf53942ba04c144cde468548885d497e238001e965e983e39eb947860c2                                  0.3s
 => => writing layer sha256:7415f0cbea8739c1bf353568b16ac74a9cfbc0b36327602e3a025abf919a38a6                                  0.3s
 => => writing layer sha256:76a1f73c618c30eb1b1d90cf043fe3f855a1cce922d1fb47458defd3dbe1c783                                  0.3s
 => => writing layer sha256:8674739c0ada3e834b816667d26dd185aa5ea089f33701f11a05b7be03f43026                                  0.3s
 => => writing layer sha256:9dc80bcd2805b2a441bd69bc9468df2e81994239e34879567bed7bdef6cb605d                                  0.3s
 => => writing layer sha256:cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08                                  0.3s
 => => writing layer sha256:ce4e6de84945ab498f65d16920c9b801dfea3792871e44f89e6438e232a690b3                                  0.3s
 => => writing layer sha256:d46583c5d4c69b34cb46866838d68f53a38686dc7f2d1347ae0f252e8eb0ed4c                                  0.2s
 => => writing config sha256:33c76a0f8a74a06e461926d8a8d1845371c0cf9e86753db2483a4873aede8889                                 2.0s
 => => writing manifest sha256:0f69a7e6626f6a24a0a95ed915613ebdf9459280d4986879480d87e34849aea8                               0.6s
------
 > importing cache manifest from XXXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/test-repo:buildcache:
------
------
 > exporting cache:
------
error: failed to solve: rpc error: code = Unknown desc = error writing manifest blob: failed commit on ref "sha256:0f69a7e6626f6a24a0a95ed915613ebdf9459280d4986879480d87e34849aea8": unexpected status: 400 Bad Request
errm commented 3 years ago

I am seeing the same error on buildkit 0.8.0

even when setting oci-mediatypes explicitly to true: --export-cache type=registry,ref=${REPO}:buildcache,oci-mediatypes=true

 => ERROR exporting cache                                                                                                     1.4s
 => => preparing build cache for export                                                                                       0.0s
 => => writing layer sha256:757d39990544d20fbebf7a88e29a5dd2bb6a4fdb116d67df9fe8056843da794d                                  0.1s
 => => writing layer sha256:7597eaba0060104f2bd4f3c46f0050fcf6df83066870767af41c2d7696bb33b2                                  0.1s
 => => writing config sha256:0e308fd4eee4cae672eee133cbd77ef7c197fa5d587110b59350a99b289f7000                                 0.8s
 => => writing manifest sha256:8eb142b16e0ec25db4517f2aecff795cca2b1adbe07c32f5c571efc5c808cbcd                               0.3s
------
 > importing cache manifest from xxx.dkr.ecr.us-east-1.amazonaws.com/errm/test:buildcache:
------
------
 > exporting cache:
------
error: failed to solve: rpc error: code = Unknown desc = error writing manifest blob: failed commit on ref "sha256:8eb142b16e0ec25db4517f2aecff795cca2b1adbe07c32f5c571efc5c808cbcd": unexpected status: 400 Bad Request

Daemon logs:

time="2020-12-09T13:42:48Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
time="2020-12-09T13:44:09Z" level=warning msg="reference for unknown type: application/vnd.buildkit.cacheconfig.v0"
time="2020-12-09T13:44:10Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = error writing manifest blob: failed commit on ref \"sha256:8eb142b16e0ec25db4517f2aecff795cca2b1adbe07c32f5c571efc5c808cbcd\": unexpected status: 400 Bad Request\n"
AlexLast commented 3 years ago

Also seeing this for private repos, although it doesn't seem to be an issue with public ECR repos.

n1ru4l commented 3 years ago

Is there a timeframe available for this feature request? It could tremendously speed up CI builds.

jellevanhees commented 3 years ago

We have been experimenting with this BuildKit feature for some time now and it works wonders. Currently, we are still dependent upon Docker Hub, so having this functionality in private ECR would greatly benefit our CI/CD workflow.

davidfm commented 3 years ago

Any indication as to if/when this will ever be available? Using BuildKit would really improve our CI build times.

devopsmash commented 3 years ago

One year passed and still nothing 😔

pieterza commented 3 years ago

We'd really like to see support of this with ECR private repos 🙏

As of today, it still does not work:

error: failed to solve: rpc error: code = Unknown desc = error writing manifest blob: failed commit on ref "sha256:75f32e1bb4df7c6333dc352ea3ea9d04d1e04e4a14ba79b59daa019074166519": unexpected status: 400 Bad Request
hf commented 3 years ago

Yes please!

abatilo commented 3 years ago

Can we get any kind of communications on this?

renannprado commented 2 years ago

Is there any workaround available?

ynouri commented 2 years ago

For teams using GitHub but wishing to keep images in ECR, it is possible to leverage the cache manifest support from GitHub Container Registry (GHCR) and push the image to ECR at the same time. When pushing to ECR, only new layers get pushed.

Github Actions workflow example:

jobs:

  docker_build:
    strategy:
      matrix:
        name:
          - my-image
        include:
          - name: my-image
            registry_ecr: my-aws-account-id.dkr.ecr.us-east-1.amazonaws.com
            registry_ghcr: ghcr.io/my-github-org-name
            dockerfile: ./path/to/Dockerfile
            context: .
            extra_args: ''

    steps:
      - uses: actions/checkout@v2

      - name: Install Buildkit
        uses: docker/setup-buildx-action@v1
        id: buildx
        with:
          install: true

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
          role-skip-session-tagging: true
          role-duration-seconds: 1800
          role-session-name: GithubActionsBuildDockerImages

      - name: Login to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v1

      - name: Build & Push (ECR)
        # - https://docs.docker.com/engine/reference/commandline/buildx_build/
        # - https://github.com/moby/buildkit#export-cache
        run: |
          docker buildx build \
            --cache-from=type=registry,ref=${{ matrix.registry_ghcr }}/${{ matrix.name }}:cache \
            --cache-to=type=registry,ref=${{ matrix.registry_ghcr }}/${{ matrix.name }}:cache,mode=max \
            --push \
            ${{ matrix.extra_args }} \
            -f ${{ matrix.dockerfile }} \
            -t ${{ matrix.registry_ecr }}/${{ matrix.name }}:${{ github.sha }} \
            ${{ matrix.context }}
abatilo commented 2 years ago

@ynouri Just be careful of your storage costs in GHCR. It's oddly expensive. I found https://github.com/snok/container-retention-policy to help solve that use case for me.

kgns commented 2 years ago

ECR still not supporting this is unbelievably amateurish; it doesn't suit AWS...

pieterza commented 2 years ago

Is there any workaround available?

Use another Docker registry: Docker Hub, or perhaps your own tiny EC2 with some fat storage. Sucks, but AWS doesn't seem interested.

poldridge commented 2 years ago

This seems to have started working unannounced, at least when using docker 20.10.11 to build

ramosbugs commented 2 years ago

This seems to have started working unannounced, at least when using docker 20.10.11 to build

I'm still seeing error writing manifest blob with 400 Bad Request on Docker 5:20.10.12~3-0~ubuntu-focal, at least in us-west-2.

kgns commented 2 years ago

This seems to have started working unannounced, at least when using docker 20.10.11 to build

is this confirmed?

BeyondEvil commented 2 years ago

This seems to have started working unannounced, at least when using docker 20.10.11 to build

is this confirmed?

I'm wondering the same thing.

Could you share some more info @poldridge ?

eduard-malakhov commented 2 years ago

I've just faced the same issue with Docker version 20.10.12, build e91ed57. Would appreciate any hints or workarounds.

sherifabdlnaby commented 2 years ago

This seems to have started working unannounced, at least when using docker 20.10.11 to build

Did not work for me using docker:20.10.11-dind and ECR us-west-2.

sherifabdlnaby commented 2 years ago

Can we get any kind of communication on this? Being able to use remote cache is a major benefit to all our build pipelines.

diclophis commented 2 years ago

I am also super intrigued to see a field report of what progress has occurred and which aspects of OCI layer caching are supported in ECR right now.

ayk33 commented 2 years ago

Do we have any update on this? Can we get any kind of response from AWS?

pieterza commented 2 years ago

Would also like to know when this will be available

erebe commented 2 years ago

Just stumbled on this during our migration to AWS. This kind of sucks, as it breaks our pipeline logic... Would also be interested in an ETA for this feature.

hlarsen commented 2 years ago

We gave up a while back and threw up our own (ALB + EC2 + S3) registry - setup was pretty quick. We finally got around to trying it and it appears to work great. We're still storing/pulling images in ECR for ECS; we only use our registry for the cache.

https://docs.docker.com/registry/deploying/

Still need to look into automatically cleaning up old images...
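
For anyone weighing the same route, a minimal sketch of the self-hosted option is running the open-source registry with its S3 storage driver, per the deployment docs linked above; the bucket name and region are placeholders, and a real setup would add TLS (for example behind an ALB) plus IAM credentials or an instance role with access to the bucket:

# config.yml, mounted over the image default at /etc/docker/registry/config.yml
version: 0.1
storage:
  s3:
    region: us-east-1
    bucket: my-build-cache-bucket
http:
  addr: :5000

# Run the registry container with the config above
docker run -d -p 5000:5000 --name build-cache-registry \
  -v "$PWD/config.yml:/etc/docker/registry/config.yml" \
  registry:2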

chavan-suraj commented 2 years ago

Waiting on this very important feature request. This is blocking our migration to Graviton instances, as multi-arch build caching doesn't work without it, causing builds to take far too long to complete.

arunsollet commented 2 years ago

We have started looking into the technical approaches & feasibility to support this.

Garrett-R commented 2 years ago

@arunsollet would you have an ETA on this?

rstanevich commented 2 years ago

just in case

BuildKit's inline cache works as expected on AWS ECR.

And yes, if you want to use layer caching for multi-stage builds, you need a separate registry cache, which doesn't work on AWS ECR.
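
For anyone who wants to try the inline route, a rough sketch with buildx might look like the following (repository names are placeholders; as discussed further down, inline cache only records layers from the final stage):

docker buildx build . \
  -t <account>.dkr.ecr.<region>.amazonaws.com/my-app:latest \
  --cache-to type=inline \
  --cache-from type=registry,ref=<account>.dkr.ecr.<region>.amazonaws.com/my-app:latest \
  --push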

sherifabdlnaby commented 2 years ago

@rstanevich Inline cache has been very inconsistent for us, especially with multi-stage images; it gets cache misses when it shouldn't.

rstanevich commented 2 years ago

@rstanevich Inline cache has been very inconsistent for us, especially with multi-stage images; it gets cache misses when it shouldn't.

Yep, inline cache is not designed for multi-stage builds, since it stores only layers from the final stage. In other cases I found inline to be a better solution.

For multi-stage build cache we can still use Sonatype Nexus, Artifactory, Docker Hub, etc., and set up a strict retention policy there.

pieterza commented 2 years ago

@rstanevich Inline cache has been very inconsistent for us, especially with multi-stage images; it gets cache misses when it shouldn't.

Yep, inline cache is not designed for multi-stage builds, since it stores only layers from the final stage. In other cases I found inline to be a better solution.

For multi-stage build cache we can still use Sonatype Nexus, Artifactory, Docker Hub, etc., and set up a strict retention policy there.

Let's please stop talking about alternatives; we need AWS ECR to support multi-stage layer caching, and we don't care if Docker Hub does it.

MattDelac commented 2 years ago

We have started looking into the technical approaches & feasibility to support this.

Hi @arunsollet any progress regarding this issue?

arunsollet commented 2 years ago

We continue to investigate and will provide an update as soon as more information is available.

tomprimozic commented 2 years ago

Until AWS fixes this, I found this solution. There's quite a lot of subtlety that took me a while to figure out, but now it works!

This is specifically for (1) caching multi-stage builds (2) on fresh instances (with no instance-local cache).

Basically, do this:

export DOCKER_BUILDKIT=1

# Pull the previous build-stage image (if it exists) so it can seed the cache
docker pull 000000000000.dkr.ecr.us-east-1.amazonaws.com/my-project:build || true

# Rebuild the build stage with inline cache metadata embedded in the image
docker build . -f docker/Dockerfile --target build --build-arg BUILDKIT_INLINE_CACHE=1 \
  -t 000000000000.dkr.ecr.us-east-1.amazonaws.com/my-project:build \
  --cache-from 000000000000.dkr.ecr.us-east-1.amazonaws.com/my-project:build

# Pull the previous final image (if it exists) as a second cache source
docker pull 000000000000.dkr.ecr.us-east-1.amazonaws.com/my-project:latest || true

# Build the final stage, using both the build-stage and final images as cache
docker build . -f docker/Dockerfile --build-arg BUILDKIT_INLINE_CACHE=1 \
  -t 000000000000.dkr.ecr.us-east-1.amazonaws.com/my-project:latest \
  --cache-from 000000000000.dkr.ecr.us-east-1.amazonaws.com/my-project:build \
  --cache-from 000000000000.dkr.ecr.us-east-1.amazonaws.com/my-project:latest

# Push both tags so the next fresh instance can reuse them for caching
docker push 000000000000.dkr.ecr.us-east-1.amazonaws.com/my-project:build
docker push 000000000000.dkr.ecr.us-east-1.amazonaws.com/my-project:latest

If anyone has issues, feel free to contact me!

rafavallina commented 2 years ago

Hi everyone! I'm Rafa, a new PM in the team, and I'm taking over investigating this issue. Would love to chat with some of the people in here that have expressed an interest in it - we want to make sure that this is well built in a way that is useful and doesn't introduce any problems or issues.

I'm getting up to speed, so some education would definitely help. I know everyone has been quite patient, so I apologize in advance for asking for more.

You can DM me at twitter.com/rafavallina or also reach out to your account team. If that doesn't work, please leave a comment here with your preferred way to get in touch and we'll find time!

automartin5000 commented 1 year ago

Hi everyone! I'm Rafa, a new PM in the team, and I'm taking over investigating this issue. Would love to chat with some of the people in here that have expressed an interest in it - we want to make sure that this is well built in a way that is useful and doesn't introduce any problems or issues.

I'm getting up to speed, so some education would definitely help. I know everyone has been quite patient, so I apologize in advance for asking for more.

You can DM me at twitter.com/rafavallina or also reach out to your account team. If that doesn't work, please leave a comment here with your preferred way to get in touch and we'll find time!

Use case is: GitHub Actions Docker layer caching sucks and this feature would be a better alternative

diclophis commented 1 year ago

A fair number of "hosted CI" platforms (similar to GHA) have this problem with utilizing Docker layer caching: most are built with extremely ephemeral runtimes (which is a good thing), but it would be really advantageous to have Docker layer caching supported at the registry/repo layer, because it is most durable there.

dschaaff commented 1 year ago

As others have mentioned, we'd like to use this feature to improve caching of builds in GitLab CI pipelines. This particular feature is required to be able to cache the intermediary stages in multi-stage Dockerfiles. Without it, there is no way to cache those layers, even when nothing has changed. The ability to do this would significantly speed up the build time for a few applications in our stack. Saving 5 minutes on every build pipeline really adds up after a while.

Makeshift commented 1 year ago

A not-ideal-but-workable workaround for GitHub Actions at the moment is to use docker/build-push-action with the buildkit engine and utilise the experimental s3 cache on buildkit's master branch, like so (extracted from a private action, so apologies for all the variables, but hopefully still gives an idea of how to implement it):

runs:
  using: composite
  steps:
    - name: Set up Docker Buildx
      id: buildx
      uses: docker/setup-buildx-action@v2
      with:
        # We're using an experimental version of buildx for s3 support
        driver-opts: image=moby/buildkit:master

    - name: Configure AWS Credentials
      uses: aws-actions/configure-aws-credentials@v1
      with:
        aws-region: eu-west-2
        aws-access-key-id: ${{ inputs.AWS_ECR_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ inputs.AWS_ECR_SECRET_ACCESS_KEY }}
        role-session-name: "${{ github.event.repository.name }}-GithubActions-build.yml"

    - name: Login to AWS ECR
      if: success()
      uses: aws-actions/amazon-ecr-login@v1

    - name: Docker metadata (${{ inputs.AWS_ECR_REPOSITORY }})
      if: success()
      id: meta-conf
      uses: docker/metadata-action@v4.0.1
      with:
        images: ${{ inputs.AWS_ECR_REPOSITORY }}
        tags: |
          # Rather than using type=schedule which only creates a tag on a schedule, we want date tags on every build
          type=raw,value={{date 'YYYYMMDD'}}${{ inputs.TAG_SUFFIX }}
          # refs
          type=ref,event=branch,suffix=${{ inputs.TAG_SUFFIX }}
          type=ref,event=tag,suffix=${{ inputs.TAG_SUFFIX }}
          type=ref,event=pr,suffix=${{ inputs.TAG_SUFFIX }}
          # sha's
          type=sha,prefix=,suffix=${{ inputs.TAG_SUFFIX }}
          type=sha,prefix=,format=long,suffix=${{ inputs.TAG_SUFFIX }}
          # set latest tag for master branch
          type=raw,value=latest,enable=${{ github.ref_name == 'master' }},suffix=${{ inputs.TAG_SUFFIX }}

    - name: Push Container (${{ inputs.AWS_ECR_REPOSITORY }})
      if: success()
      id: build
      uses: docker/build-push-action@v3.1.1
      with:
        file: ${{ steps.dockerfile.outputs.result }}
        context: ${{ steps.build-context-dirname.outputs.value }}
        tags: ${{ steps.meta-conf.outputs.tags }}
        labels: ${{ steps.meta-conf.outputs.labels }}
        platforms: ${{ inputs.PLATFORMS }}
        builder: ${{ steps.buildx.outputs.name }}
        push: true
        pull: true
        # ECR _still_ doesn't support inline-cache manifests, so we use the experimental s3 cache instead
        # https://github.com/aws/containers-roadmap/issues/876
        # https://github.com/moby/buildkit#s3-cache-experimental
        # Use a separate scope per dockerfile to avoid conflicts
        cache-from: type=s3,region=eu-west-2,bucket=<BUCKET NAME>,prefix=gha_docker_build_cache/${{ github.event.repository.name }}/${{ steps.dockerfile_underscored.outputs.value }}
        cache-to: type=s3,region=eu-west-2,bucket=<BUCKET NAME>,prefix=gha_docker_build_cache/${{ github.event.repository.name }}/${{ steps.dockerfile_underscored.outputs.value }},mode=max
        build-args: ${{ inputs.BUILD_ARGS }}

You can do it without docker/build-push-action and just call the build via the CLI using the correct cache-from and cache-to args as well, but I don't have an example of that handy.
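
For illustration, a plain CLI invocation roughly equivalent to the workflow above might look like this; the bucket, prefix, and repository values are placeholders mirroring the action inputs:

docker buildx build . \
  --cache-from "type=s3,region=eu-west-2,bucket=<BUCKET NAME>,prefix=gha_docker_build_cache/my-repo/my_dockerfile" \
  --cache-to "type=s3,region=eu-west-2,bucket=<BUCKET NAME>,prefix=gha_docker_build_cache/my-repo/my_dockerfile,mode=max" \
  -t <account>.dkr.ecr.eu-west-2.amazonaws.com/my-repo:latest \
  --push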

rstanevich commented 1 year ago

Yes, @Makeshift, experimental remote S3 cache seems to be the best option for AWS now. I see one point that needs to be improved - currently, we must push the same blobs from the final image to both ECR and S3. When ECR supports the cache manifest, I expect cache and image pushing to be faster.

rafavallina commented 1 year ago

Thanks so much to everyone that provided more input here and on Twitter.

We are investigating the best way to proceed. BuildKit is relying on a relatively unorthodox use of the OCI specification to enable this feature (see here), and we want to make sure we do our best to continue adhering to the standard while we support customers. OCI Artifacts (see here) seem to be the more adequate tool for the cache manifest. I've asked for more information on the BuildKit repo to see the progress being made there, and we are also looking at the effort on our side.

Is anyone in this thread using or intending to use tools other than Docker buildx to build their images using the cache manifest? Want to make sure we are as thorough as possible and support customers across the board.

Would also love to hear how people are working around this limitation today, as it provides additional context about the underlying customer need.

Makeshift commented 1 year ago

Would also love to hear how people are working around this limitation today, as it provides additional context about the underlying customer need.

In our case, before Buildx had support for S3 caches: since we host an S3-backed registry as a pull-through cache for our ECS clusters/build workers anyway (to cache base images pulled from Docker Hub, as they implemented pull restrictions), it was pretty easy to configure that as a build cache store as well. I can understand why people are frustrated at having to host an additional service as well though, as for some it can mean quite a lot of additional infrastructure to support, if they're not able to just tack it on to their existing setup.

Thank you for the links you provided, by the way. It really helped my understanding of how ECR (and apparently most suppliers who follow the spec!) came to not support caches produced by buildx.

tavlima commented 1 year ago

FWIW, my team uses buildctl directly, instead of the buildx wrapper.


automartin5000 commented 1 year ago

FWIW, my team uses buildctl directly, instead of the buildx wrapper.

Do you have an example of that?

tavlima commented 1 year ago

Do you have an example of that?

Sure. Please see the examples here.
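
For readers wanting a concrete starting point, a minimal sketch of driving BuildKit directly with buildctl and a registry cache might look like the following; the image and cache references are placeholders, it assumes a running buildkitd and registry login, and (at the time of this comment) the --export-cache target would need to be a registry other than ECR:

buildctl build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=image,name=<registry>/my-app:latest,push=true \
  --import-cache type=registry,ref=<registry>/my-app:buildcache \
  --export-cache type=registry,ref=<registry>/my-app:buildcache,mode=max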

dmarkey commented 1 year ago

For those looking for guidance on the S3 build cache with buildx, this worked for me:

    - uses: actions/checkout@v3
    - name: configure aws credentials
    - name: Set up QEMU
      uses: docker/setup-qemu-action@v2
    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v2
    - name: Buildx master
      run: docker buildx create --bootstrap --driver docker-container --driver-opt image=moby/buildkit:master --use

and

docker buildx build . \
  --cache-from=type=s3,region=eu-west-1,bucket=bucket-name,name=docker-cache/myapp,access_key_id=$(AWS_ACCESS_KEY_ID),secret_access_key=$(AWS_SECRET_ACCESS_KEY),session_token=$(AWS_SESSION_TOKEN) \
  --cache-to=type=s3,region=eu-west-1,bucket=bucket-name,name=docker-cache/myapp,access_key_id=$(AWS_ACCESS_KEY_ID),secret_access_key=$(AWS_SECRET_ACCESS_KEY),session_token=$(AWS_SESSION_TOKEN)

Will probably wait until S3 cache is released before using in production.