aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws_ecr_assets: produces invalid tasks by linking to empty "attestation" image layer #30258

Open anentropic opened 3 months ago

anentropic commented 3 months ago

Describe the bug

I started getting the following error when trying to run my Fargate tasks:

"StoppedReason": "CannotPullContainerError: ref pull has been retried 1 time(s): failed to unpack image on snapshotter overlayfs: failed to extract layer sha256:c02342326b04a05fa0fc4c703c4aaa8ffb46cc0f2eda269c4a0dee53c8296182: failed to get stream processor for application/vnd.in-toto+json: no processor for media-type: unknown"

If I go into the AWS web UI to the task definition I can find the ID of the ECR image that it points to.

Then if I look at that ECR image I can see it has a size of 0:

(screenshot 2024-05-17 17:12: ECR image details showing size 0)

I can see in my ECR images list that since 10 May every CDK deployment has pushed a zero-size image to ECR instead of the expected one:

(screenshot 2024-05-17 17:14: ECR images list with zero-size images)
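For anyone wanting to check their own registry for these, a minimal boto3 sketch (the repository name is a placeholder for your CDK container-assets repo) could list suspiciously small images:

import boto3

ecr = boto3.client("ecr")

# Placeholder: substitute your CDK bootstrap container-assets repository name.
REPO = "cdk-hnb659fds-container-assets-ACCOUNT-REGION"

paginator = ecr.get_paginator("describe_images")
for page in paginator.paginate(repositoryName=REPO):
    for detail in page["imageDetails"]:
        # A real image is many MB; the broken ones are effectively empty.
        if detail["imageSizeInBytes"] < 10_000:
            print(detail.get("imageTags"), detail["imageSizeInBytes"], detail["imagePushedAt"])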

I have the following CDK code:

        django_command_img = ecr_assets.DockerImageAsset(
            self,
            "Django Command Image",
            directory="./",
            target="fargate-task",
            build_args={
                "python_version": global_task_config.python_version,
            },
            platform=ecr_assets.Platform.LINUX_ARM64
            if is_arm64
            else ecr_assets.Platform.LINUX_AMD64,
        )

        _task = ecs.FargateTaskDefinition(
            self,
            "Django Command Task",
            cpu=config.cpu,
            memory_limit_mib=config.memory_size,
            runtime_platform=ecs.RuntimePlatform(
                cpu_architecture=ecs.CpuArchitecture.ARM64
                if is_arm64
                else ecs.CpuArchitecture.X86_64,
                operating_system_family=ecs.OperatingSystemFamily.LINUX,
            ),
            family=f"{resource_prefix}-task-django-command",
        )
        container_name = f"{resource_prefix}-container-django-command"
        _task.add_container(
            "Django Command Container",
            image=ecs.ContainerImage.from_docker_image_asset(django_command_img),
            container_name=container_name,
            environment={
                "ETL": "true",
                **common_django_env,
            },
            # health_check=ecs.HealthCheck(),
            logging=ecs.LogDrivers.aws_logs(
                stream_prefix="containers",
                log_group=log_group,
            ),
        )

(Before 10 May I had deployed and run Fargate tasks successfully from this definition.)

Expected Behavior

A usable ECS task definition is deployed

Current Behavior

Inscrutable error message

It appears that CDK has created the task definition against an invalid ECR image

Reproduction Steps

See above

Additional Information/Context

I have located what seems to be the cause, with help from this issue thread: https://github.com/moby/moby/issues/45600

Using aws ecr batch-get-image I can see the following manifest for the problematic zero-size image:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:1b22b41bc8846dc11d2718b173aa7481e06fe4e3893b410032e02f858da7d165",
    "size": 167
  },
  "layers": [
    {
      "mediaType": "application/vnd.in-toto+json",
      "digest": "sha256:875402577bcf06a0681056a91feeb0ce68f41fa30ad735ae802e555f1519351d",
      "size": 1464,
      "annotations": {
        "in-toto.io/predicate-type": "https://slsa.dev/provenance/v0.2"
      }
    }
  ]
}

This seems to relate to the error message and fit with the details in the moby issue linked above.

Basically, when cdk deploy builds the image locally (via docker buildx), extra "attestation" entries get added to the root manifest list (for reasons I don't fully understand).

I guess by themselves these aren't harmful (they are part of the OCI spec), but CDK is maybe not expecting them and ends up pushing and tagging the wrong thing into ECR.
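The same check can be scripted; a rough boto3 sketch (repository name and tag are placeholders) that flags a tag pointing at an attestation manifest rather than a runnable image:

import json

import boto3

ecr = boto3.client("ecr")

# Placeholders: substitute the repository and the asset-hash tag CDK pushed.
resp = ecr.batch_get_image(
    repositoryName="cdk-hnb659fds-container-assets-ACCOUNT-REGION",
    imageIds=[{"imageTag": "YOUR_ASSET_HASH_TAG"}],
)

for image in resp["images"]:
    manifest = json.loads(image["imageManifest"])
    layer_types = {layer["mediaType"] for layer in manifest.get("layers", [])}
    if "application/vnd.in-toto+json" in layer_types:
        # The tag resolves to an in-toto attestation, not an actual image.
        print("attestation-only manifest:", image["imageId"].get("imageDigest"))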

Possible Solution

BUILDX_NO_DEFAULT_ATTESTATIONS=1 cdk deploy worked for me (after adding an arbitrary change to my Dockerfile to force a rebuild)

I think it would be better if CDK explicitly passed --provenance=false in its calls to docker buildx

See https://docs.docker.com/reference/cli/docker/buildx/build/#provenance and https://docs.docker.com/build/attestations/attestation-storage/

CDK CLI Version

2.142.0 (build 289a1e3)

Framework Version

No response

Node.js Version

v18.18.0

OS

macOS 14.4.1

Language

Python

Language Version

3.11.5

Other information

No response

anentropic commented 3 months ago

What is a bit strange is that it would make most sense if this resulted from a recent change in, say, Docker Desktop.

But AFAICT attestations have been added by default for a lot longer than the last week (since maybe Jan 2023: https://www.docker.com/blog/highlights-buildkit-v0-11-release/#1-slsa-provenance)

So maybe I just got lucky the first couple of times I deployed my Fargate task

anentropic commented 3 months ago

I think this issue is also affecting Lambda functions that use the Docker image runtime

I had a bunch of deployment issues the last couple of days where I would get an error like:

18:14:44 | UPDATE_FAILED | AWS::Lambda::Function | LambdasFromDockerImageCommand996BC9A4 Resource handler returned message: "Resource of type 'AWS::Lambda::Function' with identifier 'docker-lambda-command' did not stabilize." (RequestToken: 654bcf62-b81e-bf4c-1eff-408af63620cc, HandlerErrorCode: NotStabilized)

At first I thought there was something wrong in the meaningful part of the Dockerfile they were built from; I made a change, redeployed, and the problem seemed to go away.

But then later when deploying from another branch I made the same 'fix' and it didn't work.

Subsequently I made a connection between my ECS/Fargate issue above and the fact that only my 'Docker image' Lambda functions seemed to have deployment problems; the 'Python runtime' one was doing OK.

I then tried BUILDX_NO_DEFAULT_ATTESTATIONS=1 cdk deploy and it failed.

But that was because I hadn't changed the Dockerfile, so no new image was pushed (?)

Then I added a RUN echo "hello" to the Dockerfile and tried again and this time they deployed ok.

To be clear I think without the BUILDX_NO_DEFAULT_ATTESTATIONS=1 flag it is random whether we end up with a usable ECR image or not.

If attestations are important then I think something in CDK needs to be aware of them to avoid this issue (pushing and tagging the wrong part of the OCI manifest as the image to use). Or else just force buildx not to create them in the first place.

pahud commented 3 months ago

(Before 10 May I had previously deployed and ran Fargate tasks successfully from this definition)

Did you redeploy it after that time? What have you changed, as it seemed to be working before?

And are you able to provide a sample Dockerfile and CDK code snippets so we could reproduce this in our environment?

anentropic commented 3 months ago

I believe it's random whether the right part of the image gets tagged and pushed,

so any reproduction attempt is going to need some way to repeatedly force the docker image to be rebuilt
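One way to force that on every attempt (purely for reproduction) might be to salt the asset fingerprint via the extra_hash prop of DockerImageAsset, so a new asset is built and pushed even when the Dockerfile is unchanged; a rough sketch (the construct id is made up, and this would sit inside the stack like the snippet above):

import time

from aws_cdk import aws_ecr_assets as ecr_assets

# Changing extra_hash changes the asset hash, so every synth/deploy
# builds and pushes a new image even if nothing else changed.
repro_img = ecr_assets.DockerImageAsset(
    self,  # the enclosing Stack/Construct
    "Repro Image",
    directory="./",
    extra_hash=str(time.time()),
)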

I am puzzled why it started happening now; I assumed some update to either cdk or Docker Desktop.

anentropic commented 3 months ago

Another possibility is that it relates to build/deploy failures

All of the below omits the BUILDX_NO_DEFAULT_ATTESTATIONS=1 env var.

I just had this failure from a cdk deploy:

#10 DONE 40.3s

#11 exporting to image
#11 exporting layers
#11 exporting layers 10.8s done
#11 exporting manifest sha256:560ce72bba99f87f99b5a72af0501d6d04fe6abd49f31585a4c9efc4b6cf8f37 0.0s done
#11 exporting config sha256:4a44d9a6522f7abf267fb08a4dd7892f78cd3782a3044e7defe305731fec6c0e done
#11 exporting attestation manifest sha256:dc2b75a2fe487b379f48f6dc7ee4700084e396689201e122f7ebed55214a1c4a 0.0s done
#11 exporting manifest list sha256:41f53826a2731afc38744f496350aab4e2a225d935690c12c671e34b612d3ccd 0.0s done
#11 naming to docker.io/library/cdkasset-6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:latest done
#11 unpacking to docker.io/library/cdkasset-6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:latest
#11 unpacking to docker.io/library/cdkasset-6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:latest 3.4s done
#11 DONE 14.3s

View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/hnu3dvjn0b1qsrdl77umb7xyq

What's Next?
  View a summary of image vulnerabilities and recommendations → docker scout quickview
my-project-website-dev-eu:  success: Built 6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:570110252051-eu-west-1

 ❌ Deployment failed: Error: Failed to build asset 6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:570110252051-eu-west-1
    at Deployments.buildSingleAsset (/Users/anentropic/.nvm/versions/node/v18.18.0/lib/node_modules/aws-cdk/lib/index.js:443:11302)
    at async Object.buildAsset (/Users/anentropic/.nvm/versions/node/v18.18.0/lib/node_modules/aws-cdk/lib/index.js:443:197148)
    at async /Users/anentropic/.nvm/versions/node/v18.18.0/lib/node_modules/aws-cdk/lib/index.js:443:181290

Failed to build asset 6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:570110252051-eu-west-1

(at approx 16:07 local time)

This fails without reaching the "IAM Statement Changes" confirmation step of the deployment.

And if I look in ECR:

(screenshot 2024-05-21 16:09: ECR showing a single zero-size image)

So the single zero-size image corresponds to the failed build I just experienced.

Then if I don't update the content of my Dockerfile perhaps a subsequent successful deploy will not push a new image?

I try again and get the same error as above. Another zero-size image seems to get pushed:

(screenshot 2024-05-21 16:48: a second zero-size image in ECR)

I try a third time and this time the deploy proceeds and the IAM confirmation is reached, but after that it eventually fails with the same "did not stabilize" error from my earlier comment:

"Resource of type 'AWS::Lambda::Function' with identifier 'docker-lambda-command' did not stabilize."

No further images have been pushed by this 3rd failed deployment.

anentropic commented 3 months ago

I try again setting BUILDX_NO_DEFAULT_ATTESTATIONS=1

"Resource of type 'AWS::Lambda::Function' with identifier 'docker-lambda-command' did not stabilize."

as expected

I add a RUN echo "hello" to the Dockerfile and try again with BUILDX_NO_DEFAULT_ATTESTATIONS=1

deployed successfully, with two full-size images:

(screenshot 2024-05-21 17:38: two full-size images in ECR)

which I believe are one for my Lambdas (which all share the same code but have different CMDs) and one for the Fargate task