gardener / gardener-extension-registry-cache

Gardener extension controller which deploys pull-through caches for container registries.
Apache License 2.0

Investigate why the registry cache cannot pull blobs from ECR #259

Open ialidzhikov opened 1 month ago

ialidzhikov commented 1 month ago

How to categorize this issue?

/area quality /kind bug

What happened: The registry cache for some reason cannot pull blobs from ECR (at least from public.ecr.aws).

What you expected to happen: The registry cache to pull images from ECR.

How to reproduce it (as minimally and precisely as possible):

  1. Create a Shoot with cache for upstream public.ecr.aws

  2. Create a Pod from the upstream, for example public.ecr.aws/nginx/nginx:1.23.0

  3. Verify that the registry-cache fails to pull the blobs

Logs:

time="2024-09-19T14:23:24.143601968Z" level=error msg="response completed with error" err.code=unknown err.detail="unauthorized: " err.message="unknown error" go.version=go1.22.4 http.request.host="10.4.4.82:5000" http.request.id=46e13e3b-667d-44d4-bae2-6166c692b88a http.request.method=GET http.request.remoteaddr="10.3.0.1:2562" http.request.uri="/v2/nginx/nginx/blobs/sha256:f3d3961ba57b97cee8dea2cdc950856e6c3f4f6d1ba2fadaf5bdf069557bc469?ns=public.ecr.aws" http.request.useragent=containerd/v1.7.18 http.response.contenttype=application/json http.response.duration=263.540084ms http.response.status=500 http.response.written=84 instance.id=1c940ca5-d392-476a-8f73-18c5c4b26817 service=registry vars.digest="sha256:f3d3961ba57b97cee8dea2cdc950856e6c3f4f6d1ba2fadaf5bdf069557bc469" vars.name=nginx/nginx version=3.0.0-beta.1
time="2024-09-19T14:23:24.143602427Z" level=error msg="response completed with error" err.code=unknown err.detail="unauthorized: " err.message="unknown error" go.version=go1.22.4 http.request.host="10.4.4.82:5000" http.request.id=5e9c9254-da0f-48dd-be58-c31a66f8cec2 http.request.method=GET http.request.remoteaddr="10.3.0.1:39548" http.request.uri="/v2/nginx/nginx/blobs/sha256:778ddef5c8e3dfac8ba7265cbd22065f975b42a467e899f753d6d42d1b069da4?ns=public.ecr.aws" http.request.useragent=containerd/v1.7.18 http.response.contenttype=application/json http.response.duration=265.722708ms http.response.status=500 http.response.written=84 instance.id=1c940ca5-d392-476a-8f73-18c5c4b26817 service=registry vars.digest="sha256:778ddef5c8e3dfac8ba7265cbd22065f975b42a467e899f753d6d42d1b069da4" vars.name=nginx/nginx version=3.0.0-beta.1
time="2024-09-19T14:23:24.147671677Z" level=error msg="response completed with error" err.code=unknown err.detail="unauthorized: " err.message="unknown error" go.version=go1.22.4 http.request.host="10.4.4.82:5000" http.request.id=6ab4f416-338e-4714-9112-ad7ceb18cf70 http.request.method=GET http.request.remoteaddr="10.3.0.1:43359" http.request.uri="/v2/nginx/nginx/blobs/sha256:78979650788c06290785aaf0b0b200bd5c5e20285eec32c5684d93310ee38b67?ns=public.ecr.aws" http.request.useragent=containerd/v1.7.18 http.response.contenttype=application/json http.response.duration=255.817666ms http.response.status=500 http.response.written=84 instance.id=1c940ca5-d392-476a-8f73-18c5c4b26817 service=registry vars.digest="sha256:78979650788c06290785aaf0b0b200bd5c5e20285eec32c5684d93310ee38b67" vars.name=nginx/nginx version=3.0.0-beta.1
10.3.0.1 - - [19/Sep/2024:14:23:23 +0000] "GET /v2/nginx/nginx/blobs/sha256:78979650788c06290785aaf0b0b200bd5c5e20285eec32c5684d93310ee38b67?ns=public.ecr.aws HTTP/1.1" 500 84 "" "containerd/v1.7.18"
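To sift through log lines like the ones above, a small parser can help. The following is a minimal sketch, assuming the registry log lines are space-separated key=value pairs with optional double quotes, as they appear here. The helper names `parse_logfmt` and `is_failed_blob_request` are hypothetical and not part of the registry or of this extension; the sample line is a shortened copy of the first log entry.

```python
import shlex


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a key=value log line into a dict; quoted values may contain spaces."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


def is_failed_blob_request(fields: dict[str, str]) -> bool:
    """True for a blob request that was answered with a 5xx status."""
    uri = fields.get("http.request.uri", "")
    status = fields.get("http.response.status", "")
    return "/blobs/" in uri and status.startswith("5")


# Shortened copy of the first log entry from this issue.
sample = (
    'time="2024-09-19T14:23:24.143601968Z" level=error '
    'msg="response completed with error" err.code=unknown '
    'err.detail="unauthorized: " err.message="unknown error" '
    'http.request.method=GET '
    'http.request.uri="/v2/nginx/nginx/blobs/sha256:'
    'f3d3961ba57b97cee8dea2cdc950856e6c3f4f6d1ba2fadaf5bdf069557bc469'
    '?ns=public.ecr.aws" '
    'http.response.status=500 vars.name=nginx/nginx'
)

fields = parse_logfmt(sample)
print(fields["err.detail"], fields["http.response.status"])
print(is_failed_blob_request(fields))
```

Filtering on `err.detail` and the `/blobs/` path segment makes it easy to confirm that only blob requests fail while manifest requests succeed.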

Anything else we need to know?: Similar upstream issue: https://github.com/distribution/distribution/issues/4383

Credits to @dimitar-kostadinov for this finding

Environment:

oliver-goetz commented 1 week ago

I just stumbled over this issue 😄 Does this occur on all shoots or on AWS shoots only? If it is the latter, it might be related to some AWS credential helpers and the permissions of the VMs or their service accounts.

I remember similar issues with gcsweb on prow on GCP when the VMs had a GCP service account. The application was aware of the service account because of the GCP metadata service and tried to use it. The service account did not have any permissions to access the storage buckets (it was a storage bucket in the gcsweb case). Even though the bucket was public, gcsweb could not access it. I could imagine that something similar could happen for other applications on other hyperscalers too.

It is just an idea that came to my mind. I have not investigated the registry-cache case at all yet.

dimitar-kostadinov commented 1 week ago

The issue occurs on all shoots, even in the local setup. What we observe is that image indexes and manifests are successfully cached, but downloading the image layers fails with http.response.status=500.

erfanw commented 1 week ago

I found some weirdness about public ECR:

According to the public ECR authentication documentation (https://docs.aws.amazon.com/AmazonECR/latest/public/public-registry-auth.html), Amazon ECR Public supports the [Docker Registry HTTP API](https://docs.docker.com/registry/spec/api/), with the exception of the tags API. However, you must provide an authorization token with every HTTP request.

When I tried, for example, curl -u AWS:<ecr-public-password> https://public.ecr.aws/v2, it didn't work ({"errors":[{"code":"DENIED","message":"Your Authorization Token is invalid."}]}). So, according to the document above, a token must be injected into every request when using HTTP API authentication.

I think this is the same reason why the distribution registry (I used 3.0.0-beta.1) does not work with public ECR.

However, when it comes to private ECR, the same curl -u AWS:<ecr-private-password> https://aws_account_id.dkr.ecr.region.amazonaws.com/v2 does work. And the distribution registry 3.0.0-beta.1 can work as a pull-through cache for private ECR.
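To make the difference between the two authentication styles concrete, here is a minimal sketch of the Authorization header values involved. `curl -u user:password` sends HTTP Basic credentials, while a token-based request carries the token in a Bearer header. The helper names and the `<authorization-token>` placeholder are assumptions for illustration only; the sketch sends no request and does not verify which scheme public ECR actually accepts.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Header value that `curl -u username:password` produces (HTTP Basic)."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def bearer_auth_header(token: str) -> str:
    """Header value for a request that carries an authorization token."""
    return "Bearer " + token


# Placeholders only; substitute real credentials when experimenting.
basic = basic_auth_header("AWS", "<ecr-public-password>")
bearer = bearer_auth_header("<authorization-token>")

print({"Authorization": basic})
print({"Authorization": bearer})
```

If the hypothesis in this comment holds, a pull-through cache would need to attach the second kind of header to every upstream request, including the blob requests that currently fail.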

I can only conclude that this is a limitation of public ECR.