Open ialidzhikov opened 2 months ago
I just stumbled over this issue 😄 Does this occur on all shoots or on AWS shoots only? If it is the latter, it might be related to some AWS credential helpers and the permissions of the VMs or their service accounts.
I remember similar issues with gcsweb on Prow on GCP when the VMs had a GCP service account. The application was aware of the service account because of the GCP metadata service and tried to use it. The service account had no permissions at all to access the storage bucket that gcsweb was serving. Even though the bucket was public, gcsweb could not access it. I could imagine that something similar could happen for other applications on other hyperscalers too.
It is just an idea that came to my mind. I have not investigated the registry-cache case at all yet.
The issue occurs on all shoots, even in the local setup.
What we observe is that image indexes and manifests are successfully cached, but downloading the image layers fails with `http.response.status=500`.
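To narrow down the observation above, one could request a manifest and a layer blob through the cache and compare the HTTP status codes. The following is only a sketch: the cache address and the digest are placeholders that you would have to fill in, and the two URL shapes are the standard registry HTTP API paths (`/v2/<name>/manifests/<reference>` and `/v2/<name>/blobs/<digest>`).

```python
import urllib.error
import urllib.request


def manifest_url(registry, repository, reference):
    """Registry HTTP API path for a manifest (by tag or digest)."""
    return f"{registry.rstrip('/')}/v2/{repository}/manifests/{reference}"


def blob_url(registry, repository, digest):
    """Registry HTTP API path for a layer or config blob."""
    return f"{registry.rstrip('/')}/v2/{repository}/blobs/{digest}"


def http_status(url, accept=None):
    """Return the HTTP status code of a GET request, including error codes."""
    request = urllib.request.Request(url)
    if accept:
        request.add_header("Accept", accept)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def probe(cache, repository, tag, layer_digest):
    """Compare manifest and blob status codes (cache address is an assumption)."""
    return {
        "manifest": http_status(manifest_url(cache, repository, tag)),
        "blob": http_status(blob_url(cache, repository, layer_digest)),
    }
```

With a port-forward to the cache, a call such as `probe("http://localhost:5000", "nginx/nginx", "1.23.0", "<layer-digest>")` would be expected to show a successful manifest request and a 500 for the blob if the behaviour described here is reproduced.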
I found some weirdness about public ECR:
https://docs.aws.amazon.com/AmazonECR/latest/public/public-registry-auth.html
According to the public ECR authentication document, Amazon ECR Public supports the [Docker Registry HTTP API](https://docs.docker.com/registry/spec/api/), with the exception of the tags API. However, you must provide an authorization token with every HTTP request.
When I tried, for example, `curl -u AWS:<ecr-public-password> https://public.ecr.aws/v2`, it didn't work (`{"errors":[{"code":"DENIED","message":"Your Authorization Token is invalid."}]}`). So, according to the document above, a token must be injected into every request when using HTTP API authentication.
I think this is the same reason why the distribution registry (I used 3.0.0-beta.1) does not work with public ECR.
However, when it comes to private ECR, the same `curl -u AWS:<ecr-private-password> https://aws_account_id.dkr.ecr.region.amazonaws.com/v2` actually works, and distribution registry 3.0.0-beta.1 can work as a pull-through cache for private ECR.
I can only conclude that this is a limitation with public ECR.
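The difference between the two requests can be illustrated by the `Authorization` header each one sends. `curl -u AWS:<password>` sends HTTP Basic credentials, while a token-based request carries the token itself. The sketch below assumes the token is sent as a `Bearer` value, which is my reading of the linked document and has not been verified against public ECR here. The function names are made up for illustration.

```python
import base64


def basic_auth_header(username, password):
    """Header equivalent to `curl -u username:password` (HTTP Basic)."""
    credentials = f"{username}:{password}".encode()
    return {"Authorization": "Basic " + base64.b64encode(credentials).decode()}


def bearer_auth_header(token):
    """Header carrying an authorization token on every request.

    Assumption: the registry expects the token as a Bearer value.
    """
    return {"Authorization": f"Bearer {token}"}
```

If this reading is right, a pull-through cache that only forwards Basic credentials would be accepted by private ECR but rejected by public ECR, which matches what is described above.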
How to categorize this issue?
/area quality /kind bug
What happened: The registry cache for some reason cannot pull blobs from ECR (at least from `public.ecr.aws`).
What you expected to happen: The registry cache to pull images from ECR.
How to reproduce it (as minimally and precisely as possible):
1. Create a Shoot with a cache for upstream `public.ecr.aws`
2. Create a Pod from the upstream, for example `public.ecr.aws/nginx/nginx:1.23.0`
3. Verify that the registry-cache fails to pull the blobs
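The Pod from the reproduction steps could be generated with a short script like the one below. It is a minimal sketch: the Pod name `ecr-nginx` is a made-up example, and only the image reference comes from this issue. The script prints a JSON manifest, which `kubectl` accepts in the same way as YAML.

```python
import json


def pod_manifest(name, image):
    """Build a minimal single-container Pod manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {"containers": [{"name": name, "image": image}]},
    }


# "ecr-nginx" is a hypothetical name; the image is the one from this issue.
print(json.dumps(pod_manifest("ecr-nginx", "public.ecr.aws/nginx/nginx:1.23.0"), indent=2))
```

Saved as, say, `pod.py`, the output can be piped into the Shoot cluster with `python pod.py | kubectl apply -f -`.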
Logs:
Anything else we need to know?: Similar upstream issue: https://github.com/distribution/distribution/issues/4383
Credits to @dimitar-kostadinov for this finding
Environment:
Kubernetes version (use `kubectl version`):