Closed — anouarchattouna closed this issue 2 years ago
Recursive deletion on GCR can take a very, very, very long time. It's honestly probably still running. The reason for this is that the Docker registry API does not permit partial paging. So to recursively delete your resources, gcr-cleaner has to page over every repository your credential has access to, which includes all public GCR repos.
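The forward-only paging that forces this can be sketched in plain shell. This is a hedged illustration, not gcr-cleaner's actual code: the Docker Registry v2 catalog endpoint (`GET /v2/_catalog?n=<size>&last=<repo>`) only lets a client walk forward page by page, and `fetch_page` below is a local stub standing in for the authenticated call against the registry.

```shell
#!/bin/sh
# Hedged sketch of forward-only catalog paging. fetch_page is a local stub
# standing in for the authenticated curl call to /v2/_catalog; the real call
# passes n=<page size> and last=<last repo seen> as query parameters.
fetch_page() {
  case "$1" in
    "")     echo "repo-a repo-b" ;;  # first page (no cursor yet)
    repo-b) echo "repo-c" ;;         # page after cursor repo-b
    *)      echo "" ;;               # empty page: no more repositories
  esac
}

repos=""
last=""
while :; do
  page=$(fetch_page "$last")
  [ -z "$page" ] && break
  repos="$repos $page"
  last="${page##* }"  # last repo on the page becomes the next cursor
done
repos="${repos# }"
echo "$repos"  # repo-a repo-b repo-c
```

With thousands of repositories visible to the credential, that loop is one round trip per page, which is why a recursive run can take so long.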
That could make sense, but from the Cloud Run metrics dashboard I can see that no container is running 5 minutes after launch!
BTW, everything worked as expected when running it locally:
server
❯ docker run -e GCRCLEANER_TOKEN="$(gcloud auth print-access-token)" -p 8080:8080 europe-docker.pkg.dev/gcr-cleaner/gcr-cleaner/gcr-cleaner
Unable to find image 'europe-docker.pkg.dev/gcr-cleaner/gcr-cleaner/gcr-cleaner:latest' locally
latest: Pulling from gcr-cleaner/gcr-cleaner/gcr-cleaner
Digest: sha256:6dd48dba59455e9d0a6cfd7625c7dce2a71c58cb504ca9115b8e70b1a059f287
Status: Downloaded newer image for europe-docker.pkg.dev/gcr-cleaner/gcr-cleaner/gcr-cleaner:latest
server is listening on 8080
deleting refs for eu.gcr.io/my-test-project since 2021-11-22 08:56:20.5558459 +0000 UTC
client
❯ curl -X POST 'http://127.0.0.1:8080/http' -d '{"repo": "eu.gcr.io/my-test-project", "recursive": true}'
{"count":2,"refs":["sha256:42bba58a1c5a6e2039af02302ba06ee66c446e9547cbfb0da33f4267638cdb53","sha256:6dd48dba59455e9d0a6cfd7625c7dce2a71c58cb504ca9115b8e70b1a059f287"]}%
# took 2m 24s
Any thoughts though?
That's definitely interesting. What happens if you invoke your Cloud Run job manually? As the creator, you can do something like:
curl <URL_OF_CLOUD_RUN_SERVICE> -H "Authorization: Bearer $(gcloud auth print-identity-token)" -d '{"repo": "eu.gcr.io/my-test-project", "recursive": true}'
I got no refs found:
❯ export SERVICE_URL=$(gcloud run services describe gcr-cleaner --project "${PROJECT_ID}" --platform "managed" --region "europe-north1" --format 'value(status.url)')
❯ curl "${SERVICE_URL}/http" -H "Authorization: Bearer $(gcloud auth print-identity-token)" -d '{"repo": "eu.gcr.io/'${PROJECT_ID}'", "recursive": true}'
{"count":0,"refs":[]}%
Are there any refs left given your successful run above?
Yes, I added a new untagged image to each repository:
❯ gcloud container images list-tags eu.gcr.io/my-test-project/test/nginx
DIGEST TAGS TIMESTAMP
1a690e51d37a 2021-11-15T16:18:11
❯ gcloud container images list-tags eu.gcr.io/my-test-project/test/nginx2
DIGEST TAGS TIMESTAMP
d536cf3289b3 2021-11-20T11:48:17
❯ curl "${SERVICE_URL}/http" -H "Authorization: Bearer $(gcloud auth print-identity-token)" -d '{"repo": "eu.gcr.io/'${PROJECT_ID}'", "recursive": true}'
{"count":0,"refs":[]}%
I always get confused with shell escaping in bash. Are you sure that's properly injecting the project id? Just to be sure, can you hardcode it in your command?
sure :)
❯ curl "https://gcr-cleaner-***-lz.a.run.app/http" -H "Authorization: Bearer $(gcloud auth print-identity-token)" -d '{"repo": "eu.gcr.io/my-test-project", "recursive": true}'
{"count":0,"refs":[]}%
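For what it's worth, one way to sidestep the quoting question entirely is to build the payload first and print it before sending. This is plain POSIX shell, nothing gcr-cleaner-specific; the single quotes in the original command do stop expansion, which is why they have to be closed around the variable:

```shell
#!/bin/sh
PROJECT_ID="my-test-project"
# Single quotes suppress variable expansion, so the original command closes
# them around the variable ('…'"${PROJECT_ID}"'…'). Building the payload
# separately with printf is easier to eyeball before sending it with curl:
payload=$(printf '{"repo": "eu.gcr.io/%s", "recursive": true}' "${PROJECT_ID}")
echo "${payload}"  # {"repo": "eu.gcr.io/my-test-project", "recursive": true}
```

Echoing the payload (or passing `-d "${payload}"` to curl) makes it obvious whether the project id was injected.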
BTW, can you try deploying the whole stack and check whether you can reproduce?
Following your steps above:
On the first attempt I got an error back from the container:
error 400: failed to list child repositories for "xx": failed to fetch all repositories from registry eu.gcr.io: GET https://eu.gcr.io/v2/_catalog?n=1000: DENIED: Cloud Resource Manager API has not been used in project xx before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/cloudresourcemanager.googleapis.com/overview?project=xx then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
After fixing that, I can confirm I'm getting the same behavior as you. I'm still digging...
Alright, so I've narrowed it down a bit further. The issue appears to be that the service account on Cloud Run doesn't have permission to list images, so the recursive call returns an empty list when querying the catalog. That makes the effective cleanup list [], which is why we're seeing nothing being cleaned up. If you specify the full repo path, it works as expected. I'm still digging into the permissions issue.
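That failure mode can be sketched in a few lines. This is a hedged illustration, not the service's actual code: `list_child_repos` is a hypothetical stand-in for the catalog query made without the needed role, and the point is that a denied or empty catalog silently collapses into "nothing to clean":

```shell
#!/bin/sh
# Hedged sketch: when the catalog query is denied, recursive expansion yields
# no child repos, so the cleaner "succeeds" with an empty cleanup set.
# list_child_repos is a hypothetical stand-in for the _catalog call made
# by a service account lacking list permission.
list_child_repos() {
  echo ""  # DENIED / empty catalog response collapses to "no children"
}
targets=$(list_child_repos "eu.gcr.io/my-test-project")
if [ -z "$targets" ]; then
  result='{"count":0,"refs":[]}'
fi
echo "$result"  # {"count":0,"refs":[]}
```

Which matches the observed behavior: a Success status and `{"count":0,"refs":[]}` with nothing actually deleted.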
If you plan on using the recursive functionality, you must also grant the service account "Browser" permissions on the project:
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:gcr-cleaner@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role "roles/browser"
Unfortunately there is no more granular permission available in Container Registry. In Artifact Registry, you can scope this to individual repos.
Thanks for the explanations and for updating the documentation.
I have one repository test with two child repositories nginx & nginx2 in my project my-test-project. Each child repository has one untagged image.

I deployed the stack following the setup, then added "recursive": true to the payload (--message-body "{\"repo\":\"${REPO}\", \"recursive\":true}") and manually launched the job. The job has a Success status, and nothing special shows up in the Cloud Run logs. However, no images have been deleted.

Could you please have a look and tell me what was misconfigured, or whether this is a bug somewhere? Best!