GoogleCloudPlatform / gcr-cleaner

Delete untagged image refs in Google Container Registry or Artifact Registry
Apache License 2.0

Using recursive is not working when deployed as a Cloud Run service and invoked via Cloud Scheduler #57

Closed by anouarchattouna 2 years ago

anouarchattouna commented 2 years ago

In my project my-test-project, I have one repository, test, with two child repositories, nginx and nginx2:

❯ gcloud container images list --repository eu.gcr.io/my-test-project --format json | jq -r '.[].name'
eu.gcr.io/my-test-project/test
❯ gcloud container images list --repository eu.gcr.io/my-test-project/test --format json | jq -r '.[].name'
eu.gcr.io/my-test-project/test/nginx
eu.gcr.io/my-test-project/test/nginx2

Each child repository has one untagged image:

❯ gcloud container images list-tags eu.gcr.io/my-test-project
Listed 0 items.
❯ gcloud container images list-tags eu.gcr.io/my-test-project/test
Listed 0 items.
❯ gcloud container images list-tags eu.gcr.io/my-test-project/test/nginx
DIGEST        TAGS    TIMESTAMP
6dd48dba5945  latest  2021-10-01T18:22:48
42bba58a1c5a          2021-04-13T21:20:40
❯ gcloud container images list-tags eu.gcr.io/my-test-project/test/nginx2
DIGEST        TAGS  TIMESTAMP
aa3bfd8050fd        2021-11-19T16:56:55
42bba58a1c5a  0.1   2021-04-13T21:20:40

I deployed the stack by following the setup instructions:

❯ export PROJECT_ID="my-test-project"
❯ gcloud services enable --project "${PROJECT_ID}" \
  appengine.googleapis.com \
  cloudscheduler.googleapis.com \
  run.googleapis.com
❯ gcloud iam service-accounts create "gcr-cleaner" \
  --project "${PROJECT_ID}" \
  --display-name "gcr-cleaner"
❯ gcloud --quiet run deploy "gcr-cleaner" \
  --async \
  --project ${PROJECT_ID} \
  --platform "managed" \
  --service-account "gcr-cleaner@${PROJECT_ID}.iam.gserviceaccount.com" \
  --image "europe-docker.pkg.dev/gcr-cleaner/gcr-cleaner/gcr-cleaner" \
  --region "europe-north1" \
  --timeout "60s"
❯ gsutil acl ch -u gcr-cleaner@${PROJECT_ID}.iam.gserviceaccount.com:W gs://eu.artifacts.${PROJECT_ID}.appspot.com
❯ gcloud iam service-accounts create "gcr-cleaner-invoker" \
  --project "${PROJECT_ID}" \
  --display-name "gcr-cleaner-invoker"
❯ gcloud run services add-iam-policy-binding "gcr-cleaner" \
  --project "${PROJECT_ID}" \
  --platform "managed" \
  --region "europe-north1" \
  --member "serviceAccount:gcr-cleaner-invoker@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role "roles/run.invoker"
❯ gcloud app create \
  --project "${PROJECT_ID}" \
  --region "europe-west3" \
  --quiet
❯ export SERVICE_URL=$(gcloud run services describe gcr-cleaner --project "${PROJECT_ID}" --platform "managed" --region "europe-north1" --format 'value(status.url)')
❯ export REPO="eu.gcr.io/${PROJECT_ID}"
❯ gcloud scheduler jobs create http "gcrclean-myimage" \
  --project ${PROJECT_ID} \
  --description "Cleanup ${REPO}" \
  --uri "${SERVICE_URL}/http" \
  --message-body "{\"repo\":\"${REPO}\", \"recursive\":true}" \
  --oidc-service-account-email "gcr-cleaner-invoker@${PROJECT_ID}.iam.gserviceaccount.com" \
  --schedule "0 2 * * 5" \
  --time-zone="Europe/Helsinki"

Note that the payload includes "recursive":true (--message-body "{\"repo\":\"${REPO}\", \"recursive\":true}"). I then launched the job manually. The job reports a Success status, and the Cloud Run logs show nothing unusual:

Info
2021-11-19T16:50:56.701466Z
Cloud Run gcr-cleaner {@type: type.googleapis.com/google.cloud.audit.AuditLog, resourceName: namespaces/my-test-project/services/gcr-cleaner, response: {…}, serviceName: run.googleapis.com, status: {…}}
Default
2021-11-19T17:02:01.455463Z
deleting refs for eu.gcr.io/my-test-project since 2021-11-19 17:02:01.455315674 +0000 UTC
Info
2021-11-19T17:02:02.780201Z
POST 200 727 B 1.3 s Google-Cloud-Scheduler https://gcr-cleaner-ascdkoub3a-lz.a.run.app/http
Default
2021-11-19T17:02:08.255180Z
server is listening on 8080

However, no images have been deleted:

❯ gcloud container images list-tags eu.gcr.io/my-test-project/test/nginx
DIGEST        TAGS    TIMESTAMP
6dd48dba5945  latest  2021-10-01T18:22:48
42bba58a1c5a          2021-04-13T21:20:40
❯ gcloud container images list-tags eu.gcr.io/my-test-project/test/nginx2
DIGEST        TAGS  TIMESTAMP
aa3bfd8050fd        2021-11-19T16:56:55
42bba58a1c5a  0.1   2021-04-13T21:20:40

Could you please have a look and tell me what was misconfigured or if this is a bug somewhere? Best!

sethvargo commented 2 years ago

Recursive deletion on GCR can take a very, very, very long time. It's honestly probably still running. The reason for this is that the Docker registry API does not permit partial paging. So to recursively delete your resources, gcr-cleaner has to page over every repository your credential has access to, which includes all public GCR repos.
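
To illustrate why this is slow, here is a minimal sketch of the idea rather than gcr-cleaner's actual code: the catalog listing covers everything the credential can see, so the caller has to walk the whole list and keep only the names under the requested prefix. The `filter_children` helper and the sample catalog below are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only catalog entries at or under a given prefix.
# Reads repository names (one per line) on stdin.
filter_children() {
  local prefix="$1"
  local name
  while IFS= read -r name; do
    case "${name}" in
      "${prefix}"|"${prefix}"/*) printf '%s\n' "${name}" ;;
    esac
  done
}

# Made-up catalog listing, standing in for the paged /v2/_catalog response.
catalog='my-test-project/test/nginx
my-test-project/test/nginx2
other-project/app
my-test-project-2/api'

children="$(printf '%s\n' "${catalog}" | filter_children "my-test-project")"
printf '%s\n' "${children}"
```

Every entry has to be read before it can be discarded, so the cost grows with the size of the whole catalog and not with the number of matching repositories.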

anouarchattouna commented 2 years ago

That could make sense, but the Cloud Run metrics dashboard shows that no container is running 5 minutes after launch.

BTW, everything worked as expected when running it locally:

server

❯ docker run -e GCRCLEANER_TOKEN="$(gcloud auth print-access-token)" -p 8080:8080 europe-docker.pkg.dev/gcr-cleaner/gcr-cleaner/gcr-cleaner
Unable to find image 'europe-docker.pkg.dev/gcr-cleaner/gcr-cleaner/gcr-cleaner:latest' locally
latest: Pulling from gcr-cleaner/gcr-cleaner/gcr-cleaner
Digest: sha256:6dd48dba59455e9d0a6cfd7625c7dce2a71c58cb504ca9115b8e70b1a059f287
Status: Downloaded newer image for europe-docker.pkg.dev/gcr-cleaner/gcr-cleaner/gcr-cleaner:latest
server is listening on 8080
deleting refs for eu.gcr.io/my-test-project since 2021-11-22 08:56:20.5558459 +0000 UTC

client

❯ curl -X POST 'http://127.0.0.1:8080/http' -d '{"repo": "eu.gcr.io/my-test-project", "recursive": true}'
{"count":2,"refs":["sha256:42bba58a1c5a6e2039af02302ba06ee66c446e9547cbfb0da33f4267638cdb53","sha256:6dd48dba59455e9d0a6cfd7625c7dce2a71c58cb504ca9115b8e70b1a059f287"]}
# took 2m 24s

Any thoughts though?

sethvargo commented 2 years ago

That's definitely interesting. What happens if you invoke your Cloud Run job manually? As the creator, you can do something like:

curl <URL_OF_CLOUD_RUN_SERVICE> -H "Authorization: Bearer $(gcloud auth print-identity-token)" -d '{"repo": "eu.gcr.io/my-test-project", "recursive": true}'

anouarchattouna commented 2 years ago

It returned no refs:

❯ export SERVICE_URL=$(gcloud run services describe gcr-cleaner --project "${PROJECT_ID}" --platform "managed" --region "europe-north1" --format 'value(status.url)')

❯ curl "${SERVICE_URL}/http" -H "Authorization: Bearer $(gcloud auth print-identity-token)" -d '{"repo": "eu.gcr.io/'${PROJECT_ID}'", "recursive": true}'
{"count":0,"refs":[]}

sethvargo commented 2 years ago

Are there any refs left given your successful run above?

anouarchattouna commented 2 years ago

Yes, I added a new untagged image to each repository:

❯ gcloud container images list-tags eu.gcr.io/my-test-project/test/nginx
DIGEST        TAGS  TIMESTAMP
1a690e51d37a        2021-11-15T16:18:11
❯ gcloud container images list-tags eu.gcr.io/my-test-project/test/nginx2
DIGEST        TAGS  TIMESTAMP
d536cf3289b3        2021-11-20T11:48:17
❯ curl "${SERVICE_URL}/http" -H "Authorization: Bearer $(gcloud auth print-identity-token)" -d '{"repo": "eu.gcr.io/'${PROJECT_ID}'", "recursive": true}'
{"count":0,"refs":[]}

sethvargo commented 2 years ago

I always get confused with shell escaping in bash. Are you sure that's properly injecting the project id? Just to be sure, can you hardcode it in your command?
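
One way to take quoting out of the picture is to build the request body once and print it before sending. This is only a sketch: `build_payload` is a hypothetical helper, and it assumes the repo value contains no characters that need JSON escaping.

```shell
# Build the JSON body with printf so the quoting lives in one place.
# Assumes the repo value needs no JSON escaping (no quotes or backslashes).
build_payload() {
  printf '{"repo":"%s","recursive":%s}' "$1" "$2"
}

PROJECT_ID="my-test-project"
payload="$(build_payload "eu.gcr.io/${PROJECT_ID}" true)"

# Inspect the exact body before sending it.
printf '%s\n' "${payload}"

# Then pass it to curl unchanged, for example:
#   curl "${SERVICE_URL}/http" \
#     -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
#     -d "${payload}"
```

If the printed body already shows the expected project ID, escaping can be ruled out as the cause.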

anouarchattouna commented 2 years ago

sure :)

❯ curl "https://gcr-cleaner-***-lz.a.run.app/http" -H "Authorization: Bearer $(gcloud auth print-identity-token)" -d '{"repo": "eu.gcr.io/my-test-project", "recursive": true}'
{"count":0,"refs":[]}

anouarchattouna commented 2 years ago

BTW, could you try deploying the whole stack to check whether you can reproduce this?

sethvargo commented 2 years ago

Following your steps above:

On the first attempt, I got an error back from the container:

error 400: failed to list child repositories for "xx": failed to fetch all repositories from registry eu.gcr.io: GET https://eu.gcr.io/v2/_catalog?n=1000: DENIED: Cloud Resource Manager API has not been used in project xx before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/cloudresourcemanager.googleapis.com/overview?project=xx then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.

After fixing that, I can confirm I'm getting the same behavior as you. I'm still digging...

sethvargo commented 2 years ago

Alright, so I've narrowed it down a bit further. The issue appears to be that the service account on Cloud Run doesn't have permissions to list images, therefore the recursive call returns an empty list when querying the catalog. That makes the effective cleanup list [], which is why we're seeing nothing being cleaned up. If you specify the full repo path, it works as expected. I'm still digging into the permissions issue.
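
As a stopgap, the full-path approach could look roughly like the sketch below, which builds one non-recursive request body per child repository. The child list is taken from the example at the top of this issue. The variable names are illustrative, and the curl call is left as a comment because it needs a deployed service.

```shell
# Child repositories from the example earlier in the thread.
REGISTRY="eu.gcr.io/my-test-project"
CHILDREN="test/nginx test/nginx2"

# Emit one non-recursive request body per full repository path.
payloads=""
for child in ${CHILDREN}; do
  body="$(printf '{"repo":"%s/%s","recursive":false}' "${REGISTRY}" "${child}")"
  payloads="${payloads}${body}
"
  # To actually send it (SERVICE_URL as defined in the setup above):
  #   curl "${SERVICE_URL}/http" \
  #     -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  #     -d "${body}"
done
printf '%s' "${payloads}"
```

This avoids the catalog query entirely, at the cost of having to maintain the list of child repositories by hand.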

sethvargo commented 2 years ago

If you plan on using the recursive functionality, you must also grant the service account "Browser" permissions on the project:

gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:gcr-cleaner@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role "roles/browser"

Unfortunately there is no more granular permission available in Container Registry. In Artifact Registry, you can scope this to individual repos.
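
To confirm the binding took effect, something like the following sketch may help. The `has_role` helper is hypothetical. The commented gcloud invocation follows a commonly used pattern for listing the roles bound to one member, but treat the exact filter as an assumption and adjust it for your setup.

```shell
# Hypothetical helper: succeeds if the given role appears, as a whole line,
# in the list of roles read from stdin.
has_role() {
  grep -qxF -- "$1"
}

# Real usage (requires gcloud and permission to read the project policy):
#   gcloud projects get-iam-policy "${PROJECT_ID}" \
#     --flatten "bindings[].members" \
#     --filter "bindings.members:serviceAccount:gcr-cleaner@${PROJECT_ID}.iam.gserviceaccount.com" \
#     --format "value(bindings.role)" | has_role "roles/browser"

# Offline demonstration with a made-up role list.
if printf '%s\n' "roles/storage.objectAdmin" "roles/browser" | has_role "roles/browser"; then
  result="present"
else
  result="missing"
fi
printf 'roles/browser is %s\n' "${result}"
```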

anouarchattouna commented 2 years ago

Thanks for the explanations and for updating the documentation.