GoogleCloudPlatform / gcr-cleaner

Delete untagged image refs in Google Container Registry or Artifact Registry
Apache License 2.0

Not finding all relevant images #151

Closed: Macariel closed this issue 8 months ago

Macariel commented 1 year ago

TL;DR

Not all relevant images are being found and subsequently deleted. Some images are omitted for unknown reasons.

Expected behavior

All images matching the filter, or all untagged images older than the grace period, should be listed and deleted.

Observed behavior

Running the simple command ./gcr-cleaner-cli -repo europe-west1-docker.pkg.dev/project/repo/repo -grace="320h" -tag-filter-all "^ts_.*" --dry-run produces a list of the images that would be deleted:

However, there are multiple problems with the output:

- Untagged images are only found for digests starting with sha:00 up to sha:47 (which can clearly be seen because the list is sorted). There are no images in there starting with sha:5... or any other hex prefix.
- Not all tagged images are found. For some reason ts_35121... is found but not ts_32555..., even though they were tagged on the same day.
- Removing the tag filter has the same outcome, as does raising or lowering the grace value. The total number of images changes, so there does not seem to be a hard cap on the number of images; it must be something else.

Coupled with our multi-arch builds, which leave many images that cannot be deleted due to dangling parents, this means that we are barely deleting any images in a 3h window.

Looking through the output by running the command with GCRCLEANER_LOG=DEBUG does not list any of the missing images either.

Debug log output

The debug output is very large and might contain information that I don't want public.
I'd be happy to filter it if you can give me pointers on what would interest you specifically.

How are you running gcr-cleaner?

CLI

gcr-cleaner version

v0.11.1

Environment

I installed with go install github.com/GoogleCloudPlatform/gcr-cleaner/cmd/gcr-cleaner-cli@v0.11.1 and I am using it under Linux.

The repository in question is very big: there are currently about 10 TB of images in there, which is why we want to clean it up. The runtime is therefore also pretty slow, and with a grace period of 720h we are barely getting the job to run in under 3h. However, the dry run is very fast.

Additional information

No response

sethvargo commented 1 year ago

Hi @Macariel - thank you for opening a bug.

> Untagged images are only found for images starting with sha:00 up to sha:47 (can clearly be seen because the list is sorted). There are no images in there starting with sha:5... or any other hex number.

Are the images included in the debug output? One of the very first lines will be the full list of images (as JSON) that it found.

> Not all tagged images are found. For some reason ts_35121... is found but not ts_32555... even though they have been tagged on the same day

Same question - are those tags present in the list of all things gcr-cleaner found?

My initial guess is that the repo is so large, you're hitting rate limit/quota issues. Those should return an error though, so I'm skeptical.

Assuming you have enough CPUs, you can crank up the concurrency with -concurrency=1000. That will do 1000 operations in parallel, which should speed things up.
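For example, a hypothetical invocation assembled from the flags discussed in this thread (the repo path is the reporter's placeholder, and the flag values are illustrative, not a recommendation):

```shell
# Hypothetical: the reporter's dry-run command with higher concurrency added.
GCRCLEANER_LOG=debug ./gcr-cleaner-cli \
  -repo europe-west1-docker.pkg.dev/project/repo/repo \
  -grace="320h" \
  -tag-filter-all "^ts_.*" \
  -concurrency=1000 \
  -dry-run
```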

Macariel commented 1 year ago

Hi @sethvargo thank you for taking a look!

I've captured the debug output using the following command:

GCRCLEANER_LOG=debug ./gcr-cleaner-cli -grace="320h" -repo europe-west1-docker.pkg.dev/project/repo/repo -tag-filter-all "^(ts_.*)" > all 2>&1 --dry-run

Searching in the file for sha256:5 or ts_32555 yields no results. There are also no errors or warnings for that matter. Looking for sha256:4 gives me results in the huge second line.

As for the rate limit, adjusting the grace period yields different results depending on the value:

Not sure why a rate limit would depend on the grace period here. Also, there are "only" around 50,000 images in the repository, which doesn't sound like too much for Google. And I can fetch all images with the following gcloud command without a problem (while using the same admin account):

gcloud artifacts docker images list europe-west1-docker.pkg.dev/project/repo/repo --limit 100000 --sort-by "UPDATE_TIME" --include-tags --format=json

sethvargo commented 1 year ago

https://cloud.google.com/artifact-registry/quotas#project-quota

Notably this section:

> The Docker Registry API method to list images returns an incomplete list if a repository has more than 10,000 images or tags. This limitation applies to Docker clients that use the Docker Registry API to interact with registries. The limitation does not apply to the gcloud artifacts docker images list command or Artifact Registry API requests.

We use the Docker registry API because GCR cleaner supports all OCI registries. That appears to be the root of the problem here.
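The effect can be sketched with a toy model (a hypothetical client, not gcr-cleaner's actual code; only the 10,000-entry cap comes from the quota page above): a capped listing returns HTTP 200 with only the lexicographically first entries and no error, which matches the report that only low digest prefixes ever appear in the output.

```python
# Toy model of the documented Artifact Registry behavior: the Docker Registry
# API list call returns at most LIMIT entries, silently and without an error,
# so a client that trusts the response never sees the remainder.
LIMIT = 10_000  # cap documented in the Artifact Registry quota page

def registry_list_tags(all_entries):
    """Simulate a capped tags/list response: sorted, truncated, status 200."""
    return {"tags": sorted(all_entries)[:LIMIT]}

# A repository with ~50,000 entries, like the reporter's.
repo = [f"sha256_{i:05d}" for i in range(50_000)]

seen = registry_list_tags(repo)["tags"]
print(len(seen))         # 10000 -- only the first fifth is ever visible
print(repo[-1] in seen)  # False -- later digests never show up, no error raised
```

Because the capped list is sorted, only the lexicographically first entries survive, which is why the reporter saw digests up to sha:47 and nothing beyond.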

Macariel commented 1 year ago

I guess this is not a bug then. If you don't see a good way around it, you can close the ticket :thinking: Thank you very much for this insight!