cloudflare / serverless-registry

A container registry backed by Workers and R2.
Apache License 2.0
994 stars 35 forks source link

Garbage collector #29

Closed IvanDev closed 2 months ago

IvanDev commented 7 months ago

This PR introduces a garbage collector. When we remove an image or tag from the repository, blobs referenced by deleted manifests are not removed from R2. With this new PR, we'll schedule garbage collecting after any modifying operations. GC will wait 10 minutes after any modifications are done to the repository this ensures we'll not start garbage collecting of freshly uploaded blobs without parenting manifest.

We have 2 modes for the garbage collector, unreferenced and untagged:

  1. Unreferenced will delete all blobs that are not referenced by any manifest.
  2. Untagged will delete all blobs that are not referenced by any manifest and are not tagged.

Users can skip the GARBAGE_COLLECTOR_MODE variable which will disable GC.

Some considerations:

gabivlj commented 7 months ago

Thanks for the contribution! I agree we should have some kind of garbage collection to remove dangling layers. However, I might have some concerns with having garbage collection done by a DO alarm or asynchronously.

AFAIU, there is a small chance that there is a new manifest uploaded while the garbage collection is happening and new layers might be removed. I think this is just something we will have to accept for this in some way to keep the registry implementation simple. If we go with this route, users of the registry that use garbage collection really need to know that it comes with some risks.

I am going to propose some things and let me know what you think (just throwing some stuff here for discussion):

IMO I'd go for the GC endpoint and just let anybody that needs GC to explicitly call it.

If we are feeling fancy with DOs we could do these passes in there, but still with workers unbounded plans + ctx.waitUntil() we can extend computation time already by a bit.

Hope you find this overall feedback fair! I am open to have GC just wanting to see your point of view on these 😄.

IvanDev commented 7 months ago

Here are my 2 bits:

  1. I totally agree with having a manual GC request being the safest option. In the recent changes, I added /:name+/collectgarbage route with optional mode=unreferenced|untagged parameter so users can trigger GC whenever they feel safe.
  2. I also added a date check for the uploaded date field on R2 objects, so now when we scan through the blobs and if there are any changes within the last 10 minutes we'll abort GC scan to prevent damage to freshly uploaded layers. This should make the damage by GC even more unlikely. In addition, GC is now scheduled after 30 minutes this should cover for any network delays.
  3. Introducing a lock will also introduce complexity and the potential for a deadlock. So unless we really want it, I guess we can skip it and use the current implementation since it should cover most of the cases.
  4. Unfortunately, we can't minimize the request count to manifests since layers potentially can be referenced from multiple manifests (e.g. with tag/digest manifests). Counting the strong consistency of R2, relatively rare writes to the registry, and cost savings on storage by using even current primitive GC I think it should be good enough for now. As you mentioned, having DO state would make scanning almost x4.5 cheaper, but ROI on investing that much time into all of this complexity probably would be negligible, at least on most use cases with relatively rare writes (and users can even schedule GC manually now).

I agree with you on your concerns and I'm too a bit afraid of over-engineering :) I would prefer to keep the implementation simple and introduce complexity only when we really need it.

gabivlj commented 7 months ago

I like the date change. I'd suggest we remove the asynchronous garbage collection if we have the endpoint. And in that case I do not think DO is necessary (trying to stay within R2 as much as possible here).

IvanDev commented 7 months ago

Sure, let's keep it safe. Had to remove date checks since it's useless now. Let's keep users responsible for their own actions :)

gabivlj commented 7 months ago

I think you will have to run the formatter on this PR and the other https://github.com/cloudflare/serverless-registry/pull/30/files

IvanDev commented 7 months ago

Sure!

martinwang2002 commented 5 months ago

upvote on this

gabivlj commented 2 months ago

Hello folks! Due to garbage collection popular demand I have cherry picked @IvanDev's commit for rebase and add a bit of concurrency safety that I thought could be nice. Thank you for the feedback. https://github.com/cloudflare/serverless-registry/pull/48

gabivlj commented 2 months ago

Cherry picked commit from this PR has been merged here https://github.com/cloudflare/serverless-registry/pull/48. Thank you @IvanDev for the implementation and the discussion!