Be a good registry client

aidansteele commented 1 year ago

TODO: this issue itself needs to be fleshed out, hopefully based on feedback from registry operators.

This app boils down to being a Docker registry client. Docker registries are expensive to run and most place limits on excessive usage. See:

Docker Hub
ECR Public
(TODO) ghcr.io
(TODO) gcr.io
(TODO) mcr.microsoft.com

Goals:

Minimise end-user latency for ima.ge.cx (people only use fast websites)
Minimise resource consumption / impact on registries (I don't want to be a jerk)
Minimise hosting costs incurred by me (I'm paying for this out of my pocket and don't want it to sting)

Some questions that first come to mind:

When a user requests to view a file, I make Range requests to retrieve ~1MB from the blob that contains the file. Is that acceptable? Or should I mirror them in my own S3 bucket? ($$$ could add up quickly)
Is it acceptable for /api/lookup to query registries (e.g. at /v2/*/manifests) directly? Or should I cache that data in DynamoDB for some period of time? Does putting it behind CloudFront mitigate the problem?
Do any of the answers change if I later #14?
What issues am I not even considering yet?
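To make the first question concrete, here is a rough sketch of how a ~1MB Range request against a blob could be built. It is not the actual ima.ge.cx backend code. The `/v2/<name>/blobs/<digest>` path comes from the registry HTTP API; aligning requests to fixed chunk boundaries is an assumption added here (it makes chunks easier to cache later), and the helper names `chunk_range`, `range_headers` and `blob_url` are hypothetical.

```python
from typing import Dict, Tuple

CHUNK_SIZE = 1024 * 1024  # ~1MB, the figure mentioned in the question above


def chunk_range(offset: int, chunk_size: int = CHUNK_SIZE) -> Tuple[int, int]:
    """Return the inclusive (first, last) byte positions of the aligned chunk
    that contains `offset` within the blob."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    first = (offset // chunk_size) * chunk_size
    return first, first + chunk_size - 1


def range_headers(offset: int, chunk_size: int = CHUNK_SIZE) -> Dict[str, str]:
    """Build the HTTP Range header for the chunk containing `offset`."""
    first, last = chunk_range(offset, chunk_size)
    return {"Range": f"bytes={first}-{last}"}


def blob_url(registry: str, repository: str, digest: str) -> str:
    """URL of a blob in the registry HTTP API (hypothetical helper)."""
    return f"https://{registry}/v2/{repository}/blobs/{digest}"
```

The headers and URL would then be handed to whatever HTTP client the backend uses. A server that honours the range replies with `206 Partial Content`; one that ignores it replies with `200` and the whole blob, so the caller should check the status code.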

monken commented 1 year ago

Ask users to sign up for your service
Ask users to provide their API key for Docker Hub (and the others) which you can then use to make the requests
Otherwise, they will only see cached results
Cache files on EFS with a sensible lifecycle policy and mount the EFS volume in your lambda

This sort of infrastructure would be required for private images anyways.

jlbutler commented 1 year ago

This is super cool @aidansteele.

I haven't seen the backend code, but seeing that you're doing ranged gets, definitely wanted to point you at the Streaming OCI project. Maybe you will find a way to leverage this - lazy loading files vs the 1MB chunk fetch might be cool?

You will hit some limits with some popular images for sure, so caching does make sense. I guess the tradeoff here is limiting the amount you're hitting repositories vs the amount of data you want to hold onto (and pay for). Manifests are pretty light, and immutable so theoretically you could cache based on digest, and pull through new manifests only when a tag has moved. On the data front, my hunch is you won't end up storing entire images over time, so the smaller your chunk the better (again SOCI might be of use here, or smaller chunk sizes).
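A minimal sketch of the "cache based on digest, pull through only when a tag has moved" idea, in case it helps to see it spelled out. The class name `ManifestCache` and the two injected callables are hypothetical. In practice `resolve_tag` would be a cheap request that returns the tag's current digest (typically a HEAD on the manifest endpoint) and `fetch_manifest` would be the full GET by digest. The in-memory dict stands in for whatever store is actually used.

```python
from typing import Callable, Dict


class ManifestCache:
    """Caches manifest bodies by digest; a tag is re-resolved on every lookup,
    but the manifest itself is only fetched when the digest is new."""

    def __init__(
        self,
        resolve_tag: Callable[[str], str],        # tag -> current digest (assumed cheap)
        fetch_manifest: Callable[[str], bytes],   # digest -> manifest body
    ) -> None:
        self._resolve_tag = resolve_tag
        self._fetch_manifest = fetch_manifest
        self._by_digest: Dict[str, bytes] = {}

    def get(self, tag: str) -> bytes:
        digest = self._resolve_tag(tag)
        if digest not in self._by_digest:
            # Either a first lookup or the tag has moved: pull through once.
            self._by_digest[digest] = self._fetch_manifest(digest)
        return self._by_digest[digest]
```

Because content addressed by digest never changes, entries never need invalidating. Only the tag-to-digest lookup has to stay fresh.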

On pricing considerations in particular, this is being a good citizen. Depending on the registry, you may also run into API throttles. Should be pretty safe with ECR Public (20/1 TPS and plenty of bandwidth per publisher - 500GB anon and 5TB w/ AWS)... unless you end up going very very viral on a particular publisher's image, you never know. For other registries you might need to pull images to your own holding tank - or ask users for their creds as monken suggested (but that's a whole kettle of fish).

I guess from here, the data caching has some considerations. You could cache your chunks, and time them out on an LRU or something simple. If you look into SOCI, it might be that you still cache lookup but it might be quite different - could be worth modeling both.
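For the "cache your chunks and time them out on an LRU" option, something along these lines might be a starting point. It is only a sketch: `ChunkLRU`, the `(digest, chunk index)` key shape and the byte budget are all assumptions, and a real deployment would more likely sit on top of S3, EFS or similar than in process memory.

```python
from collections import OrderedDict
from typing import Optional, Tuple

ChunkKey = Tuple[str, int]  # (blob digest, chunk index), a hypothetical key shape


class ChunkLRU:
    """Keeps chunks up to a byte budget, evicting the least recently used first."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.size = 0  # total bytes currently held
        self._chunks: "OrderedDict[ChunkKey, bytes]" = OrderedDict()

    def get(self, key: ChunkKey) -> Optional[bytes]:
        data = self._chunks.get(key)
        if data is not None:
            self._chunks.move_to_end(key)  # mark as most recently used
        return data

    def put(self, key: ChunkKey, data: bytes) -> None:
        old = self._chunks.pop(key, None)
        if old is not None:
            self.size -= len(old)
        self._chunks[key] = data
        self.size += len(data)
        # Evict from the least recently used end until back under budget.
        while self.size > self.max_bytes and self._chunks:
            _, evicted = self._chunks.popitem(last=False)
            self.size -= len(evicted)
```

Budgeting by bytes rather than by entry count keeps the storage bill predictable even if chunk sizes change later.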

Hopefully this is useful, and maybe other folks will have some good input!

jlbutler commented 1 year ago

cc'ing some oci registry friendos to see if i flubbed anything up or if they have other ideas for you @sudo-bmitch @michaelb990 @jdolitsky @imjasonh @sajayantony @jamesmt-aws

imjasonh commented 1 year ago

This looks super cool! I've been tinkering with a similar NextJS frontend for browsing images at https://registry-ui.chainguard.app/?image=cgr.dev/chainguard/static, and @jonjohnsonjr has been making https://explore.ggcr.dev/ -- registry-ui is more focused on linking to attached signatures / SBOMs / attestations, and Jon's also lets you browse the filesystem (example).

I haven't looked deeply into how ima.ge.cx works yet, but I have a couple ideas that might be useful:

registry-ui relies on https://mirror.kontain.me to prefetch and mirror the image layers into a GCS bucket I own, and the UI then makes OCI API calls to the mirror, which also supports CORS and lets the UI skip auth exchanges. This made the frontend a bit easier to build, but it means the first time the image is loaded you have to wait for the mirror to pull it. For smallish images it's not too bad, and the UI and API could do a better job of cooperating so that blobs don't have to be mirrored just to display manifest/config data. If you want to use mirror.kontain.me directly or fork it and run your own, I'd be happy to help there.
There's also https://flatten.kontain.me, which mirrors like above, but also flattens the layers into one layer, which might make browsing a unified filesystem easier. But more than that,...
There's also https://estargz.kontain.me, which is possibly not quite useful, but it's within striking distance of being a SOCI-optimizing mirror, which might be useful. I assume since SOCI and estargz share a lineage it should be fairly easy to bring up a soci.kontain.me to play with if that's interesting.

All of the above have a 24 hour TTL to keep costs low, and eventually I'd like to think about porting them all to Cloudflare R2 so I don't have to worry about egress fees stealing my kids' college funds. 💸
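As a rough illustration of the 24 hour TTL idea (not how the kontain.me mirrors are actually implemented), expiry can be as simple as storing a deadline next to each entry. `TTLCache` and its injectable `clock` parameter are hypothetical names. With an object store, the same effect would more likely come from a bucket lifecycle rule than from application code.

```python
import time
from typing import Callable, Dict, Optional, Tuple

DAY_SECONDS = 24 * 60 * 60


class TTLCache:
    """Entries expire a fixed time after they were stored."""

    def __init__(
        self,
        ttl_seconds: float = DAY_SECONDS,
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def put(self, key: str, value: bytes) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value
```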

For registry-ui I've also thought about caching more things locally in the browser, manifests especially, or even small blobs, for speed and further cost lowering.

In any case, this looks awesome, and I look forward to seeing what you do with it! 😄

sajayantony commented 1 year ago

Very cool @aidansteele - Are you considering providing history for something like https://ima.ge.cx/golang:1.19, or planning to add history for different digests? It's one of my pet features that I hope we standardize at some point.

Some questions -

How are you planning to distinguish images from different registries? For example golang:1.19 exists in ECR and docker hub and the digest would most likely be the same but need not be.
If you don't cache the data that is being pulled from the different endpoints/CDNs that back most of these public registries, I do think costs can be an issue, as others before me have pointed out.
Lastly does it make sense to compose this with some kind of reverse cache proxy?

imjasonh commented 1 year ago

How are you planning to distinguish images from different registries? For example golang:1.19 exists in ECR and docker hub and the digest would most likely be the same but need not be.

ECR's copy of golang:1.19 has a different name, public.ecr.aws/docker/library/golang:1.19. If you want to load ECR's copy, you can use that specific name.
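To illustrate how the name alone can identify the registry, here is a sketch of the usual Docker naming convention: if the first path component looks like a hostname (contains a `.` or `:`, or is `localhost`), it is the registry; otherwise the image is assumed to live on Docker Hub, with single-component names placed under `library/`. The `ImageRef` type and `parse_image_ref` function are hypothetical, and digest references (`@sha256:...`) are deliberately left out to keep it short.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str
    repository: str
    tag: str


def parse_image_ref(ref: str) -> ImageRef:
    """Split an image reference into registry, repository and tag.
    Digest references are not handled in this sketch."""
    first, sep, rest = ref.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        # No hostname-like first component: default to Docker Hub.
        registry, remainder = "docker.io", ref

    name, colon, tag = remainder.rpartition(":")
    if not colon or "/" in tag:
        # No tag given.
        name, tag = remainder, "latest"

    if registry == "docker.io" and "/" not in name:
        name = "library/" + name  # official images live under library/
    return ImageRef(registry, name, tag)
```

With this, `golang:1.19` and `public.ecr.aws/docker/library/golang:1.19` resolve to different registries, so any cache key should include the registry as well as the repository and digest.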

If you don't cache the data that is being pulled from the different endpoints/CDNs that back most of these public registries, I do think costs can be an issue, as others before me have pointed out.

Lastly does it make sense to compose this with some kind of reverse cache proxy?

Honestly I don't think it's worth over-optimizing for cost, at least while it's got relatively little usage. kontain.me gets a little usage by me and a few others, and costs me $2-3/month. I don't know what explore.ggcr.dev costs, but it can't be that much.

I'd recommend putting spend caps and alerts in place ASAP, and trying not to do anything exorbitantly expensive, but otherwise if it takes off and is wildly popular such that it costs thousands per month, you can shave off costs knowing where the costs actually come from, or find a corporate sponsor for it.

aidansteele / ima.ge.cx

Be a good registry client #13