Packages upload to ghcr

Hind-M commented 1 year ago

Checklist

[x] Used a personal fork of the feedstock to propose changes
[x] Reset the build number to 0 (if the version changed)
[x] Re-rendered with the latest conda-smithy (Use the phrase code>@<space/conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
[x] Ensured the license file is being packaged.

conda-forge-linter commented 1 year ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

hmaarrfk commented 1 year ago

i'm mostly passing by, but can you share some context of what this is? is there an other issue open about this?

Hind-M commented 1 year ago

i'm mostly passing by, but can you share some context of what this is? is there an other issue open about this?

Hey! So this is related to the wanted feature to add packages upload to Github container registry in addition to anaconda.org. This PR is also related and should probably be merged before.

Hind-M commented 1 year ago

This upload cannot go here. We have to do it via the webservices in order to verify the artifacts.

Oh ok! Where exactly do you suggest doing it? Is it somewhere here or somewhere else or in a another repo? Thanks!

beckermr commented 1 year ago

Somewhere else completely. We'll need to do it either on the heroku server or using a dispatch to github actions.

cc @wolfv for viz

wolfv commented 1 year ago

Yeah, I think there are quite some considerations that we still have to do in terms of where to put this functionality.

Regarding the verification one could also do that via repodata (which is not automatically generated at this point). So the package could be uploaded to the OCI registry, but only added to the repodata after passing the validation step (and otherwise be removed again from the OCI registry). Just a thought.

It would be cool though to start to put together a standalone feedstock that does the upload-after-build to the OCI registry.

If we want to do the upload in the Heroku server, then this is probably the code (https://github.com/conda-forge/conda-forge-webservices/blob/ac84983eb66239c8d3bd6f5fb8b3297f709d2f8d/conda_forge_webservices/webapp.py#L498)

beckermr commented 1 year ago

So the heroku server can't itself do the upload. It'd grind to a halt. We'll need to dispatch out to another service. Or we need to stage into one OCI registry and copy to another via an api call.

wolfv commented 1 year ago

We can also use tags (e.g. 0.25.2_blalba_staging) and then just change the tag.

beckermr commented 1 year ago

As long as we don't ship repodata pointing to tags that'd be fine.

beckermr commented 1 year ago

Actually I'm not sure labels/tags will work. We shouldn't have keys to upload to our registry in feedstocks out in the open. We need a staging area and then a secured copy.

Hind-M commented 1 year ago

IIUC, we could upload to ghcr.io the same way it is done with anaconda.org - using a staging area and then copy to the prod, couldn't we? If so, we could/should keep the upload in the upload_or_check_non_existence.py in this repo and add the copy part (and maybe additional missing stuff) from cf-staging to conda-forge in the webservices (webapp.py)?

beckermr commented 1 year ago

Yes, a staging area could work. However, remember that the copy from cf-staging to conda-forge on anaconda.org is a simple HTTP request made to anaconda.org once the package data has been validated. We never download and reupload packages. So to make the ghcr stuff work on our webservices instance, you'll need to find a similar HTTP API endpoint. A similar HTTP endpoint also needs to return the package hash for validation.

DerThorsten commented 1 year ago

I am trying to figure out what would be needed to move forward with the GitHub OCI upload:

Yes, a staging area could work. However, remember that the copy from cf-staging to conda-forge on anaconda.org is a simple HTTP request made to anaconda.org once the package data has been validated. We never download and reupload packages. So to make the ghcr stuff work on our webservices instance, you'll need to find a similar HTTP API endpoint. A similar HTTP endpoint also needs to return the package hash for validation.

I am relatively new in the world of OCI registries, so forgive me if I am confusing things :) but I tried to look into the specs to find such an API endpoint. The open-container spec mentions an endpoint which might be helpful to avoid a download-reupload

"If a necessary blob exists already in another repository within the same registry, it can be mounted into a different repository via a POST request [...]" https://github.com/opencontainers/distribution-spec/blob/main/spec.md#mounting-a-blob-from-another-repository

beckermr commented 1 year ago

Sure that looks promising but I know nothing about OCI registries. I'll leave this to you and @wolfv to work out. Ideally, we could wrap the copy in the conda oci package @wolfv has going so it is easy to use.

We have some security requirements here related to tokens that I will share with @wolfv privately once the copy is working.

jaimergp commented 1 year ago

I've been thinking about this and doing some research. This is not a definitive assessment but a work in progress. I am not saying all the following is a good idea, but at least it takes us to the realm of what's feasible today.

The main concern right now is how to do staging in a safe way. conda-forge uses the cf-staging Anaconda.org channel where all feedstocks upload their artifacts. If the artifacts pass validation, a webservice copies them from cf-staging to conda-forge. Anaconda.org services will then index all conda-forge packages in the corresponding repodata.json.

Staging serves two purposes then:

Limiting access to the main channel
Avoiding early publication of a problematic artifact

How do we do this with OCI artifacts? The limitations are:

We only have one organization channel-mirrors so far, so feedstocks would get access to the "main" channel. This might not be as problematic as it sounds, but we need to ensure that's the case. [^2]
Artifact metadata needs to be added before the upload, and I am not aware of a mechanism that allows metadata modification after the upload
Copying artifacts from one channel to other involves downloads, uploads and some API calls (definitely more expensive than the single COPY request to Anaconda.org) [^1]
We need to handle our own conda-index equivalent process with remote artifacts

[^1]: I read the OCI spec and apparently it supports the notion of “mounting blobs” from other registries. This means it could mimic the cf-staging to conda-forge setup in Anaconda.org. The Github Packages API doesn’t seem to support mounting though. There are also some issues online about it, and still open.

[^2]: Permission-wise, GH distinguishes between read, write and delete, which means that a properly scoped token used by feedstocks could maybe just write too many things, but in no way delete existing blobs. Note these tokens are NOT fine-grained. There’s also a 30-day restore window if necessary. Deleted packages are available in the Settings UI. Package overwriting shouldn't be possible (it would be a different hash anyway). The risks of a cross-feedstock publication are low as long as we have a validation process in-place.

So, all in all, I think that we can run everything off the channel-mirrors organization. We just need to devise a different staging mechanism. I suggest:

We mimic what Homebrew does and publish our own repodata using a similar approach. [^3]
Come up with a way to mark an artifact as ready for publication, after an upload. Annotations and labels seem to be pre-upload only, but maybe GH has a field we can use, like visibility or something. [^4]
Let feedstocks upload (only upload) to channel-mirrors and have the validation service run the needed checks on the new artifacts.
If it passes, the required metadata is modified accordingly, and the artifact will be published to the repodata in the next scheduled run.
If it doesn't, the required metadata won't be present, and the package will be deleted in the next scheduled run (different workflow than in step 4). Accidental deletions can still be recovered in the 30-day window.

[^3]: See how homebrew does this with 15-min scheduled jobs; even the API is pre-generated JSON deployed to GH Pages in an environment. Their biggest payload is 20MB pure JSON though. These point to sha256 headers in GHCR.io. Search uses algolia too!

[^4]: See this for OCI annotations. I don't know if it can be added after an upload. What about tags? Can these be modified, added or removed? Right now tags encode the version and the build string. The UI does distinguish tagged vs untagged.

Hind-M commented 1 year ago

Come up with a way to mark an artifact as ready for publication, after an upload. Annotations and labels seem to be pre-upload only, but maybe GH has a field we can use, like visibility or something.

I believe labels were superseded by annotations (https://github.com/opencontainers/image-spec/blob/main/annotations.md#back-compatibility-with-label-schema) and these cannot indeed be edited after building the artifact, but there is this interesting solution where we could add annotations to existing artifacts creating a separate ORAS Artifact Manifest referring to the original one, having the same digest, and being in the same repository. I suppose that tags could also be a solution.
Visibility does exist apparently, see listing packages for an organization, and can be public, private or internal.
For the staging strategy, to be sure that I understood correctly, do you mean not using a staging area and distinguish the artifacts which are ready only with metadata/annotations? When you say running everything off the channel-mirrors organization, where would it be? (within the organization of the corresponding GH repository we are packaging for example?).

jaimergp commented 1 year ago

but there is this interesting solution where we could add annotations to existing artifacts creating a separate ORAS Artifact Manifest referring to the original one, having the same digest, and being in the same repository.

That repo (johnsonshi/annotate-registry-artifacts) is indeed interesting. I am concerned about the permissions here, because in principle any feedstock could add the metadata bit to say "yea it is a valid artifact", unless we put that info somewhere else 🤔 Or maybe we need to check.

About visibility, I read a bit more into it and, while it could work, we must notice that:

Warning: Once you make a package public, you cannot make it private again.

So we would have to upload it as private, then run the validation and either publish as public or delete. I don't know if the amount of packages marked as "private" count towards some kind of quota but hopefully the number of artifacts that are marked as such at a given time is a small one.

For the staging strategy, to be sure that I understood correctly, do you mean not using a staging area and distinguish the artifacts which are ready only with metadata/annotations?

Correct, that's my proposal so far.

When you say running everything off the channel-mirrors organization, where would it be? (within the organization of the corresponding GH repository we are packaging for example?).

Maybe a repo like channel-mirrors/index or channel-mirrors/repodata. Maybe this can be published to the OCI registry too (instead of GH pages) but it needs to run on some sort of cronjob anyway and I am assuming the Homebrew folks decided for GH Pages for a good reason.

jaimergp commented 1 year ago

We discussed this approach in the monthly bot meeting and Matt raised a point I had not considered: the repodata.json schema doesn't allow external URLs for packages; it assumes that files will be co-located next to the repodata.json. So either:

we provide a thousand redirection endpoints in the GH Pages "channel" or...
we submit the necessary CEPs to adjust the repodata schema to (optionally) allow for external URLs, which would take precedence over the next-to-repodata assumption

jaimergp commented 1 year ago

@Hind-M and I met with @wolfv today and discussed potential alternatives:

Staging:

Instead of a single organization (channel-mirrors), we can add a second one; e.g. channel-mirrors-staging
Feedstocks upload to staging, and do not have access to production
A cronjob at channel-mirrors will periodically run validation checks on staging, and promote the valid packages to production. If they don't pass, they are deleted.

Repodata publication:

Only for packages in channel-mirrors
It can be served as an OCI artifact, or in GH Pages à la brew.
Need to add a plugin to conda to handle OCI-backed channels. This plugin will also be responsible of "figuring out" the OCI URL for each package in the repodata.json. We might just need an "endpoint" URL in the repodata header instead of per-artifact url.

Some other notes:

$GITHUB_TOKEN can't be used in GHA workflows with packages due to scale problems. Instead, one need to supply a packages:write PAT secret to the workflow. In conda-forge, this is best done via our token app or a bot account so the PAT is not tied to a personal account.
OCI operations (e.g. artifact download) are authenticated. The PAT is only needed to request the token, but the actual operation happens with a different one. This means that we can (theoretically) add more conditions to the per-operation token minting and further reduce the scope to a single package name or stuff like that.

isuruf commented 10 months ago

Instead of a single organization (channel-mirrors), we can add a second one; e.g. channel-mirrors-staging A cronjob at channel-mirrors will periodically run validation checks on staging, and promote the valid packages to production. If they don't pass, they are deleted.

I'm not sure what the difference with this approach and using a cronjob at channel-mirros to download from anaconda.org and push to channel-mirrors org directly.

Hind-M commented 4 weeks ago

Instead of a single organization (channel-mirrors), we can add a second one; e.g. channel-mirrors-staging A cronjob at channel-mirrors will periodically run validation checks on staging, and promote the valid packages to production. If they don't pass, they are deleted.

I'm not sure what the difference with this approach and using a cronjob at channel-mirros to download from anaconda.org and push to channel-mirrors org directly.

Because we want to do it independently from anaconda.org.

conda-forge / conda-forge-ci-setup-feedstock

Packages upload to ghcr #208