conda / ceps

Conda Enhancement Proposals
Creative Commons Zero v1.0 Universal

CEP for the OCI storage of conda packages & repodata #70

Open wolfv opened 5 months ago

wolfv commented 5 months ago

Fill in the technical details of the OCI registry storage for reference.

jaimergp commented 5 months ago

we should say that conda / mamba / rattler use oci:// as the scheme.

What does an oci:// URL look like? Is it a direct translation of the anaconda.org URL, with the artifact name manipulated? How is the server referred to? Basically, which one is correct:

Is it possible to refer to a layer directly by URL? In case of repodata:

I think this needs to be standardized too.

wolfv commented 5 months ago

Yeah, the nice thing is that the URLs map directly to the OCI registry. There just needs to be some post-processing for the tags.

E.g. instead of

https://conda.anaconda.org/conda-forge/linux-64/repodata.json

We ask for

oci://ghcr.io/channel-mirrors/conda-forge/linux-64/repodata.json:latest

And for packages

https://conda.anaconda.org/conda-forge/linux-64/numpy-1.23.1-h123123.tar.bz2

We ask for

oci://ghcr.io/channel-mirrors/conda-forge/linux-64/numpy:1.23.1-h123123
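
The translation in the two examples above could be sketched roughly as follows. This is only an illustration, not the code used by conda, mamba or rattler: the function name `conda_url_to_oci` and the constants are hypothetical, the host and namespace are simply copied from the examples, and no tag escaping is applied.

```python
from urllib.parse import urlparse

# Assumed values, copied from the examples above; other mirrors would differ.
OCI_HOST = "ghcr.io"
OCI_NAMESPACE = "channel-mirrors"
ARCHIVE_EXTENSIONS = (".tar.bz2", ".conda")


def conda_url_to_oci(url: str) -> str:
    """Translate a conda channel URL into the oci:// form shown above."""
    # The path is expected to look like /<channel>/<subdir>/<filename>.
    channel, subdir, filename = urlparse(url).path.strip("/").split("/")
    base = f"oci://{OCI_HOST}/{OCI_NAMESPACE}/{channel}/{subdir}"

    for ext in ARCHIVE_EXTENSIONS:
        if filename.endswith(ext):
            # <name>-<version>-<build><ext>; the name itself may contain
            # dashes, so split from the right.
            name, version, build = filename[: -len(ext)].rsplit("-", 2)
            return f"{base}/{name}:{version}-{build}"

    # Anything else (e.g. repodata.json) is addressed with the "latest" tag.
    return f"{base}/{filename}:latest"


print(conda_url_to_oci(
    "https://conda.anaconda.org/conda-forge/linux-64/numpy-1.23.1-h123123.tar.bz2"
))
```

Versions or names that are not legal in an OCI registry would still need the extra post-processing mentioned above.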

wolfv commented 5 months ago

One thing that also needs to go into this document is how tags are formatted. Some versions cannot be translated to tags directly because of the rules of OCI registries.

The functions are here:

https://github.com/mamba-org/rattler/blob/a2073aa6b92196c50208d39fc6b6c67469bf7810/crates/rattler_networking/src/oci_middleware.rs#L77-L92

And when a package name starts with _ (which is also illegal in an OCI registry), we map the _ to zzz: https://github.com/mamba-org/rattler/blob/a2073aa6b92196c50208d39fc6b6c67469bf7810/crates/rattler_networking/src/oci_middleware.rs#L158-L161

jaimergp commented 5 months ago

Food for thought: implement subdirs with platform metadata per layer.

jaimergp commented 5 months ago

Some versions cannot be translated to tags directly because of the rules of OCI registries.

For reference, rules are here: https://github.com/opencontainers/distribution-spec/blob/v1.0/spec.md#pulling-manifests
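
To make the restriction concrete, here is a small sketch that checks names and tags against the patterns given in that v1.0 spec section. The helper names are hypothetical, and the example package names and versions are made up for illustration.

```python
import re

# Patterns as given in the OCI distribution spec v1.0 ("Pulling manifests").
NAME_RE = re.compile(r"[a-z0-9]+([._-][a-z0-9]+)*(/[a-z0-9]+([._-][a-z0-9]+)*)*")
TAG_RE = re.compile(r"[a-zA-Z0-9_][a-zA-Z0-9._-]{0,127}")


def is_valid_oci_name(name: str) -> bool:
    """True if the whole string is a legal repository name."""
    return NAME_RE.fullmatch(name) is not None


def is_valid_oci_tag(tag: str) -> bool:
    """True if the whole string is a legal tag (at most 128 characters)."""
    return TAG_RE.fullmatch(tag) is not None


# A path component may not start with an underscore.
print(is_valid_oci_name("conda-forge/linux-64/numpy"))
print(is_valid_oci_name("conda-forge/linux-64/_libgcc_mutex"))

# "+" and "!" are legal in conda versions but not in tags.
print(is_valid_oci_tag("1.23.1-h123123"))
print(is_valid_oci_tag("1.0+cuda-0"))
print(is_valid_oci_tag("1!2.0-0"))
```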

Hind-M commented 5 months ago

* we should say that conda / mamba / rattler use `oci://` as the scheme.

Hmm, AFAIK we are using https as the scheme in mamba instead: https://ghcr.io/... Last time I checked, I think curl was complaining about an unknown oci protocol...

wolfv commented 5 months ago

Yeah, but regular HTTPS requests don't work. That's why we need a middleware or some other layer that converts oci://... requests to https:// before sending them to cURL.
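
As a rough sketch of the first step such a layer would take, the snippet below rewrites an oci:// URL into the manifest endpoint defined by the OCI distribution spec (`/v2/<name>/manifests/<reference>`). The function name is hypothetical, and this is not the actual mamba or rattler middleware. A real client would also need to obtain a registry token and then fetch the blob referenced by the manifest, neither of which is shown.

```python
def oci_to_manifest_url(oci_url: str) -> str:
    """Rewrite oci://<host>/<name>:<tag> to the HTTPS manifest endpoint."""
    prefix = "oci://"
    if not oci_url.startswith(prefix):
        raise ValueError(f"not an oci:// URL: {oci_url}")

    # The first path segment is the registry host; the rest is <name>:<tag>.
    host, _, path = oci_url[len(prefix):].partition("/")
    name, _, tag = path.rpartition(":")
    if not host or not name or not tag:
        raise ValueError(f"expected oci://<host>/<name>:<tag>, got: {oci_url}")

    return f"https://{host}/v2/{name}/manifests/{tag}"


print(oci_to_manifest_url(
    "oci://ghcr.io/channel-mirrors/conda-forge/linux-64/numpy:1.23.1-h123123"
))
```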

Hind-M commented 4 months ago

Added a commit but couldn't push it here.

wolfv commented 4 months ago

Looks good, @Hind-M. Maybe you can make a PR against my branch (https://github.com/wolfv/ceps/tree/oci-cep). I can then merge it there, and the PR will be updated.

Hind-M commented 4 months ago

Here you are https://github.com/wolfv/ceps/pull/3. Thanks!

Hind-M commented 2 months ago

The rendered version for convenience: https://github.com/wolfv/ceps/blob/oci-cep/cep-oci.md

Hind-M commented 1 month ago

@conda/steering-council

This vote falls under the "Enhancement Proposal Approval" policy of the conda governance policy, please vote and/or comment on this proposal at your earliest convenience.

If you have questions concerning the proposal, you may also leave a comment or code review.

It needs 60% of the Steering Council to vote yes to pass.

This vote will end on 2024-09-02, End of Day, Anywhere on Earth (AoE).

To vote, please use the form below:

@xhochy (Uwe Korn)

@CJ-Wright (Christopher J. 'CJ' Wright)

@mariusvniekerk (Marius van Niekerk)

@goanpeca (Gonzalo Peña-Castellanos)

@chenghlee (Cheng H. Lee)

@ocefpaf (Filipe Fernandes)

@marcelotrevisani (Marcelo Duarte Trevisani)

@msarahan (Michael Sarahan)

@mbargull (Marcel Bargull)

@jakirkham (John Kirkham)

@jezdez (Jannis Leidel)

@wolfv (Wolf Vollprecht)

@jaimergp (Jaime Rodríguez-Guerra)

@kkraus14 (Keith Kraus)

@baszalmstra (Bas Zalmstra)

beckermr commented 1 month ago

I am emeritus so I cannot vote here, but IMHO the issues around encoding/decoding above need to be resolved before this spec can be passed.

jakirkham commented 1 month ago

It seems like there is a fair amount of discussion still happening here for an active vote. Should we cancel the vote and reschedule after that discussion has reached a conclusion?

Hind-M commented 1 month ago

It seems like there is a fair amount of discussion still happening here for an active vote. Should we cancel the vote and reschedule after that discussion has reached a conclusion?

I think we can wait and see if we reach a conclusion before the voting deadline, and possibly extend it if necessary. If not, we would definitely reschedule. I'm not familiar with the usual processes (of voting and such), so if this isn't the correct approach, please let me know and I'll adjust accordingly.

jaimergp commented 1 month ago

This is similar to symbol name mangling I guess? Perhaps we should prefix all packages with a common prefix and a special one for underscores?

I like this proposal by @baszalmstra, but we will need to remirror the whole thing.

Another alternative, as proposed by @wolfv, is to SHA256 encode the names and forget about it. You can map back by checking the internal metadata. This will also require a complete remirror.

But we can also simply SHA256 encode the name of container images that start with an underscore and leave everything else untouched. This wouldn't require a full remirror.
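
A minimal sketch of that last option, assuming the hex digest of the UTF-8 package name is used as the image name (the function name and that exact choice are assumptions, not something decided in this thread):

```python
import hashlib


def encode_image_name(name: str) -> str:
    """Hash only names that start with an underscore; keep the rest as-is."""
    if name.startswith("_"):
        # A SHA256 hex digest is 64 lowercase hex characters, so it is a
        # legal OCI name component and has a fixed length.
        return hashlib.sha256(name.encode("utf-8")).hexdigest()
    return name


print(encode_image_name("numpy"))
print(encode_image_name("_libgcc_mutex"))
```

The original name cannot be recovered from the digest, so it has to be read back from the package metadata, as noted above.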

beckermr commented 1 month ago

I think the name length limit is the bigger issue here though. I guess we'll need a fixed length hash of the name.

wolfv commented 1 month ago

Let me clarify some things:

Name attacks are not really possible (for now) since we also mirror the repodata from conda-forge / Anaconda.org. From the repodata, we just use the SHA256 hash to directly reference the right blob. The names & tags for the individual packages are mainly "cosmetic".

Even if we ran our own indexing, we would refer back to the stored index.json file (and the name in there) and not derive the original name from the OCI image. So I am personally not super concerned by this.

But I can sympathize with a mapping that would disallow any such overlaps.

I also think that re-mirroring might not be a big issue, since we just need to move the names / tags in the OCI registry to the right places (SHA hashes and packages will stay the same).

Hind-M commented 1 month ago

So just to wrap up here:

Should we think of handling exceeding the max limit of characters as well? (I'm not sure if this is something already taken care of).

beckermr commented 1 month ago

We CANNOT only prepend to packages that start with an underscore. This would declare part of the package namespace off limits for all conda users. That action requires at minimum a separate CEP where we codify package names formally. We have to prepend to everything.

And yes we need to handle the max characters limit properly in this CEP.
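
To illustrate the namespace problem, here is a sketch comparing the two approaches. Both functions are hypothetical: the first assumes the leading underscore is replaced by zzz, and the prefixes `u` and `c` in the second are made up for the example.

```python
def underscore_only(name: str) -> str:
    # Assumed reading of "_ -> zzz": only names with a leading underscore change.
    return "zzz" + name[1:] if name.startswith("_") else name


def prefix_everything(name: str) -> str:
    # Hypothetical scheme: every name gets a prefix, so the first character
    # of the result always tells us which rule was applied.
    return "u" + name[1:] if name.startswith("_") else "c" + name


# Two different packages end up with the same image name.
print(underscore_only("_foo"), underscore_only("zzzfoo"))

# With a prefix on everything, the results stay distinct.
print(prefix_everything("_foo"), prefix_everything("ufoo"), prefix_everything("foo"))
```

If the underscore were kept instead (`zzz_foo`), the same clash would occur with a package literally named `zzz_foo`.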

dholth commented 1 month ago

Where is the specification for the length limit? Do users have business browsing through the raw OCI anyway?

beckermr commented 1 month ago

Tags have a limit of 128 characters: https://docs.docker.com/reference/cli/docker/image/tag/ Names have practical limits. See the implementers note in the OCI spec for pull: https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pull.

It is hard to know if people want to browse an OCI registry. GitHub certainly provides searches based on image names. See for example the existing OCI mirror (https://github.com/orgs/channel-mirrors/packages), which is based on a beta spec.

Hind-M commented 1 month ago

FYI, and considering the still ongoing discussions, the vote has been postponed to a later date that is yet to be determined.