conda / ceps

Conda Enhancement Proposals
Creative Commons Zero v1.0 Universal

Draft CEP for `.conda` package format #42

Open jakirkham opened 1 year ago

jakirkham commented 1 year ago

It would be good to have a CEP that spells out what is in the .conda format, as this is missing atm, especially since we increasingly rely on it and depend on a few tools to read and write these files. Currently the info we have, which could be used for this CEP, is...

Would be good to pull this together to provide a single point of truth.

Independently, there are some things we might want to consider when amending the specification, like generating/reusing a Zstandard dictionary for faster and more compact compression/decompression, and having per-file-format dictionaries (text files may benefit a lot from this, for example).

leofang commented 1 year ago

It'd be nice to also get this page updated: https://docs.conda.io/projects/conda-build/en/latest/resources/package-spec.html

jakirkham commented 1 year ago

Would suggest raising a new conda-build doc issue

dholth commented 1 year ago

So .conda packages are ZIP-format containers with a metadata.json file containing just the version number, plus an info- file and a pkg- file that are always .tar.zst, even though some earlier documentation hoped to support "any libarchive filter". The order of metadata, info and pkg inside the ZIP does not matter.
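As a rough illustration of the layout described above, here is a minimal sketch that summarizes the outer ZIP without touching the inner tarballs. It uses only the standard library. The function names, the `example-1.0-0` stem, and the placeholder version value `2` are assumptions for the demo, and the fake container does not hold real zstd data.

```python
import io
import json
import zipfile


def describe_conda_container(data: bytes) -> dict:
    """Summarize the outer ZIP of a .conda file without unpacking the inner tarballs."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        # metadata.json is described above as holding just the format version.
        metadata = json.loads(zf.read("metadata.json"))
    return {
        "metadata": metadata,
        # Member order is not significant, so select by name rather than position.
        "info": [n for n in names if n.startswith("info-") and n.endswith(".tar.zst")],
        "pkg": [n for n in names if n.startswith("pkg-") and n.endswith(".tar.zst")],
    }


def make_fake_conda(stem: str = "example-1.0-0") -> bytes:
    """Build a placeholder container (hypothetical helper); inner members are NOT real zstd tarballs."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        # Deliberately written out of order to show that order does not matter.
        zf.writestr(f"pkg-{stem}.tar.zst", b"placeholder")
        zf.writestr("metadata.json", json.dumps({"conda_pkg_format_version": 2}))
        zf.writestr(f"info-{stem}.tar.zst", b"placeholder")
    return buf.getvalue()


if __name__ == "__main__":
    print(describe_conda_container(make_fake_conda()))
```

Reading the actual package contents would additionally require a Zstandard decompressor for the two inner members, which this sketch leaves out.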

Put together, the pkg- and info- tarballs have exactly the same contents as old-format .tar.bz2 conda packages. Generally the info/ subdirectory of a .tar.bz2 package goes into the info- tarball of a .conda.

conda-package-handling uses a list of regular expressions to determine which files go into info/, but this list excludes some files that obviously belong in info/ - for example info/LICENSE vs info/LICENSE.txt. We should audit the existing packages to see whether we can drop this behavior and simply include info/ wholesale. Do packages include significant application data in info/ (besides test data, which is already intentionally in info/)?
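For comparison, the "include info/ wholesale" alternative could look something like the sketch below. This is not conda-package-handling's current logic (which is regex-based, as noted above); `split_info_wholesale` is a hypothetical name and the file list is made up.

```python
def split_info_wholesale(paths):
    """Partition archive paths: everything under info/ goes to the info- tarball."""
    info, pkg = [], []
    for path in paths:
        # Match the directory itself or anything beneath it, but not
        # look-alike names such as "information/".
        if path == "info" or path.startswith("info/"):
            info.append(path)
        else:
            pkg.append(path)
    return info, pkg


if __name__ == "__main__":
    example = [
        "info/index.json",
        "info/LICENSE",
        "info/LICENSE.txt",
        "lib/libfoo.so",
        "information/readme",
    ]
    info_files, pkg_files = split_info_wholesale(example)
    print(info_files)
    print(pkg_files)
```

With a rule like this, info/LICENSE and info/LICENSE.txt necessarily land in the same tarball, which is the property the audit would need to confirm is safe.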

A regular conda install unpacks both inner .tar.zst files and does not use the "easy to inspect just the metadata" feature provided by the info/pkg split. This is still good, because zst is much, much faster to extract compared to bz2.

We might want to standardize whether info- or pkg- gets extracted first, or enforce that one cannot overwrite the other (that no filename appears in both inner tarballs).
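The "no filename appears in both inner tarballs" rule could be checked roughly as follows. This is only a sketch: it works on already-decompressed tar streams (the zstd layer is left out so that it runs on the standard library alone), and `member_names`, `find_clobbers` and `make_tar` are hypothetical helpers, not existing conda APIs.

```python
import io
import posixpath
import tarfile


def member_names(tar_bytes: bytes) -> set:
    """Return normalized non-directory member names from an uncompressed tar stream."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        return {posixpath.normpath(m.name) for m in tf.getmembers() if not m.isdir()}


def find_clobbers(info_tar: bytes, pkg_tar: bytes) -> list:
    """List paths present in both tarballs, i.e. files one would overwrite in the other."""
    return sorted(member_names(info_tar) & member_names(pkg_tar))


def make_tar(files: dict) -> bytes:
    """Build a small in-memory tar for demonstration purposes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, payload in files.items():
            entry = tarfile.TarInfo(name)
            entry.size = len(payload)
            tf.addfile(entry, io.BytesIO(payload))
    return buf.getvalue()


if __name__ == "__main__":
    info_tar = make_tar({"info/index.json": b"{}", "info/about.json": b"{}"})
    pkg_tar = make_tar({"lib/libfoo.so": b"\x00", "info/index.json": b"oops"})
    print(find_clobbers(info_tar, pkg_tar))
```

An extractor could refuse to proceed when the returned list is non-empty, which would make the extraction order irrelevant.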

Separate from the .conda container is the shared question of what the metadata looks like. This probably has to be a different, longer document.

jakirkham commented 1 year ago

Forget where this was discussed atm, but recall one point of confusion was whether conda_pkg_format_version should be an int or a str. Would be nice to resolve this as part of this work.
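Until that is settled, a reader could tolerate both spellings. The sketch below is one possible approach, not what any existing tool does: `read_format_version` is a hypothetical helper, and normalizing to `int` is an assumption rather than a decision recorded in this thread.

```python
import json
import re


def read_format_version(metadata_json: str) -> int:
    """Parse conda_pkg_format_version, accepting either 2 or "2"."""
    raw = json.loads(metadata_json)["conda_pkg_format_version"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(raw, int) and not isinstance(raw, bool):
        return raw
    if isinstance(raw, str) and re.fullmatch(r"[0-9]+", raw.strip()):
        return int(raw.strip())
    raise ValueError(f"unsupported conda_pkg_format_version: {raw!r}")


if __name__ == "__main__":
    print(read_format_version('{"conda_pkg_format_version": 2}'))
    print(read_format_version('{"conda_pkg_format_version": "2"}'))
```

A CEP could instead pick one type and require writers to emit it, in which case the lenient branch would only matter for already-published packages.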

jaimergp commented 1 year ago

> We might want to standardize whether info- or pkg- gets extracted first, or enforce that one cannot overwrite the other (that no filename appears in both inner tarballs).

Yeah, clobbered files in info/ (i.e. the package overwrites conda metadata) should be prevented with an error by conda-build (and the like) before the artifact is generated.

dholth commented 1 year ago

I don't think the normal way of creating .conda can create clobbered files. It takes a list of filenames and categorizes them into two groups. The check would need to be on extraction.

jaimergp commented 1 year ago

No, but conda-build can infer which files have gotten into info/ and flag those that would result in a clobber error, I think?