conda / conda-package-handling

Create and extract conda packages of various formats
https://conda.github.io/conda-package-handling/
BSD 3-Clause "New" or "Revised" License

Explanation of info- & pkg ordering #182

Closed by wolfv 1 year ago

wolfv commented 1 year ago

I see that the recent release changed the order of info and pkg archives in the .conda format (it's mentioned in the Changelog as well). I tried to go through some PRs but couldn't find the reasoning for the change. Would be curious to hear why this was done :)

mbargull commented 1 year ago

Would be good to document this, yes. I can't say why it was changed, but a good explanation is that the outer archive is a Zip file, so its index (the central directory) sits at the end of the file. If you put the info-*.tar.zst at the end too, you can fetch the metadata with a single read, whether from disk or from an HTTP server. (With the older .tar.bz2 format you'd want info at the beginning of the index-less tarball, of course.)
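To make the layout argument concrete, here is a minimal sketch using only the standard library. It builds a small zip with the pkg member written before the info member and checks how many trailing bytes cover both the info member and the central directory. The member names and payloads are made up for illustration, and the inner members are plain bytes rather than real zstd-compressed tarballs.

```python
import io
import zipfile


def build_conda_like(members):
    """Write (name, payload) pairs, in order, into an uncompressed in-memory zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, payload in members:
            zf.writestr(name, payload)
    return buf.getvalue()


# Hypothetical member names and payloads, chosen only to mimic the size ratio.
data = build_conda_like([
    ("metadata.json", b'{"conda_pkg_format_version": 2}'),
    ("pkg-example-1.0-0.tar.zst", b"P" * 100_000),  # large payload
    ("info-example-1.0-0.tar.zst", b"I" * 2_000),   # small metadata
])

with zipfile.ZipFile(io.BytesIO(data)) as zf:
    # header_offset is where each member's local file header starts.
    offsets = {zi.filename: zi.header_offset for zi in zf.infolist()}

info_start = offsets["info-example-1.0-0.tar.zst"]
tail_size = len(data) - info_start  # info member + central directory + end record

print(f"archive size: {len(data)} bytes")
print(f"info member and zip index fit in the last {tail_size} bytes")
```

With info written last, one read of the tail is enough for both the index and the metadata, provided the reader guesses a large enough tail.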

wolfv commented 1 year ago

Hmm, although you don't know beforehand how large the info.tar.zst file is, right? You mean one would fetch N bytes and hope that it covers both the zip-index and info.tar.zst part?

baszalmstra commented 1 year ago

Wouldn't it make much more sense to put it at the start? If I understand zip correctly, every file in the zip is preceded by a zip local file header. If we always put the info archive at the start of the zip, we could stream the contents of the entire file with a regular GET request, since the local file header contains all the information you need. There would be no need to inspect the zip's central directory at all, which would really simplify the handling. It would actually be similar to how the tar.bz2 files are handled currently.
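A rough sketch of that streaming idea, reading members front to back from their local file headers without touching the central directory. It assumes stored (uncompressed) members whose sizes are recorded in the local header, i.e. the data-descriptor flag (bit 3) is not set; `iter_members_streaming` and the member names are hypothetical, not part of any conda library.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a zip local file header: signature, version, flags,
# method, mod time, mod date, CRC-32, compressed size, uncompressed size,
# file name length, extra field length.
LOCAL_HEADER = struct.Struct("<4s5H3I2H")


def iter_members_streaming(stream):
    """Yield (name, payload) pairs by walking local file headers in order."""
    while True:
        fixed = stream.read(LOCAL_HEADER.size)
        if len(fixed) < LOCAL_HEADER.size:
            return
        (sig, _ver, flags, method, _time, _date,
         _crc, csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack(fixed)
        if sig != b"PK\x03\x04":
            return  # reached the central directory (or something unexpected)
        if flags & 0x08 or method != 0:
            raise ValueError("sketch only handles stored members with known sizes")
        name = stream.read(name_len).decode("utf-8")
        stream.read(extra_len)  # skip the extra field
        yield name, stream.read(csize)


# Build a test archive with the info member first (the ordering proposed here).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("info-example-1.0-0.tar.zst", b"metadata bytes")
    zf.writestr("pkg-example-1.0-0.tar.zst", b"P" * 10_000)
buf.seek(0)

first_name, first_payload = next(iter_members_streaming(buf))
bytes_consumed = buf.tell()
print(first_name, len(first_payload), bytes_consumed)
```

Only the first header and payload are consumed, so a reader could drop the connection right after the info member. Archives written to a non-seekable stream usually set the data-descriptor flag, in which case the sizes are not in the local header and this approach needs more work.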

Having the central directory of the zip at the end really makes things hard.

Obviously too late now because .conda files are already widespread. 🤷

dholth commented 1 year ago

conda-package-streaming has good support for reading partial remote zip archives, and uses this to get the info out of a .conda in a maximum of 3 remote requests, but for that it doesn't matter where the info is inside the zip.
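For readers wondering how a bounded number of range requests can work regardless of member order, here is a hedged sketch. It is not conda-package-streaming's code: `FakeRemote`, `read_range`, and `fetch_info` are hypothetical helpers, the "server" is just a bytes object, and zip64 and archive comments are ignored. It fetches a guessed tail, then the central directory, then the info member, skipping any request the tail already covers.

```python
import io
import struct
import zipfile

EOCD = struct.Struct("<4s4H2IH")         # end-of-central-directory record (22 bytes)
CD_ENTRY = struct.Struct("<4s6H3I5H2I")  # fixed part of a central directory entry (46 bytes)


class FakeRemote:
    """Hypothetical stand-in for a server that honours HTTP Range requests."""

    def __init__(self, data):
        self.data = data
        self.requests = 0

    def fetch(self, start, end):
        self.requests += 1
        return self.data[start:end]


def read_range(remote, cache_start, cache, start, end):
    """Serve [start, end) from the cached tail when possible, else from the remote."""
    if start >= cache_start:
        return cache[start - cache_start:end - cache_start]
    return remote.fetch(start, end)


def fetch_info(remote, total_size, tail_guess):
    # Request 1: fetch the tail and locate the end-of-central-directory record.
    tail_start = max(0, total_size - tail_guess)
    tail = remote.fetch(tail_start, total_size)
    eocd_at = tail.rfind(b"PK\x05\x06")
    if eocd_at < 0:
        raise ValueError("end-of-central-directory record not in the fetched tail")
    *_, cd_size, cd_offset, _comment_len = EOCD.unpack_from(tail, eocd_at)

    # Request 2 (skipped if the tail already covers it): the central directory.
    cd = read_range(remote, tail_start, tail, cd_offset, cd_offset + cd_size)

    # Walk the central directory to find the info member.
    pos = 0
    while pos < len(cd):
        f = CD_ENTRY.unpack_from(cd, pos)
        csize, name_len, extra_len, comment_len, header_offset = f[8], f[10], f[11], f[12], f[16]
        name = cd[pos + CD_ENTRY.size:pos + CD_ENTRY.size + name_len].decode("utf-8")
        if name.startswith("info-"):
            break
        pos += CD_ENTRY.size + name_len + extra_len + comment_len
    else:
        raise KeyError("no info- member found")

    # Request 3 (skipped if the tail already covers it): local header plus payload.
    # 64 bytes of slack for the local extra field is an assumption of this sketch.
    end = min(total_size, header_offset + 30 + name_len + 64 + csize)
    blob = read_range(remote, tail_start, tail, header_offset, end)
    local_name_len, local_extra_len = struct.unpack_from("<2H", blob, 26)
    data_start = 30 + local_name_len + local_extra_len
    return name, blob[data_start:data_start + csize]


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("pkg-example-1.0-0.tar.zst", b"P" * 100_000)
    zf.writestr("info-example-1.0-0.tar.zst", b"I" * 2_000)
data = buf.getvalue()

small = FakeRemote(data)
name_small, info_small = fetch_info(small, len(data), tail_guess=64)
large = FakeRemote(data)
name_large, info_large = fetch_info(large, len(data), tail_guess=4096)
print(small.requests, large.requests)
```

A tail guess that is too small still works, at the cost of extra round trips; a generous guess collapses everything into one request when info sits near the end of the archive.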

It was done so that this transmute implementation https://github.com/conda/conda-package-streaming/blob/main/conda_package_streaming/transmute.py#L72 could buffer the usually-small info in memory while writing the pkg- directly to the zip archive.
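The buffering idea can be sketched roughly as follows. This is not the linked transmute code: `transmute_sketch` and `add_bytes` are hypothetical, the input is a plain list of (name, payload) pairs rather than a .tar.bz2, and zstd compression is left out so the example runs on the standard library alone. The pkg tarball is streamed straight into its zip member while info files accumulate in memory, which is why info ends up after pkg.

```python
import io
import tarfile
import zipfile


def add_bytes(tar, name, payload):
    """Append one regular file to an open tarfile."""
    member = tarfile.TarInfo(name)
    member.size = len(payload)
    tar.addfile(member, io.BytesIO(payload))


def transmute_sketch(files, out, stem="example-1.0-0"):
    """Split (name, payload) pairs into pkg and info tarballs inside a zip."""
    info_buf = io.BytesIO()  # the usually-small info tarball stays in memory
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as zf:
        with tarfile.open(fileobj=info_buf, mode="w") as info_tar:
            # Only one zip member can be open for writing at a time, so pkg
            # is written directly while info is held back.
            with zf.open(f"pkg-{stem}.tar", "w") as pkg_member:
                with tarfile.open(fileobj=pkg_member, mode="w|") as pkg_tar:
                    for name, payload in files:
                        target = info_tar if name.startswith("info/") else pkg_tar
                        add_bytes(target, name, payload)
        # The pkg member is closed; now the buffered info tarball can be written.
        zf.writestr(f"info-{stem}.tar", info_buf.getvalue())


out = io.BytesIO()
transmute_sketch(
    [
        ("info/index.json", b'{"name": "example"}'),
        ("lib/libexample.so", b"\x7fELF" + b"\0" * 5_000),
        ("info/paths.json", b"{}"),
    ],
    out,
)

with zipfile.ZipFile(io.BytesIO(out.getvalue())) as zf:
    order = zf.namelist()
print(order)
```

Doing it the other way round, with info first, would require either a second pass over the input or buffering the much larger pkg data.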

There are streaming zip implementations for Python that ignore the central directory, but not the excellent standard library zipfile.

The order doesn't matter for conda-package-handling's create because it asks for a complete list of info and pkg members ahead of time. https://github.com/conda/conda-package-handling/blob/main/src/conda_package_handling/conda_fmt.py