Closed wolfv closed 1 year ago
Would be good to document this, yes.
I can't say why it was changed. But a good explanation for it is that the outer archive is a Zip file. Hence, the outer archive's index is at the end of the file. So, if you put the `info-*.tar.zst` at the end too, you can fetch the metadata with a single fetch (from disk or an HTTP server).
(In the case of the former `.tar.bz2` format, you'd want `info` at the beginning of the index-less tarball, of course.)
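To make the layout argument concrete, here is a minimal sketch using only the standard library. It builds a toy `.conda`-like zip in memory with the `info-` member written last, then measures how many trailing bytes hold both that member and the zip index. The helper names (`build_demo_conda`, `tail_size_for_info`), the member names, and the payloads (plain filler bytes, not real zstd data) are made up for illustration, and the `metadata.json`, `pkg-`, `info-` ordering is an assumption about the new layout. Note that the exact tail length is computed here from the complete archive, which a remote client would not have up front.

```python
import io
import zipfile


def build_demo_conda(info_payload: bytes, pkg_payload: bytes) -> bytes:
    """Build a toy outer zip with the info member written last (hypothetical names)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("metadata.json", '{"conda_pkg_format_version": 2}')
        zf.writestr("pkg-demo-1.0-0.tar.zst", pkg_payload)
        zf.writestr("info-demo-1.0-0.tar.zst", info_payload)
    return buf.getvalue()


def tail_size_for_info(archive: bytes) -> int:
    """Number of trailing bytes spanning the info member plus the central directory."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info_member = next(
            m for m in zf.infolist() if m.filename.startswith("info-")
        )
    # header_offset is where the member's local file header starts in the archive.
    return len(archive) - info_member.header_offset


archive = build_demo_conda(b"i" * 500, b"p" * 100_000)
tail = archive[-tail_size_for_info(archive):]
print(f"archive: {len(archive)} bytes, tail holding info + index: {len(tail)} bytes")
```

With the info member next to the index, one contiguous read of the tail is enough for the metadata, however large the `pkg-` member is.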
Hmm, although you don't know beforehand how large the `info.tar.zst` file is, right? You mean one would fetch N bytes and hope that it covers both the `zip-index` and `info.tar.zst` part?
Wouldn't it make much more sense to put it at the start? If I understand zip correctly, every file in the zip is preceded by a zip local file header. If we always put the `info` archive at the start of the zip, we could stream the contents of the entire file with a regular GET request, since the local file header contains all the information you need. There would be no need to inspect the zip's central directory at all, which would really simplify the handling. It would actually be similar to how the `tar.bz2` files are handled currently.
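As a rough illustration of that idea, the sketch below walks a zip front to back using only the 30-byte local file headers and never touches the central directory. It is not how any conda tool works; `iter_stored_members` and `build_info_first_zip` are hypothetical helpers, and the sketch assumes uncompressed (stored) members whose sizes are recorded in the local header. Archives written to a non-seekable stream put the sizes in a trailing data descriptor instead, in which case this approach does not work.

```python
import io
import struct
import zipfile

# Local file header layout: signature, version, flags, method, mtime, mdate,
# crc32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def iter_stored_members(stream):
    """Yield (name, payload) for each member, reading strictly forward."""
    while True:
        raw = stream.read(LOCAL_HEADER.size)
        if len(raw) < LOCAL_HEADER.size:
            return
        sig, _ver, flags, method, _t, _d, _crc, csize, _usize, nlen, xlen = (
            LOCAL_HEADER.unpack(raw)
        )
        if sig != b"PK\x03\x04":
            return  # no more local headers: the central directory starts here
        if flags & 0x08 or method != 0:
            raise ValueError("sketch only handles stored members with known sizes")
        name = stream.read(nlen).decode("utf-8")
        stream.read(xlen)  # skip the extra field
        yield name, stream.read(csize)


def build_info_first_zip():
    """Toy archive with the info member first (the layout proposed above)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("info-demo-1.0-0.tar.zst", b"fake info payload")
        zf.writestr("pkg-demo-1.0-0.tar.zst", b"fake pkg payload" * 1000)
    buf.seek(0)
    return buf


# Only the first member is consumed; the rest of the stream is never read.
name, payload = next(iter_stored_members(build_info_first_zip()))
print(name, len(payload))
```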
Having the central directory of the zip at the end really makes things hard.
Obviously too late now because .conda files are already widespread. 🤷
conda-package-streaming has good support for reading partial remote zip archives, and uses this to get the info out of a .conda in a maximum of 3 remote requests; it doesn't matter where the info is inside the zip.
It was done so that this transmute implementation https://github.com/conda/conda-package-streaming/blob/main/conda_package_streaming/transmute.py#L72 could buffer the usually-small info in memory while writing the pkg- directly to the zip archive.
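The buffering pattern can be sketched with the standard library alone. This is not the linked transmute code: `make_source_tar` and `transmute_sketch` are hypothetical names, plain uncompressed tar stands in for both the `.tar.bz2` input and the zstd-compressed inner archives, and the `metadata.json` content is an assumption. The sketch relies on the fact that `zipfile` allows only one member to be open for writing at a time, so while the `pkg-` member is being streamed, the `info` tar has to be collected somewhere else (here, in memory) and written afterwards.

```python
import io
import tarfile
import zipfile


def make_source_tar(files):
    """Build an in-memory tar standing in for the old single-tarball package."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            member = tarfile.TarInfo(name)
            member.size = len(data)
            tar.addfile(member, io.BytesIO(data))
    buf.seek(0)
    return buf


def transmute_sketch(source, dest, stem):
    """Stream pkg members straight into the zip; buffer info members in memory."""
    info_buf = io.BytesIO()
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("metadata.json", '{"conda_pkg_format_version": 2}')
        with tarfile.open(fileobj=source, mode="r|") as src:
            with tarfile.open(fileobj=info_buf, mode="w") as info_tar:
                # The pkg zip member stays open for writing during the whole pass.
                with zf.open(f"pkg-{stem}.tar", "w") as pkg_member:
                    with tarfile.open(fileobj=pkg_member, mode="w|") as pkg_tar:
                        for member in src:
                            data = src.extractfile(member) if member.isfile() else None
                            is_info = member.name.startswith("info/")
                            (info_tar if is_info else pkg_tar).addfile(member, data)
        # Only now can a second member be written, so info ends up last.
        zf.writestr(f"info-{stem}.tar", info_buf.getvalue())


source = make_source_tar(
    {"info/index.json": b'{"name": "demo"}', "lib/demo.py": b"print('hi')\n"}
)
dest = io.BytesIO()
transmute_sketch(source, dest, "demo-1.0-0")
print(zipfile.ZipFile(dest).namelist())
```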
There are streaming zip implementations for Python that ignore the central directory, but the otherwise excellent standard library zipfile is not one of them.
The order doesn't matter for conda-package-handling's create because it asks for a complete list of info and pkg members ahead of time. https://github.com/conda/conda-package-handling/blob/main/src/conda_package_handling/conda_fmt.py
I see that the recent release changed the order of `info` and `pkg` archives in the `.conda` format (it's mentioned in the Changelog as well). I tried to go through some PRs but couldn't find the reasoning for the change. Would be curious to hear why this was done :)