
`.jlap` JSON Patch Incremental Updates for repodata.json #20

Open dholth opened 2 years ago

dholth commented 2 years ago

Describes a system to save client bandwidth on conda install when fetching repodata.json updates, based on patch series from specific earlier versions of the file.

dholth commented 2 years ago

If we generate the patch file before it hits the CDN, e.g. in conda index, then it can't include the original repodata.json headers. The idea is that a paranoid client checks that repodata.json's Last-Modified or ETag is within a certain range compared to the patches, in case a mirror stops updating the .jlap file but does not delete it.

A few more design notes:

- The individual patch lines should have a size limit. If an individual patch is too large we can leave it out; in testing, a very large patch happened when a bug computed the patch between the empty JSON `{}` and the entire repodata.json. If we leave out a necessary patch, clients will download the whole .json again.
- Each patch has a "to" and a "from" hash. This is so the patch generator can be dumb. It also makes it possible to do more sophisticated path-finding between the revision we have and the latest revision. This might not help for repodata.json, which is mostly additive, but it might be able to do something for current_repodata.json, which adds and subtracts packages each time.
- If the .jlap is too large we could drop earlier lines based on hysteresis. For example, with a 3 MB limit we could cut the first 2 MB whenever the file exceeds 3 MB. The file would stay between 1 and 3 MB, but include some continuity when it was reduced in size.
- We could include a "next" URL, but the client would have to avoid following redirect loops or downloading so many patches that it might as well have downloaded a complete repodata.json.
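
A minimal sketch of how a client might walk such a patch series, assuming each parsed patch line is a dict with "from", "to", and "patch" keys as described above, and using the PyPI `jsonpatch` library (an illustration, not the conda implementation):

```python
import jsonpatch  # pip install jsonpatch


def apply_patch_series(repodata: dict, have: str, want: str, lines: list[dict]) -> dict:
    """Patch `repodata` from hash `have` to hash `want`, or raise to force a full download."""
    by_from = {line["from"]: line for line in lines}
    current = have
    while current != want:
        step = by_from.get(current)
        if step is None:
            # A needed patch was omitted (e.g. it exceeded the size limit);
            # the caller should fall back to fetching the whole repodata.json.
            raise LookupError(f"no patch leading from {current}")
        repodata = jsonpatch.apply_patch(repodata, step["patch"])
        current = step["to"]
    return repodata
```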

jezdez commented 1 year ago

@conda-incubator/steering

This pull request falls under the Enhancement Proposal Approval policy of the conda governance policy; please vote and/or comment on this PR.

It needs 60% of the Steering Council to vote yes to pass.

To vote, please leave a regular pull request review (see Files changed tab above > green Review changes button) with one of the following options:

Also, if you would like to suggest changes to the current proposal, please leave a comment below, use the regular review comments or push to this branch.

This vote will end on 2022-10-21.

ocefpaf commented 1 year ago

@dholth I think this is a great idea and I recall Phil Elson wanting to implement something like this way back in 2018 but never got to it. I'm definitely +1.

The only confusion I have is: should this be a CEP? It is a desired improvement, but does it have any operational impacts that warrant voting? I may be missing something (I read the doc on my mobile phone) but I did not see any reason for this to be a CEP. I'd love to see the PR that implements this merged, though.

dholth commented 1 year ago

I like the idea of being able to implement this in conda without worrying about CEPs. It does not change the format of repodata.json at all.

I was able to meet Wolf in July. The C++ JSON library they are using implements jsonpatch. Mamba still wants to use the same on-disk HTTP cache as conda, which is a question of paths. The CEP does mention other patching schemes, like the one RPM uses, which is generic over bytes instead of over JSON, but is only available as a C library and is not as optimized for HTTP round trips.

We will be able to get fantastic network performance in Python conda without powerloader.

beckermr commented 1 year ago

That's all great and I support it. Let's try it live in conda and with anaconda.org. We'll learn a lot.

I simply don't want to see us declaring standards on this for conda just yet.

jezdez commented 1 year ago

> @dholth I think this is a great idea and I recall Phil Elson wanting to implement something like this way back in 2018 but never got to it. I'm definitely +1.
>
> The only confusion I have is: should this be a CEP? It is a desired improvement, but does it have any operational impacts that warrant voting? I may be missing something (I read the doc on my mobile phone) but I did not see any reason for this to be a CEP. I'd love to see the PR that implements this merged, though.

IMO anything that requires cross-organizational collaboration should be captured in a CEP to force consensus making, and so that, for example, repo developers can refer back to a community standard rather than falling back to NIH.

If there is no consensus for this running vote to pass (which I only started after I'd repeatedly asked @dholth if it was ready to be voted on), that's a valid outcome as well. It just shows that we need more explicit coordination here. FWIW, CEPs are not done in vain; community coordination is best done explicitly through repeatable processes that prevent falling back into the lack of coordination of the past.

wolfv commented 1 year ago

I've started the implementation of the jlap format in C++ as well (https://github.com/mamba-org/mamba/pull/2029) – I don't think it's too difficult and I think we'd be happy to support it with mamba (post-1.0) as well.

dholth commented 1 year ago

There are multiple jlap generators under the Implementation link in the CEP, e.g. json2jlap.py.

In the conda branch, there is a separate (hash).state.json which stores the mtime, actual hash, and upstream hash of repodata.json, as well as headers from the last fetch, an offset for how far we are into the jlap, etc. I don't like this because I'm trying to implement "update two files transactionally" without sqlite, but it works. Of course mtime and etag go in the state file and not in repodata.json. In the branch, (hash) is 5 base32-encoded bytes instead of 4 base16-encoded bytes.
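
For illustration, the state file tracks roughly the following shape (these field names are placeholders, not the exact (hash).state.json schema):

```python
# Illustrative only -- field names are placeholders, not the exact
# (hash).state.json schema; the comment above lists what is actually tracked.
state = {
    "mtime": 1684340000.0,     # mtime of the cached repodata.json when the state was written
    "on_disk_hash": "...",     # hash of repodata.json as it exists on disk
    "upstream_hash": "...",    # hash of the logically equivalent upstream repodata.json
    "etag": "...",             # validators from the last HTTP fetch
    "last_modified": "...",
    "jlap_offset": 0,          # how far into the remote .jlap we have already consumed
}
```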

Conda should provide an API for retrieving repodata.json instead of multiple software looking inside its cache. We have this problem with Anaconda Navigator as well.

@wolfv I read your comment about having a state machine, and I feel it too. My approach is to handle the common cases but fall back to downloading the entire repodata.json, even if a more complicated implementation could avoid the full download.

wolfv commented 1 year ago

@dholth thanks for the answer. Should we spec out the metadata format and URL hash scheme as well? Should that be in this CEP or another?

I think an API to access these repodata from the Python side would be nice, indeed; however, micromamba also shares the cache and is pure C++, so it wouldn't be useful in that case.

We can also decide that we create different repodata caches for mamba & conda but I think the shared cache has worked fine in the past.

wolfv commented 1 year ago

Also, was the switch to 5 base32-encoded bytes instead of 4 base16-encoded bytes random or deliberate? Is there any other reasoning behind it, except to make it not collide with the previous hashes / format?

dholth commented 1 year ago

It's random and only to make it not collide. The filenames are the same length after encoding. Of course the on-disk format is not part of the CEP.

One thing I really wanted to experiment with was gzipping the cache. It's a little slower on a very fast developer laptop, but it would definitely be appreciated elsewhere (I've got a Linux machine running off an SD card somewhere).

I also wanted to have a per-user cache instead of per-conda-environment, and possibly keep only the per-conda-environment pickle cache.

wolfv commented 1 year ago

Would gzip buy much vs. just pickle?

I guess the downside of the pickle format is that json-patch won't work as straightforwardly (if there isn't a fast way to restore the original JSON from it).

I've also thought about storing jlap-availability in the additional metadata.

Not sure I like "versioning" repodata storage by randomly changing hashes. We could do something more deliberate, like <hash>.v2.json or something.

dholth commented 1 year ago

The pickle cache holds conda objects like SubdirData, which do not translate back into the exact server-sent JSON.

It would be interesting to pickle the unmodified json. Of course that would not do much for micromamba.

mariusvniekerk commented 1 year ago

Please use faster compression formats if you're using any of them. Gzip and bzip2 are laughably slow at decompression compared to zstd. Our files are big enough that decompression speed matters. -1 for pickles, as those very tightly couple us to Python.

dholth commented 1 year ago

@mariusvniekerk we will benchmark the whole thing, and we have zstandard available because of .conda https://dholth.github.io/conda-benchmarks/#conda_install.TimeInstall.time_explicit_install?conda-package-handling=1.9.0&conda-package-handling=2.0.0a2&p-threads=1&p-threads=5

jezdez commented 1 year ago

Voting Results

This was a standard, non-timed-out vote.

This vote has NOT reached quorum (at least 60% of 16).

It has also NOT passed since it recorded 0 "yes" votes and 1 "no".

dholth commented 1 year ago

FYI I talked to an internal team that also needs the old cache filenames, will update to keep the filenames the same.

dholth commented 1 year ago

This feature is available in conda 23.3.1 with the `--experimental=jlap` flag. It is most impressive after the third cache expiration: on an empty cache, we fetch repodata.json; the first jlap fetch downloads the entire remote jlap file; the third fetches only the end of the remote jlap file.

dholth commented 1 year ago

@wolfv @ocefpaf @beckermr @mariusvniekerk Maybe you are not all still voting members, but how do you feel about this CEP now that there is a complete implementation?

dholth commented 10 months ago

@baszalmstra @jaimergp ping

baszalmstra commented 10 months ago

@travishathaway implemented this CEP in rattler. There it is enabled by default, which means pixi uses it by default, and we have been using it as such for a while.

This may be due to the implementation in rattler, but we have not been particularly happy with it. As far as I can tell, the algorithm provides a tradeoff between bandwidth and compute: you transfer significantly less data, but computing the full repodata requires some CPU and (in our case) a lot of memory (on the order of GBs). People with low-memory (embedded) devices often run into out-of-memory issues because of this.

With zstd-compressed repodata files now the default, the download overhead has also become significantly smaller. But I also realize that this is not a scalable solution.

But! I have not yet actively profiled this, or really tried to improve the current situation. I will make some time for that in the next few weeks and report back here with some actual numbers.

I'm very curious about how other people experience this. @wolfv, @jaimergp have you tried using JLAP? It's not implemented in (micro)mamba, right?

@travishathaway Do you think there are more avenues we can explore to optimize the performance of the implementation in rattler?

dholth commented 10 months ago

On my machine, memray shows a peak of about 420 MB for `repodata = json.load(open("/Users/dholth/miniconda3/pkgs/cache/09cdf8bf.json"))`, which is the conda-forge noarch repodata.json.

Some users do have more compute than bandwidth.

dholth commented 10 months ago

In Python, it might be possible to have a sqlite-backed repodata.json that returns proxy objects for "packages" and "packages.conda", and apply a typical `"patch": [{"op": "add", "path": "/packages/pre_commit-3.4.0-hd3eb1b0_1.tar.bz2" ...` without loading the entire file. (I've timed a repodata.json-to-sqlite conversion that processes the full repodata.json each time, and it is too much slower.) Of course this would also make "expects cached json" programs unhappy.
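
A rough sketch of what the sqlite side could look like, where a whole-package "add" becomes a single upsert with no full-file parse; the table name and schema here are assumptions for illustration, not the prototype timed above:

```python
import json
import sqlite3


def connect(path: str) -> sqlite3.Connection:
    # Hypothetical one-row-per-package layout.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS repodata "
        "(grp TEXT, filename TEXT, record TEXT, PRIMARY KEY (grp, filename))"
    )
    return db


def apply_add_op(db: sqlite3.Connection, op: dict) -> None:
    """Apply one {"op": "add", "path": "/packages/<filename>", "value": {...}} patch element."""
    parts = op["path"].split("/", 2)  # ["", "packages" or "packages.conda", "<filename>"]
    db.execute(
        "INSERT OR REPLACE INTO repodata (grp, filename, record) VALUES (?, ?, ?)",
        (parts[1], parts[2], json.dumps(op["value"])),
    )
```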

travishathaway commented 10 months ago

@baszalmstra,

I thought about this for a bit and revisited the Rust implementation. Perhaps one thing that would help is removing the re-ordering of the document that is done in order for the checksums to match at the end. That's just a hunch though.

No matter what, we will always have to at least load the full repodata.json into memory in order to successfully apply the patches. I think that's just the sad reality we must face when choosing to use a JSON file as our database 😭.

Perhaps we could come up with a way to partition repodata into separate files, but that seems like it would only yield an unacceptably complex solution.

Please let us know what happens when you perform a proper profiling. I am interested to hear the results.

dholth commented 10 months ago

@travishathaway It's designed so that the checksum doesn't have to match at the end. We don't specify the server's JSON formatting. But we keep track of the on-disk hash versus the putative "logically equivalent to server-side repodata.json with this checksum" hash.

baszalmstra commented 10 months ago

> No matter what, we will always have to at least load the full repodata.json into memory in order to successfully apply the patches. I think that's just the sad reality we must face when choosing to use a JSON file as our database 😭.

Yeah, but I don't think that's strictly required. It's because we use JSON as the level of abstraction at which we patch the content. We don't have to parse and load the entire JSON to query information from it in rattler; we read the content sparsely.

I haven't thought about this enough yet, but we could also patch at the byte level, using zchunk for instance. The CEP also mentions zchunk. Has that been evaluated? I'm very curious about any issues encountered there.

dholth commented 10 months ago

If we look at https://gist.github.com/dholth/cc76ce07f1c6ff099f440fc901bea35b, which are the 'op' and 'path' properties of individual json patch elements, it shows that almost all of the patch operations in the current https://conda.anaconda.org/conda-forge/linux-64/repodata.jlap look like ('add', '/packages.conda/pymeep-1.27.0-mpi_mpich_py310h1234567_2.conda'), and operate on a single package at a time.

In Python, it would be easy to author a lazy object that looked like repodata.json but loaded individual packages from storage by hooking the dictionary index [] operator on packages and packages.conda.

With knowledge of the patch format, it might be reasonable in a less-dynamic language to load objects (individual package records) based on the first part of the patch path, then send the rest of the path to the json patch library.
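
A hedged sketch of that routing; `load_record` and `store_record` are hypothetical storage hooks rather than conda or rattler APIs, and `jsonpatch` is the PyPI library of that name:

```python
import jsonpatch  # pip install jsonpatch


def route_op(op: dict, load_record, store_record) -> None:
    # "/packages/foo-1.0-0.tar.bz2/depends/0" ->
    #   group="packages", filename="foo-1.0-0.tar.bz2", rest="/depends/0"
    parts = op["path"].split("/", 3)
    group, filename = parts[1], parts[2]
    if len(parts) == 3:
        # The op targets a whole package record; no existing data needs parsing.
        value = op["value"] if op["op"] in ("add", "replace") else None
        store_record(group, filename, value)  # None as a hypothetical delete marker
        return
    # Otherwise load just that one record, re-root the op at the record,
    # and let the json patch library handle the remainder of the path.
    record = load_record(group, filename)
    patched = jsonpatch.apply_patch(record, [{**op, "path": "/" + parts[3]}])
    store_record(group, filename, patched)
```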

zchunk is neat. It's hard to implement in pure Python. Running it against the possibly-formatted repodata.json and preserving whitespace is a waste. It would make more individual requests to the server to get the necessary patch information.

One of the first prototypes was in Rust, but I found that the pypy version was a bit faster on diffs. IIRC Python's json.loads() is also faster than serde to a generic serde_json::Value. When applying the patches, parsing and serializing JSON take all the time; actually applying the patches is very fast because they operate at the logical JSON level.

dholth commented 9 months ago

I've added a two-file scheme to this CEP. Instead of trying to update repodata.json each time, we accumulate patches in a simple overlay. It will be a while before (re-)writing those additions approaches the time needed to reserialize a 220 MB conda-forge/linux-64/repodata.json. There is an implementation in conda+libmamba that shows writing and reading patches. It eliminates about half of the time needed (the time spent calling json.dumps() on repodata.json). It will eliminate all repodata.json parsing on the Python side for patches that only add complete new packages. A bit more work is needed to integrate conda with the new two-file API for the solver.
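
A rough sketch of the write side of such a scheme, assuming the overlay is itself a small JSON file keyed like repodata.json; file names and layout here are illustrative, not the conda+libmamba format:

```python
import json


def apply_adds_to_overlay(overlay_path: str, patch_ops: list[dict]) -> None:
    """Accumulate whole-package "add" ops in the overlay; the large base file is untouched."""
    try:
        with open(overlay_path) as f:
            overlay = json.load(f)
    except FileNotFoundError:
        overlay = {"packages": {}, "packages.conda": {}}
    for op in patch_ops:
        parts = op["path"].split("/", 2)  # ["", "<group>", "<filename>"] for whole-package adds
        if op["op"] == "add" and len(parts) == 3 and parts[1] in overlay and "/" not in parts[2]:
            overlay[parts[1]][parts[2]] = op["value"]
    with open(overlay_path, "w") as f:
        json.dump(overlay, f)  # only the small overlay is ever rewritten
```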