conda / ceps

Conda Enhancement Proposals

Propose JSON Patch based incremental repodata update format #18

Closed dholth closed 2 years ago

dholth commented 2 years ago

Thanks. Do you know if there is a way to get a history of conda-forge repodata?

beckermr commented 2 years ago

No, I do not, but it should be easy enough to download it at a regular interval and then start testing.

beckermr commented 2 years ago

I had a few other comments/ideas here.

  1. It'd be nice to unify or at least state how the patching format here will work with the repodata patching already done on defaults and conda-forge.

  2. It'd be nice to ensure that this format plays well with shared repodata stores. For example, we could have a repodata shard be a single package in this format that has no to or from hashes and specify that it applies to any repodata file. We'd want to test the performance of adding up a bunch of shards this way if we did such a thing.

dholth commented 2 years ago

conda's repodata patching happens upstream of this. Will have to check its patch format.

The nice thing about a generic patch is that it doesn't have to know anything about conda; all you need are two repodata.json files to compare, so you could make the patch on a mirror, for example. The idea behind the individual to / from hashes on each patch is that you might also stream them and want to check the chain.

The patched document does hash equal if you normalize with the same json.dumps(sort_keys=True, indent=x) used by conda, but it seems like a lame way to normalize json.
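For illustration, a minimal sketch of that normalization check (hypothetical file names; assumes indent=2 is the value conda actually uses):

```python
import hashlib
import json

def canonical_sha256(path, indent=2):
    # Re-serialize with the same sort_keys/indent settings conda uses so the
    # bytes (and therefore the hash) are reproducible.
    doc = json.load(open(path))
    data = json.dumps(doc, sort_keys=True, indent=indent).encode()
    return hashlib.sha256(data).hexdigest()

# Hypothetical local files: the patched document and a freshly downloaded copy.
print(canonical_sha256("repodata.patched.json") == canonical_sha256("repodata.upstream.json"))
```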

Here's a repodata shard https://github.com/conda-forge/repodata-shards/blob/main/shards/linux-64/0/0/0/adios-python-1.13.1-py36hdd001d2_1006.tar.bz2.json and here is the commit history https://github.com/conda-forge/repodata-shards/commits/main

I suspect you would reach for a less generic solution for aggregating repodata shards into complete repodata.json. The speed would probably be fine, but you can get errors when the "from" document doesn't look like what a JSON Patch expects, and you're repeating the path from the root of the document to the packages dict a lot.

This idea will be fast enough if downloading, decompressing and applying the patch set takes less time than downloading a complete repodata.json minus the size of the patches.

beckermr commented 2 years ago

Hmm ok. I am not fully following, but having multiple repodata patching formats floating around that all do about the same thing seems odd to me.

FWIW, I wrote all of the code that populates the conda-forge repodata-shards repo and combines the shards into full repodata. I am pretty interested in getting that code out of its very specialized state and into something that is maintained properly.

beckermr commented 2 years ago

> you can get errors when the "from" document doesn't look like what a JSON Patch expects, and you're repeating the path from the root of the document to the packages dict a lot.

What do you mean by this?

dholth commented 2 years ago

An example patch from the spec that would fail. https://datatracker.ietf.org/doc/html/rfc6902#appendix-A.12

Each part of a json patch is an operation, a path like "a/b/c" to navigate {}["a"]["b"]["c"], and a value.

Reading the spec it seems you might be able to make pretty robust ad-hoc patches for our case, when we are mostly adding to the existing "packages" and "packages.conda" dicts.
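As a hypothetical illustration (using the python-json-patch library that comes up later in this thread), note how each operation repeats the full path from the document root; the package filename here is made up:

```python
import jsonpatch

repodata = {"info": {"subdir": "linux-64"}, "packages": {}, "packages.conda": {}}

# Each operation is an op, a JSON Pointer path from the root, and a value.
patch = [
    {
        "op": "add",
        "path": "/packages/example-1.0-py_0.tar.bz2",  # hypothetical package
        "value": {"name": "example", "version": "1.0", "depends": []},
    }
]

updated = jsonpatch.apply_patch(repodata, patch)
print("example-1.0-py_0.tar.bz2" in updated["packages"])  # True
```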

Here's a conda hotfix, unfortunately it's just some Python code https://github.com/AnacondaRecipes/repodata-hotfixes/blob/master/main.py; another https://github.com/conda-forge/conda-forge-repodata-patches-feedstock/blob/main/recipe/gen_patch_json.py

Not generic at all https://github.com/conda-forge/conda-forge-repodata-patches-feedstock/blob/main/recipe/gen_patch_json.py#L307

beckermr commented 2 years ago

The conda hot fixing code uses code in conda-build to merge dictionaries AFAIK.

Ahhh, this spec won't apply patches to keys that don't exist. I'd argue that we shouldn't use this specific format if it doesn't suit our needs.

I also wrote a fair bit of the current code that generates the conda patching diffs as well. :)

dholth commented 2 years ago

I get that this is also a patch format, but I would argue the application is totally different from the hotfix case: saving bandwidth by updating an exact document to a newer exact document.

beckermr commented 2 years ago

There's no functional difference between the two. Adding a package for the final repodata is a hotfix of a full package.

Separating the formats limits our future infrastructure. There's no good reason that we shouldn't stream hotfixing updates as well to further optimize downloads.

Otherwise, as we hotfix old packages, the older repodata entries will change and invalidate existing downloads/hashes more quickly, causing full repodata re-downloads.

We shouldn't adopt a limited spec like this only to have to work around it later for something else.

dholth commented 2 years ago

It seems you are more familiar with the hotfix process than I am. Perhaps something could be done with patches against individual package metadata before it is combined into repodata? But if the whole repodata.json is changed due to a new hotfix, the patch against it versus the previous version of repodata is probably similar in size to the new hotfix? Or are patch_instructions.json always carried out on the client?

You could also generate the described -patch.json against successive versions of this file https://repo.anaconda.com/pkgs/main/linux-64/patch_instructions.json or any other .json served.

There is a second JSON patch format called JSON Merge Patch. It is a little more forgiving, but less generic, because you can't have null values in the target (null removes object keys instead). It is probably feasible to prohibit null in conda repodata and replace nulls with empty strings or missing keys, but it would be another file format to update.
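For reference, RFC 7396 merge-patch application is simple enough to sketch by hand; the example at the end shows why null can't appear as a value in the target document:

```python
def merge_patch(target, patch):
    """Apply an RFC 7396 JSON Merge Patch: dicts merge recursively,
    None deletes keys, anything else replaces the value wholesale."""
    if not isinstance(patch, dict):
        return patch
    if not isinstance(target, dict):
        target = {}
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null means "remove this key"
        else:
            result[key] = merge_patch(result.get(key), value)
    return result

# {"a": 1} patched with {"a": None, "b": 2} -> {"b": 2}
print(merge_patch({"a": 1}, {"a": None, "b": 2}))
```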

dholth commented 2 years ago

Ran a few conda search -d -vvv <package> commands and didn't see any patch_instructions.json downloads.

beckermr commented 2 years ago

Yeah the patch instructions get applied during the cdn sync.

dholth commented 2 years ago

Does seem like a system that needs attention.

If that used this format, you could have a bunch of small patches and skip to the next patch if one produced an error. There's no treat-list-as-set feature though: no 'remove elements matching certain strings from a list' or 'ensure elements exist somewhere in a list' operation.

dholth commented 2 years ago

Wrote some testing code against conda-forge's 172MB linux-64 repodata.json https://github.com/dholth/jdiff/tree/tests/test ; on my "slow" Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz laptop it takes about 30 seconds to create each patch, and 5 seconds to apply all 32 patches.

beckermr commented 2 years ago

How much of that 5 seconds is one-time I/O versus a cost that scales with the number of patches?

dholth commented 2 years ago

By the way, I have tested this with https://python-json-patch.readthedocs.io/en/latest/index.html ; it was about half the speed of the Rust code.

The I/O is probably most of the time. Individual patches seem to apply quickly: an empty patch list ([]) took 4s, the test data's 31-32 patches took 5.8s.

beckermr commented 2 years ago

Hmmm. I think there are still some overheads around. How much time does it take to form the entire repodata json from only patches?

dholth commented 2 years ago

You would never want to do it that way, but you could make a patch from {} to try it.

beckermr commented 2 years ago

Let me explain more.

Conda-forge or any channel builds packages one at a time as you know.

Each package has its own repodata entry (which I call a shard).

Currently, repodata is built as follows.

  1. The entire set of packages has all of its repodata recomputed from scratch. This builds the "repodata from packages" JSON file.

  2. A set of hotfix patching instructions is applied to that to form the final repodata.

  3. The current repodata (current_repodata.json) is computed.

  4. The repodata and new package set are synced to the cdn.

Here is a much more efficient way to do this.

  1. For a new package, compute its repodata shard for the repodata from packages.

  2. If the hot fix instructions have changed, repatch everything. Otherwise, just patch the new shard to make the final repodata entry.

  3. If all of the entries changed, rebuild the final repodata from them from scratch. Otherwise just append the new entry.

  4. Post the new packages and new repodata.

The format and process you are proposing here are really only applicable to the case of a few new packages with unchanged repodata hotfixes.

The other thing is that building the full repodata set from a set of shards is very much a useful process (see above), and the speed at which that happens is the worst-case maximum latency of any repodata server for an arbitrary hotfix.

I've built code to do the more efficient process. It can serve new repodata for a single uploaded package with no new hotfix instructions in a few minutes. It rebuilds everything from the raw shards in on the order of 15 minutes or so.
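For context, a rough sketch of what "rebuild from the raw shards" could look like, assuming a local checkout of the shards repo and that each shard file holds the package's repodata entry directly (the real conda-forge tooling is more involved):

```python
import json
from pathlib import Path

repodata = {"info": {"subdir": "linux-64"}, "packages": {}, "packages.conda": {}}

# Each shard is one package's repodata entry stored as a small JSON file
# (see the conda-forge/repodata-shards repo linked above); paths are hypothetical.
for shard_path in Path("shards/linux-64").rglob("*.json"):
    shard = json.loads(shard_path.read_text())
    filename = shard_path.stem  # e.g. "foo-1.0-py_0.tar.bz2"
    key = "packages.conda" if filename.endswith(".conda") else "packages"
    repodata[key][filename] = shard  # assumes the shard body is the entry itself

Path("repodata_from_packages.json").write_text(
    json.dumps(repodata, sort_keys=True, indent=2)
)
```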

dholth commented 2 years ago

The patch code can run on the CDN. When it sees a new repodata.json it will automatically compare to the previous version and update the patch set. Then conda install benefits without affecting any other systems.
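A minimal sketch of that CDN-side step with python-json-patch (hypothetical file names; the to/from hashes use the normalization discussed earlier):

```python
import hashlib
import json
import jsonpatch

def canonical(doc):
    return json.dumps(doc, sort_keys=True, indent=2).encode()

previous = json.load(open("repodata.previous.json"))  # hypothetical snapshots
latest = json.load(open("repodata.json"))

entry = {
    "from": hashlib.sha256(canonical(previous)).hexdigest(),
    "to": hashlib.sha256(canonical(latest)).hexdigest(),
    "patch": jsonpatch.make_patch(previous, latest).patch,
}

# Append to the running patch set served next to repodata.json (hypothetical name).
with open("repodata.json.patch", "a") as f:
    f.write(json.dumps(entry) + "\n")
```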

beckermr commented 2 years ago

That doesn't actually solve anything. The whole idea here is to make repodata downloads faster. However, due to the process of building repodata and patching, right now new repodata entries from conda-forge are not available until about 1-2 hours after the package is first uploaded to anaconda.org.

So what you are proposing would optimize approximately the last few seconds of a full 1-2 hour process.

dholth commented 2 years ago

This is intended to benefit the many conda end users and make the publisher's job about a minute slower.

beckermr commented 2 years ago

Conda forge has the most downloads of anything on anaconda.org at this point.

dholth commented 2 years ago

If you are a conda-forge user who has to download repodata.json (or current-repodata.json) then you should be able to download only the changes from the last time you downloaded the whole thing until today.

beckermr commented 2 years ago

We're not communicating clearly over this thread, and that's a limit of the medium. For a conda-forge user, you'll only save a fraction of the delay from when you uploaded your package to when it is available at all.

Thanks for listening.

dholth commented 2 years ago

That problem is interesting, but this is meant to solve a different one. This feature is meant to shorten the time between conda install x and x being installed, for a regular user who does not produce packages. Or for a mirror/indexing service that wants to be continuously informed of the latest packages without having to re-download the entire repodata every time the Last-Modified timestamp changes.

I'd be interested to run the metadata generation pipeline some time to take a look.

dholth commented 2 years ago

Could there be a last-day channel that temporarily contains the most recent packages while the main channel is building?

wolfv commented 2 years ago

Hi @dholth I think it's great to see some activity on this issue.

We have been working with zchunk as part of our CZI proposal to improve mamba. zchunk does what JSON-patch does, but in a more generic fashion with some additional guarantees: https://github.com/zchunk/zchunk/

zchunk is already used in e.g. Fedora's dnf so it should be very reliable.

zchunk is also "simple" because it relies only on a single file and HTTP Range Requests. For a given file, the client will request the zchunk header (the first few bytes, which contain the hash sums of the chunks) and compare that with the chunks that are already downloaded and present on the system. The client will then issue Range Requests to download only the not-yet-present bytes and assemble the complete file on disk.
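As an illustration of the transport side only (not zchunk's own header parsing), a byte-range request in Python might look like this; the URL and the 4 KiB range are placeholders:

```python
import requests

url = "https://example.org/linux-64/repodata.json.zck"  # hypothetical location

# Ask the server for only the first 4 KiB, roughly where the zchunk header lives;
# a server that honors Range requests answers with 206 Partial Content.
resp = requests.get(url, headers={"Range": "bytes=0-4095"})
print(resp.status_code, len(resp.content))
```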

Each zchunk "chunk" is compressed using zstd and one can (and should) generate a global dictionary for maximum compression efficiency.

To use zchunk for repodata we would use the following sequence of commands:

# Only on the first initialization
zck repodata.json # -> creates repodata.zck
zck_gen_zdict repodata.json # -> creates repodata.zdict; we need to store this file long-term for future use

# On runs after the first one (to use the dictionary for better compression of common keys)
zck repodata.json -D repodata.zdict # -> creates a repodata.zck file

Then we can just upload the repodata.zdict file side-by-side with the repodata.json and it can be chunk-wise updated. The only requirement is that the server supports HTTP Range Requests which e.g. AWS S3 and many others do properly.

PS: You can try zchunk from conda-forge. We also added Windows support to it.

dholth commented 2 years ago

I did try zchunk, adding the last 128 versions of conda's main/linux-64/repodata.json and the seconds it took to create each patch to the tests branch. The dictionary function didn't work on my OSX machine because zstd is in /usr/local/bin and not in /usr/bin; some OS security feature prevents me from changing this? Nice to see the format is competitive.

(I don't have the same amount of history for conda-forge unfortunately)

 4.4M Apr  6 11:34 repodata.json.zck
 6.3K Apr  6 11:34 zdiffout.txt # times for zchunk to run
 3.5M Apr  6 11:34 repodata.json.zstd # zstd -19. the first repodata.json in the series.
  83K Apr  6 11:34 repodiff.json.zstd # json patches
 6.2K Apr  6 11:34 rdiffout.txt # times for this (r)ust program to run

wolfv commented 2 years ago

@dholth I actually fixed that issue with macOS in this PR: https://github.com/zchunk/zchunk/pull/66

Will need to get it merged ...

You could, as a hotfix, symlink zstd.

dholth commented 2 years ago

I'll run it in Docker if necessary

dholth commented 2 years ago

You might find these patches interesting, up at https://repodata.fly.dev/ (generated by https://github.com/dholth/repodata-fly). It periodically downloads any changed repodata.json and then updates the patch sets.

I like the basic idea of a series of json patches, but the file format should change to a mostly-append-only format where each element of the current "patches"=[] array is a line, with new patches appearing at the end of the file.

dholth commented 2 years ago

Try this demo. It compares the available patches against your local cache, applies them and checks whether the new file has the target hash. It doesn't write to the local cache but does create a requests-cache http_cache.sqlite https://github.com/dholth/repodata-fly/blob/main/app/update_conda_cache.py

Your local repodata must be one of the versions known by repodata.fly.dev, e.g. newer than 8 April.

Example output. https://gist.github.com/dholth/58c7641161e13b9b67d2e402d46c161f applies 49 patches against https://conda.anaconda.org/conda-forge/linux-64 and gets an exact hash match; uncompressed patch file is 110.31x smaller than uncompressed repodata.json
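Roughly, the client side of that demo boils down to something like the following sketch (simplified; it assumes the hypothetical line-per-patch layout from the CDN-side sketch earlier in the thread, not the actual update_conda_cache.py code):

```python
import hashlib
import json
import jsonpatch

def canonical_hash(doc):
    data = json.dumps(doc, sort_keys=True, indent=2).encode()
    return hashlib.sha256(data).hexdigest()

local = json.load(open("repodata.json"))                   # cached copy
lines = open("repodata.json.patch").read().splitlines()    # hypothetical patch set
entries = [json.loads(line) for line in lines]

have = canonical_hash(local)
for entry in entries:
    if entry["from"] == have:          # only apply from our version onward
        local = jsonpatch.apply_patch(local, entry["patch"])
        have = entry["to"]

print("up to date:", have == entries[-1]["to"])
```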

dholth commented 2 years ago

I may have misunderstood zchunk; it can help you no matter what the contents of your local repodata.json are? The main weakness of the patches method is that it only works if we know about the last version you have.

wolfv commented 2 years ago

For zchunk, you keep the zchunk file around (locally). The local zchunk file contains an index with all chunks (indexed by a SHA256 or similar checksum). The index is compared against the index on the server (by first only downloading the first X bytes of the zchunk file). Only chunks which are not present locally are then downloaded and inserted into the zchunk file, which is then extracted to the full repodata.json

Here is a good explanation: https://www.jdieter.net/posts/2018/05/31/what-is-zchunk/

dholth commented 2 years ago

We could consider sorting the packages, packages.conda keys by timestamp instead of by filename. Then the recently changed packages will tend to be in the same part of the file. Wouldn't help with deletes though.
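A small sketch of that reordering, assuming entries carry a timestamp field (older ones may not, hence the fallback):

```python
import json

repodata = json.load(open("repodata.json"))

for key in ("packages", "packages.conda"):
    entries = repodata.get(key, {})
    # Rebuild the dict ordered by upload timestamp so recently added
    # packages cluster near the end of the serialized file.
    repodata[key] = dict(
        sorted(entries.items(), key=lambda kv: kv[1].get("timestamp", 0))
    )

json.dump(repodata, open("repodata.sorted.json", "w"), indent=2)
```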

Here is a similar package for Rust + Linux https://crates.io/crates/bita

wolfv commented 2 years ago

@dholth yes, that was exactly part of the plan :)

dholth commented 2 years ago

Will revise and re-open