conda / ceps

Conda Enhancement Proposals
Creative Commons Zero v1.0 Universal

CEP: sharded repodata #75

Open baszalmstra opened 2 months ago

baszalmstra commented 2 months ago

We propose a new solution to repodata downloading aimed at drastically reducing the time and memory required.

We implemented the use of this proposal in rattler and created a mirror for popular channels containing the new files that are updated at a fixed interval.

📝 Rendered

Preliminary results

Time to fetch the repodata records for a specific set of packages for linux-64 and noarch, measured on a machine with a 200 Mbit/s internet connection.

| requested packages        | records | fresh | cached | JLAP  | sharded (cold) | sharded (hot) |
| ------------------------- | ------- | ----- | ------ | ----- | -------------- | ------------- |
| python + boto3 + requests | 6969    | 2.4 s | 0.58 s | 8.2 s | 0.34 s         | 0.015 s       |
| jupyterlab + detectron2   | 35524   | 2.4 s | 0.63 s | 8.7 s | 0.7 s          | 0.05 s        |
| rubin-env                 | 84254   | 2.9 s | 0.80 s | 8.8 s | 1 s            | 0.13 s        |

Note that with this approach the cache is not all or nothing. So even if the shards index is updated, most likely a user will still have a relatively hot cache.
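To illustrate why the cache stays mostly hot, here is a minimal sketch. It assumes the shard index maps each package name to the hash of its shard and that cached shards are keyed by that hash. The function name `plan_downloads` and the short placeholder hashes are hypothetical, not rattler's actual API.

```python
def plan_downloads(index_shards, requested, cached_hashes):
    """Split the requested packages into cache hits and shards to fetch.

    index_shards:  dict of package name -> shard hash (assumed index layout)
    requested:     iterable of package names
    cached_hashes: set of shard hashes already present on disk
    """
    hits, misses = {}, {}
    for name in requested:
        digest = index_shards.get(name)
        if digest is None:
            continue  # the package does not exist in this subdir
        target = hits if digest in cached_hashes else misses
        target[name] = digest
    return hits, misses


# Placeholder hashes; real ones would be full hex digests.
old_index = {"python": "aaa", "boto3": "bbb", "requests": "ccc"}
cached = set(old_index.values())

# The index is refreshed, but only boto3 gained a new build upstream.
new_index = dict(old_index, boto3="ddd")

hits, misses = plan_downloads(new_index, ["python", "boto3", "requests"], cached)
```

Only the shard whose hash changed ends up in `misses`; the other two are still served from the local cache even though the index itself was replaced.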

wolfv commented 2 months ago

Note that in fairness we never got around to implementing the faster "2-file" JLAP variant. However, the sharded repodata is a lot "simpler" in terms of updates, so we believe that in rattler this will supersede the JLAP efforts.

wolfv commented 2 months ago

One thing that just came to mind is that we might want to add some language that says that implementations should ignore any unknown keys. That will enable us to add run_exports, purls, and other keys to the repodata record without breaking old versions of the tools.
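A minimal sketch of what "ignore unknown keys" could look like on the reading side. The `Record` class and its small field set are a hypothetical illustration, not the CEP's schema, and the sample values are made up.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Record:
    # A deliberately small subset of repodata record fields.
    name: str
    version: str
    build: str
    depends: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw):
        """Build a Record, silently dropping keys this version does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


# A record as a newer producer might emit it, with extra keys.
raw = {
    "name": "requests",
    "version": "2.32.3",
    "build": "pyhd8ed1ab_0",
    "depends": ["python >=3.8"],
    "purls": ["pkg:pypi/requests@2.32.3"],  # unknown to this reader
    "run_exports": {},                      # unknown to this reader
}
record = Record.from_dict(raw)
```

An older tool written this way keeps working when `purls` or `run_exports` appear, instead of failing on an unexpected keyword.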

wolfv commented 2 months ago

One other thing:

msgpack2json -d -b -i /Users/wolfv/Library/Caches/rattler/cache/repodata/shards-v1/7336fbc738626810350ff13c195012deb91b794840c81f59ffd8a74559102ec3.msg works nowadays for looking at the shards in JSON :)

jezdez commented 2 months ago

> Note that in fairness we never got around to implementing the faster "2-file" JLAP variant. However, the sharded repodata is a lot "simpler" in terms of updates, so we believe that in rattler this will supersede the JLAP efforts.

Thinking about this a little more, I think your statement that this proposal is "simpler" needs some elaboration, "simpler" in what areas? Do you mean "faster"?

In my experience, "simplicity" is easier said than done (read: is hard) and given that this introduces a whole new format, uses a different encoding type and seems to be focused on a non-standard hosting type (OCI, from reading your comments), I'm hoping you can provide context about this. It's.. a lot for one CEP, while trying to imagine a way to implement this in conda (which remains a goal with these CEPs after all).

I would appreciate hearing your thoughts on what a rollout would look like for existing conda hosting (e.g. conda-forge) and how current conda users would benefit from this.

wolfv commented 2 months ago

> Note that in fairness we never got around to implementing the faster "2-file" JLAP variant. However, the sharded repodata is a lot "simpler" in terms of updates, so we believe that in rattler this will supersede the JLAP efforts.

> Thinking about this a little more, I think your statement that this proposal is "simpler" needs some elaboration, "simpler" in what areas? Do you mean "faster"?

Simpler in that no "patching" is needed (you get the real repodata right away). There is also no "state" that needs to be kept - just update the index file and make sure that all the shards are there, which is really straightforward.

It should be faster, too, because, again, we don't need the whole index in the first place. So the download is only ~1 Mb (for linux + noarch). I can download all shards for jupyterlab in 350 ms.

> In my experience, "simplicity" is easier said than done (read: is hard) and given that this introduces a whole new format, uses a different encoding type and seems to be focused on a non-standard hosting type (OCI, from reading your comments), I'm hoping you can provide context about this. It's.. a lot for one CEP, while trying to imagine a way to implement this in conda (which remains a goal with these CEPs after all).

We're not at all focused on a non-standard hosting type - we just want to make sure that that works 100% fine. In fact, we host our "repodata-shards" on a traditional S3-style bucket, but a regular file server would work just as well. One just needs to add the repodata_shards.msgpack.zst file and all the shards under /shards/....

I don't think there is anything to worry about with regard to the implementation in conda. We are using msgpack-python from Python to create the index (the API closely mirrors json.loads/json.dumps).

> I would appreciate hearing your thoughts on what a rollout would look like for existing conda hosting (e.g. conda-forge) and how current conda users would benefit from this.

You could do it the same way as we do, and host an alternative index on a different subdomain for a testing period? Our GitHub Action ingests the existing repodata, creates shards, and stores the (new) shards in the bucket. It's all open source (in the linked repository). So you can try it out today! :)
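For readers who want a feel for the ingest step, here is a rough sketch under stated assumptions: one shard per package name, each shard named by the SHA-256 of its encoded bytes. It uses `json` from the standard library purely to stay dependency-free, whereas the real files are msgpack compressed with zstd. `make_shards` is a hypothetical helper, not the code in the linked repository.

```python
import hashlib
import json
from collections import defaultdict


def make_shards(repodata):
    """Split a repodata dict into per-package shards.

    Returns (index, blobs): index maps package name -> shard hash,
    blobs maps shard hash -> encoded shard bytes.
    """
    by_name = defaultdict(lambda: {"packages": {}, "packages.conda": {}})
    for section in ("packages", "packages.conda"):
        for filename, record in repodata.get(section, {}).items():
            by_name[record["name"]][section][filename] = record

    index, blobs = {}, {}
    for name, shard in by_name.items():
        # sort_keys makes the encoding deterministic, so an unchanged
        # shard keeps the same hash between runs.
        blob = json.dumps(shard, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(blob).hexdigest()
        index[name] = digest
        blobs[digest] = blob
    return index, blobs


# Illustrative input, not real channel data.
repodata = {
    "packages.conda": {
        "foo-1.0-0.conda": {"name": "foo", "version": "1.0", "build": "0"},
        "bar-2.0-0.conda": {"name": "bar", "version": "2.0", "build": "0"},
    }
}
index_before, _ = make_shards(repodata)

# Publishing a new build of foo changes only foo's shard hash.
repodata["packages.conda"]["foo-1.1-0.conda"] = {
    "name": "foo", "version": "1.1", "build": "0",
}
index_after, blobs_after = make_shards(repodata)
```

Because shards that did not change keep their hash, an update only needs to upload the new shards and then replace the index file.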

With the base_url in the repodata_shards.msgpack.zst file we link back to the "original" conda-forge, by the way (so that packages would be downloaded from the official conda.anaconda.org).
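A small sketch of how a client might resolve the two kinds of URLs. The assumptions here are that shards live in a `shards/` directory next to the index file and carry a `.msgpack.zst` suffix, and that `base_url` is an absolute URL. The helper names and the mirror hostname are hypothetical.

```python
from urllib.parse import urljoin


def shard_url(index_url, digest):
    """URL of a shard, resolved relative to the index file on the mirror."""
    return urljoin(index_url, f"shards/{digest}.msgpack.zst")


def package_url(base_url, filename):
    """URL of a package archive, resolved against base_url from the index."""
    if not base_url.endswith("/"):
        base_url += "/"  # urljoin would otherwise replace the last path segment
    return urljoin(base_url, filename)


index_url = "https://mirror.example/conda-forge/linux-64/repodata_shards.msgpack.zst"
base_url = "https://conda.anaconda.org/conda-forge/linux-64"

s_url = shard_url(index_url, "abc123")
p_url = package_url(base_url, "pkg-1.0-0.conda")
```

With this split, the mirror only has to serve the index and the shards, while the package archives themselves keep coming from the original channel.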

wolfv commented 2 months ago

I just tested the sharded repodata vs regular repodata on a slow internet connection. It took 15 s for noarch + osx-arm64 in the traditional case, and 900 ms for the sharded repodata. Very nice!

wolfv commented 2 months ago

I also wanted to point out that this is an infrastructure improvement that should dramatically reduce indexing times and thereby hopefully pave the way to an almost realtime experience when publishing new packages. Especially when we start using the daily and weekly files.

The changes will be much smaller. I hope you guys are also excited about that prospect :) @jezdez

wolfv commented 2 months ago

And if you want it really simple + join forces, you can use the Gateway class from py-rattler:

https://github.com/baszalmstra/rattler/blob/e570cc74e8bce76606e3246926af723ae68461d2/py-rattler/rattler/repo_data/gateway.py#L65-L81

xhochy commented 2 months ago

Real-life note: This will probably work mostly out-of-the-box with the existing conda support in Artifactory, except that the shard index would for now be cached indefinitely there (i.e. that is something JFrog would need to implement). It took them ~2 months with repodata.json.zst (and then add another 6 months for corporate IT updating their Artifactory version).

baszalmstra commented 1 month ago

@jezdez @xhochy @dholth @wolfv I updated the CEP to:

Please give it another read, I would like to put this to a vote soon.

jezdez commented 2 weeks ago

@conda/steering-council

This vote falls under the "Enhancement Proposal Approval" policy of the conda governance policy. Please vote and/or comment on this proposal at your earliest convenience.

It needs 60% of the Steering Council to vote yes to pass.

To vote, please leave yes, no or abstain as comments below.

If you have questions concerning the proposal, you may also leave a comment or code review.

This vote will end on 2024-07-16, End of Day, Anywhere on Earth (AoE). This is an extended voting period due to summer holiday time in the Northern Hemisphere.

baszalmstra commented 2 weeks ago

yes

wolfv commented 2 weeks ago

Please use the following form to vote!

@xhochy (Uwe Korn)

@cj-wright (Christopher J. 'CJ' Wright)

@mariusvniekerk (Marius van Niekerk)

@goanpeca (Gonzalo Peña-Castellanos)

@chenghlee (Cheng H. Lee)

@ocefpaf (Filipe Fernandes)

@marcelotrevisani (Marcelo Duarte Trevisani)

@msarahan (Michael Sarahan)

@mbargull (Marcel Bargull)

@jakirkham (John Kirkham)

@jezdez (Jannis Leidel)

@wolfv (Wolf Vollprecht)

@jaimergp (Jaime Rodríguez-Guerra)

@kkraus14 (Keith Kraus)

@baszalmstra (Bas Zalmstra)

jakirkham commented 6 days ago

Weird. Had checked the box above a few days ago and it appears to have disappeared. Rechecked it

wolfv commented 6 days ago

@jakirkham strange indeed. I can see your edit in the edit history, but it didn't seem to "tick" the box...

wolfv commented 6 days ago

@chenghlee @CJ-Wright @mbargull last chance to vote :)

mbargull commented 5 days ago

I am thankful that someone is working on the repodata implementation since the current one (one big blob per channel/subdir) became unwieldy a good while ago (at least in resource-constrained environments). As far as I can tell, this proposal is a good starting point but still needs refinement (which I expect to get clearer once more people test it out in rattler and conda-index, etc.). Unfortunately, I was not able to take the time to do a proper review and take part in the refinement process here, and as such opted to abstain.