frewsxcv / rust-crates-index

Rust library for retrieving and interacting with the crates.io index
https://docs.rs/crates-index/
Apache License 2.0

Scaling to millions of crates? #77

Closed: kornelski closed this issue 1 year ago

kornelski commented 2 years ago

crates.io is growing, and eventually it will be too slow and cumbersome to read the whole index into memory.

For now I've seen two types of uses of this crate:

  1. Read a few crates individually. This use-case is easy and will work fine for an index of any size.
  2. Read the whole index into memory (e.g. hashmap indexed by crate name) and process it in more complex ways, e.g. resolve (reverse) dependencies or extract some statistics. This will need to change.
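
To make the contrast in use-case 2 concrete, here is a minimal sketch using only the standard library. `CrateRecord`, `load_all`, and `reverse_dep_counts` are hypothetical names invented for illustration and are not part of this crate's API. It assumes the index can be consumed as an iterator of records, and shows that a statistic such as reverse-dependency counts can be built in one streaming pass that keeps only the counters in memory, rather than every record.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for one index entry (not this crate's real type).
struct CrateRecord {
    name: String,
    deps: Vec<String>,
}

/// Use-case 2 as described above: keep every record in a map keyed by crate name.
/// Memory grows with the full size of the index.
fn load_all(records: impl Iterator<Item = CrateRecord>) -> HashMap<String, CrateRecord> {
    records.map(|r| (r.name.clone(), r)).collect()
}

/// Streaming alternative: each record is dropped after it is processed,
/// so only one counter per dependency name stays in memory.
fn reverse_dep_counts(records: impl Iterator<Item = CrateRecord>) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for record in records {
        for dep in record.deps {
            *counts.entry(dep).or_insert(0) += 1;
        }
    }
    counts
}

/// Tiny made-up data set standing in for the real index.
fn sample() -> Vec<CrateRecord> {
    vec![
        CrateRecord { name: "a".into(), deps: vec!["serde".into()] },
        CrateRecord { name: "b".into(), deps: vec!["serde".into(), "a".into()] },
        CrateRecord { name: "serde".into(), deps: vec![] },
    ]
}

fn main() {
    let all = load_all(sample().into_iter());
    println!("records held in memory: {}", all.len());

    let counts = reverse_dep_counts(sample().into_iter());
    println!("crates depending on serde: {:?}", counts.get("serde"));
}
```

Full dependency resolution still needs random access to arbitrary crates, so a streaming pass like this only covers the statistics-style part of that use-case.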

How should this crate adapt to this? What are the use-cases for reading all/most of the crates?

pinkforest commented 2 years ago

I could help work on this. I've been wanting (and needing) an API with guarantees and exact-version lookup. I used the old API specifically to find all the duplicates across the index when I cleaned them up in January: https://github.com/frewsxcv/rust-crates-index/issues/73

Maybe this could be a new API in parallel to the existing one? I am not sure how long the git index will continue now that there is a new HTTP index, which may force the issue. A new breaking version could use the HTTP index, which is faster and provides specific, explicit lookups.

EDIT: I miscommunicated initially and have edited this comment, sorry :blush:
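
As an illustration of the "specific explicit lookups" mentioned above, the sketch below maps a crate name to its file path in the index, following the directory layout described in Cargo's registry index documentation. `index_path` and `sparse_url` are hypothetical helper names, not functions from this crate, and the base URL `https://index.crates.io/` is assumed to be the crates.io sparse index endpoint. No network request is made; the code only builds the URL a client would fetch for a single crate.

```rust
/// Relative path of a crate's file inside the index.
/// Layout: 1-char names under `1/`, 2-char under `2/`,
/// 3-char under `3/<first char>/`, longer names under `<first two>/<next two>/`.
/// Names are lowercased. Returns None for empty or non-ASCII input.
fn index_path(name: &str) -> Option<String> {
    if name.is_empty() || !name.is_ascii() {
        return None;
    }
    let lower = name.to_ascii_lowercase();
    Some(match lower.len() {
        1 => format!("1/{}", lower),
        2 => format!("2/{}", lower),
        3 => format!("3/{}/{}", &lower[..1], lower),
        _ => format!("{}/{}/{}", &lower[..2], &lower[2..4], lower),
    })
}

/// URL for one crate in the sparse (HTTP) index. The base URL is an assumption.
fn sparse_url(name: &str) -> Option<String> {
    index_path(name).map(|path| format!("https://index.crates.io/{}", path))
}

fn main() {
    for name in ["a", "syn", "serde"] {
        println!("{} -> {:?}", name, sparse_url(name));
    }
}
```

Because each crate maps to exactly one URL, a client can fetch only the crates it needs, which fits use-case 1 but offers no cheap way to enumerate every crate.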

kornelski commented 2 years ago

What's your proposal?

pinkforest commented 2 years ago

I'll have to map others' use-cases a bit more, and then maybe I can come up with something more concrete and less niche than my own to help avoid the reading-into-memory problem. I think some helper traits / impls that can be used in Iterable could be useful.

P.S. I've been hoping that crates.io would adopt something like GraphQL, which would enable this kind of thing the way the GitHub GraphQL API does.