frewsxcv / rust-crates-index

Rust library for retrieving and interacting with the crates.io index
https://docs.rs/crates-index/
Apache License 2.0
72 stars 37 forks source link

NamePermutationIterator #143

Closed ToBinio closed 1 year ago

ToBinio commented 1 year ago

This PR adds the in #75 suggested NamesIter

open Questions:

note: I am not really happy with the code but every attempt of improving it just made it worse...

Byron commented 1 year ago

Great work, thank you!

I have adjusted the structure and streamlined the tests a little.

While playing with the tests I noticed that it prefers _ over -. The goal here would be to optimize the output of the iterator to have the highest possible probability of producing a name that actually exists. So if _ would be most common, that would be the way to go but what if they are not?

I happen to have a database available with all available 121k crate names. The tally of - and _ is as follows:

Thus - should be preferred when creating the names - even though going through 32k permutations currently takes 2ms, very little compared to the time that is burned when accessing the index. This makes finding good names fast paramount, so I inverted the logic to prefer -.

To understand better of what we are dealing with in terms of max-permutations, I counted the highest amount of - OR _ of all 121k crates, with the result being 8. Thus I reduced the total amount of possible permutations and made sure we fail gracefully.

Please find the original data here if you want to poke at it yourself (importing to SQLite while asking ChatGPT for queries works neatly): crate.csv.zip

Further, I think the first name produced should always be the input itself - I gave it a shot, please take a look. Maybe you have a smarter way of doing it - this way is quite acceptable performance wise as it makes the iterator 10% slower, and that's in the cards after making it nearly twice as fast.

should the iter return &str?

It's can't be an iterator anymore then, as lending iterators aren't supported with Rust's current borrow checker, unfortunately. Otherwise, definitely, but without that I think it's preferable to have it allocate for better usability.

The alternative would be to hand in &mut String, but I think it's not worth the reduced usability - 1ms for 32k permutations is good enough given that making that many queries to the index will cost many many seconds.

Byron commented 1 year ago

If you want, you could contribute an example for Names that shows how to use it to implement a fuzzy crate_ lookup while limiting the amount of lookups to some number, just to show how much control the caller now has.

ToBinio commented 1 year ago

So if _ would be most common, that would be the way to go but what if they are not?

I thought about that and then quickly forgot it again... and thanks for the interesting data

If you want, you could contribute an example for Names that shows how to use it to implement a fuzzy crate_ lookup while limiting the amount of lookups to some number, just to show how much control the caller now has.

do you mean simply adding a .take()?

another possible optimation could be to prefer all - and _, so it firstly returns the input then all _, and third all - and only then loop through all special combinations. this should be doable by going through the solutions backwards. starting at 0 then jumping back to max_count but this would once again play with the -_ preference

Byron commented 1 year ago

do you mean simply adding a .take()?

Yes, that would do, but that's just one of many ideas to indicate the flexibility that comes with it. Also to clarify, I meant that the contribution could either be a doc-test or maybe even an adjustment of one of the existing examples as long as there is user input.

another possible optimation could be to prefer all - and _, so it firstly returns the input then all _, and third all - and only then loop through all special combinations.

Definitely! That would be even better to have as these would be the most likely anyway. Ideally, we would then have the sequence input, all-hyphens, all_underscores, …all remaining permutations…, with built-in deduplication. I'd prefer a simple implementation over the most optimized one.


Thinking about it, you could also implement the count() method so that it computes the count in advance and avoids actual iteration. This should be documented so people know that they can easily implement their own limits prior to hitting them without paying for actually generating names.