Open knbr13 opened 8 months ago
Great issue, thank you! The search function is very naive and needs a major overhaul. For now I recommend you use the more reliable https://github.com/Link-/starred_search
We need to improve the search functionality not by adding more conditions to the function, but by using something more robust. The other extension I referenced uses https://github.com/lucaong/minisearch which is super good and does very lightweight and fast indexing. It also has robust implementations of different search strategies.
If you can find a Go package that offers the same capabilities, let's explore using that instead of reinventing the wheel here.
Okay, I'll check out the suggested alternative.
I do have a quick question regarding the search functionality.
Why not consider using the `strings.Contains` function for matching the search value?
For example:
`strings.Contains(repo.Name, "go")` to find matches like ["go-github", "go-cleanhttp", "go-retryablehttp"]
`strings.Contains(repo.Name, "http")` ==> ["go-cleanhttp", "go-retryablehttp", "httprate"]
`strings.Contains(repo.Name, "git")` ==> ["go-github"]
The same works with repo topics and description. It seems more direct and addresses the issue of unintended matches with minimal complexity. For this tool, I think it's more than enough. What are your thoughts on this?
@knbr13 - works for me, wanna create a PR so that we can test it out?
yeah for sure, I'll update the code, update the tests, then create a PR.
Hello Link-,
Happy New Year! I hope you are well. I have been enjoying using this useful tool, but after multiple uses, I have identified an opportunity for improvement. I will first illustrate the problem through an example.
Example:
My GitHub username is "knbr13", and the value provided for the `--find` flag is "git". The command I used in dev mode is as follows:

Currently, I have 47 starred repos on my profile, which is why I set the limit to 50 to see all matched repos.
The results show 45 repos, but out of all my starred repos, only 2 are related to Git. Therefore, the expected output is 2 repos, while the actual result is 45 repos.

Problem:
When the user searches for a short value (as in the case above, "git", with 3 letters), there's a high chance of matching a lot of repos. This is because the code uses the `fuzzy.LevenshteinDistance` function for string comparison, and the distance between two distinct short strings is small, so it usually falls within the match range `rank >= 0 && rank <= MAX_FUZZY_DISTANCE`.
For example:
`fuzzy.LevenshteinDistance("git", "is")`
The return value (difference) is 2, so it is considered a match. The word "is" appears in a lot of GitHub repo descriptions. This is just one example, and there are many other similar cases.

The use of the priority queue partially solved this problem.
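The distances quoted above can be checked with a small standalone sketch. It uses a hand-written byte-wise Levenshtein function rather than the `fuzzy` package so that it runs without dependencies. The `maxFuzzyDistance` value of 2 is an assumption, based only on the statement that a distance of 2 counts as a match.

```go
package main

import "fmt"

// maxFuzzyDistance is a hypothetical threshold; the real
// MAX_FUZZY_DISTANCE value is not shown in this thread.
const maxFuzzyDistance = 2

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the byte-wise edit distance between a and b,
// using the classic two-row dynamic programming formulation.
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution (or match when cost is 0)
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(b)]
}

func main() {
	for _, word := range []string{"is", "go", "it"} {
		d := levenshtein("git", word)
		fmt.Printf("git vs %q: distance=%d match=%v\n", word, d, d <= maxFuzzyDistance)
	}
}
```

With a threshold of 2 and a 3-letter needle, unrelated short words such as "is" pass the check, which is the behavior described above.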
Why Partially?
Even though repos matching by repo name have higher priority, some repos match by name only because the name includes '-' or '_', even though the full repo name is significantly different from the searched value.
Example:
One of my starred repos is "go-cleanhttp" by hashicorp. The `strings.FieldsFunc` function splits the repo name into 2 strings, ["go", "cleanhttp"].
While comparing repo names (in the loops), this function will be called:
`fuzzy.LevenshteinDistance("git", "go")` // "git" is the needle, "go" is the word
The return value (difference) is 2, which satisfies the match condition, so the priority of this item in the priority queue is 1000, since it matched by repo name. But "git" and "go-cleanhttp" are very different: a user who searches for "git" is certainly not expecting something like "go-cleanhttp".

Note
I have some ideas to potentially solve this issue. However, I would first like to confirm whether you are interested in addressing this problem. If so, I'm willing to contribute and create a pull request once a solution is developed.