archlinuxfr / package-query

Query alpm database and AUR
69 stars 17 forks source link

Sort search results by relevance #70

Closed AndrewBelt closed 8 years ago

AndrewBelt commented 8 years ago

In the command package-query -SAs fish, results are sorted alphabetically by name, but this is sometimes undesirable, as the exact match fish is listed fifth in this list. A string metric could be used, like sorting the Levenshtein distance from the search term, although something a bit simpler could solve the problem as well. I believe this would improve the user experience of the yaourt package when searching for the exact names of packages.

f2404 commented 8 years ago

Hi @AndrewBelt you can use the --nameonly option when searching for the exact name. Will it solve the problem?

f2404 commented 8 years ago

UP: Another thing to try (together with --nameonly) could be using regex to limit the name variants - the following command produces a single result:

$ package-query -SAs ^fish$ --nameonly
community/fish 2.2.0-5
    Smart and user friendly shell intended mostly for interactive use
AndrewBelt commented 8 years ago

Yes, actually, the regex idea helps about half of the time when I'm installing an exact package by name. Sometimes though, I want to install a package like qt4 but I don't know the exact package name. I search qt since qt4 is very close by a string metric, but the qt4 package is hundreds of lines below.

I just noticed that neither the archlinux.org package search nor the aur.archlinux.org search handle search-by-relevance. (e.g. firefox is after arch-firefox-search and bluegriffon on a search for "firefox" although there is a handy exact match box.) Perhaps if people are interested, I could first try digging in the web code of the web searches themselves to try to add this feature.

f2404 commented 8 years ago

Actually, I think I like the idea of the relevance sorting in general. Maybe it's worth adding into package-query as one of the possible sort options. I will try to implement it in the code.

For the web search, it also would be nice to have it. For example, ubuntu packages search (http://packages.ubuntu.com) does seem to have such ability.

f2404 commented 8 years ago

However, I'm not sure that the Levenshtein distance would fit. Assuming we are searching "abc", the Levenshtein distance for "xyz" is 3, and for "abcdefg" it's 4, though it's obvious that "abcdefg" is closer to our target...

f2404 commented 8 years ago

Maybe, of combination of methods should be used: first, the longest common substring (LCS) length should be defined, second, if several results show the same LCS length, then the Levenshtein distance (LD) should be counted.

For the fish search, the results order should be the following:

f2404 commented 8 years ago

Hi, I've implemented this kind of relevance sort - please take a look at https://github.com/f2404/package-query/tree/70_rel_search

The results are below:

$ ./src/package-query -Ss fish --sort=r
community/fish 2.2.0-5
    Smart and user friendly shell intended mostly for interactive use
community/catfish 1.3.3-3
    Versatile file searching tool
extra/bluefish 2.2.7-1
    A powerful HTML editor for experienced web designers and programmers
community/libfishsound 1.0.0-1
    A simple programming interface that wraps Xiph.Org audio codecs
community/fillets-ng 1.0.1-6
    Port of the wonderful puzzle game Fish Fillets
community/python2-jellyfish 0.5.0-1
    A python library for doing approximate and phonetic matching of strings
community/perl-crypt-blowfish 2.14-3
    Perl/CPAN Module Crypt::Blowfish : XSbased implementation of Blowfish
community/fillets-ng-data 1.0.1-1
    Data files for the port of the wonderful puzzle game Fish Fillets
community/zsh-syntax-highlighting 0.4.0-1
    Fish shell like syntax highlighting for Zsh
AndrewBelt commented 8 years ago

Just tested this out, it works really well, with both -Ss and -As. In my opinion, this satisfies this feature request if it's merged, but if you're looking for one more thing to add, I'd suggested weighting results higher if they appear at the beginning of the package name.

Example: icewm should be higher than vice in my opinion. package-query -Ss ice | head

extra/vice 2.4-8
    The Versatile Commodore 8-bit Emulator
extra/icewm 1.3.8-4
    A Window Manager designed for speed, usability, and consistency
extra/spice 0.12.6-1
    SPICE client and server
extra/libice 1.0.9-1 [installed]
    X11 Inter-Client Exchange library
extra/goffice 0.10.24-1 [installed]
    A library of document-centric objects and utilities built on top of GLib and Gtk+

Also, shouldn't the longest common substring (LCS) always be the length of the search query, since you're sorting a list of packages with the entire search query in the name? In that case, you could fix your above problem by weighting by inverse length of the string instead of calling lcs_length. Just a thought.

f2404 commented 8 years ago

@AndrewBelt Thanks!

Regarding your comments:

icewm should be higher than vice in my opinion

I'm not sure about that... vice's distance to the search string is shorter than icewm's. Do you think the results starting with the search string should go first, even if their distance is much longer that other results' ones?

shouldn't the longest common substring (LCS) always be the length of the search query

No, since the search is performed not only within packages names, but also within their descriptions. When using the --nameonly option, however, your assumption will be correct. The thing is that the sorting itself is applied to the packages names, not their descriptions. This could result in a package with a non-relevant name being put very low even if its description would be perfectly relevant. I'm thinking that eventually this type of sorting should always be used together with --nameonly - because otherwise, the results could be very surprising...

f2404 commented 8 years ago

Closing the issue for now... the implementation is still open for your suggestions and improvements!