RFC: Ranking algorithm for API searches

BrunoBonacci commented 3 years ago

Hi,

Could you please explain how search results are ranked in the Clojars Search API? I often find that exact matches are ranked lower than other variants.

Looking at the following documentation page: https://github.com/clojars/clojars-web/wiki/Data#json-search-results

The sample search provided is: https://clojars.org/search?q=incanter&format=json

However, when fetching the results, cascalog-incanter/cascalog-incanter appears as the first result, while incanter/incanter is somewhere towards the bottom. I would expect the latter to be the top-ranked jar, as it is an exact match.

Could you please describe how results are ranked (based on which criteria)?

Regards,
Bruno

tobias commented 3 years ago

Hi @BrunoBonacci:

I can't explain how search results are ranked, but I agree that the current ranking is a problem. The ranking is all defined in https://github.com/clojars/clojars-web/blob/main/src/clojars/search.clj, and all of it predates my time working on Clojars. I would love to improve this, but I haven't taken the time to dig into it and learn how to tune it. If anyone in the community has experience with this and an interest in improving it, I would be happy to help. We have a couple of other related issues: #719 #721.

BrunoBonacci commented 3 years ago

Thanks for your reply.

If it is using Lucene, then I think I understand the reason. Lucene computes relevance from the importance of the term within the overall document (TF-IDF similarity). clucy uses a virtual _content field to collate all the fields, so JARs with shorter descriptions rank higher because the matching term makes up a larger proportion of the whole document.
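
To illustrate the length effect described above, here is a minimal sketch of a TF-IDF-style score with a length norm. It is a simplified approximation of classic TF-IDF scoring, not the exact formula Clojars or clucy used, and the two documents are made-up stand-ins for the collated `_content` field.

```python
import math

def tfidf_score(term, doc_tokens, doc_freq, num_docs):
    """Simplified TF-IDF score with a 1/sqrt(length) norm (an approximation)."""
    tf = math.sqrt(doc_tokens.count(term))           # term frequency, dampened
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))  # rarity of the term
    length_norm = 1.0 / math.sqrt(len(doc_tokens))   # penalizes long documents
    return tf * idf * length_norm

# Hypothetical collated contents for two jars (not real Clojars data).
docs = {
    "cascalog-incanter": "cascalog incanter".split(),
    "incanter": ("incanter incanter is a clojure based r like statistical "
                 "computing and graphics environment for the jvm").split(),
}

term = "incanter"
doc_freq = sum(1 for tokens in docs.values() if term in tokens)
scores = {name: tfidf_score(term, tokens, doc_freq, len(docs))
          for name, tokens in docs.items()}

# Highest score first; the short document wins despite fewer matches.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy setup the two-token document outranks the exact match purely because of the length norm, which matches the behaviour reported at the top of this issue.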

However, this should be easy to fix (at least in theory). In older Lucene versions it was possible to boost fields/attributes both at indexing time and at query time. With newer versions it is only possible to boost attributes at query time (see query boost examples).

In other words, there is a way to tell Lucene to consider the jar-name and group-name more relevant/important than the general description or the _content field.
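
As a rough illustration of the idea (not Lucene code), query-time boosting amounts to weighting each field's match score before summing. The sketch below assumes three fields and boost weights that are purely hypothetical; in Lucene itself the same intent is usually expressed with the query parser's `term^boost` syntax.

```python
# Hypothetical boost weights: name fields count more than the description.
FIELD_BOOSTS = {"jar-name": 4.0, "group-name": 2.0, "description": 1.0}

def boosted_score(query, doc, boosts=FIELD_BOOSTS):
    """Sum per-field match counts, each multiplied by its field boost."""
    score = 0.0
    for field, boost in boosts.items():
        tokens = doc.get(field, "").lower().split()
        score += boost * tokens.count(query.lower())
    return score

# Made-up documents for illustration only.
jars = {
    "incanter/incanter": {
        "jar-name": "incanter",
        "group-name": "incanter",
        "description": "Incanter is a statistical computing environment",
    },
    "cascalog-incanter/cascalog-incanter": {
        "jar-name": "cascalog-incanter",
        "group-name": "cascalog-incanter",
        "description": "Incanter integration for Cascalog",
    },
}

boosted = {name: boosted_score("incanter", doc) for name, doc in jars.items()}
best = max(boosted, key=boosted.get)
print(best, boosted)
```

With these assumed weights, the exact match on the jar and group names dominates the description match, so incanter/incanter comes out on top.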

That's easier said than done. I haven't worked with Lucene in more than a decade and don't recall all the details of how to do it. I will need to investigate further.

tobias commented 2 years ago

@timothypratley I know you had taken a look at this issue - did you ever get anywhere with it? If not, I can pick it up.

timothypratley commented 2 years ago

I did start a branch 3 months ago: https://github.com/timothypratley/clojars-web/tree/tim/fix-search but I never really got anywhere with it sadly :(

tobias commented 2 years ago

No worries @timothypratley! I've upgraded Clojars to use Lucene 8.11 directly instead of Clucy (https://github.com/clojars/clojars-web/pull/817), and that is now in production. Just doing that has improved the search results a little bit, but there is still more work to do. I still need to adjust the similarity and change how we split on hyphens.

tobias commented 2 years ago

I've rewritten search and adjusted the similarity to not care about document length. I've also changed indexing to preserve hyphenated words as tokens along with the split tokens (so clj-time is tokenized as clj-time, clj, and time). The search results look much more useful now, so I'm closing this.
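
For readers following along, the tokenization change can be sketched roughly like this. This is a standalone Python illustration of the behaviour described above, not the actual Lucene analyzer used in Clojars; the `tokenize` helper and its regular expression are assumptions.

```python
import re

def tokenize(text):
    """Keep each hyphenated word as a token and also emit its parts."""
    tokens = []
    for word in re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()):
        tokens.append(word)                 # the whole word, e.g. "clj-time"
        if "-" in word:
            tokens.extend(word.split("-"))  # its parts, e.g. "clj", "time"
    return tokens

print(tokenize("clj-time"))
```

Indexing both forms means a search for the full hyphenated name can match it as a single token, while searches for either part still find the jar.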

timothypratley commented 2 years ago

nice one! :) :+1:

RFC: Ranking algorithm for API searches #806