Closed BrunoBonacci closed 2 years ago
Hi @BrunoBonacci:
I can't explain how search results are ranked, but I agree that the current ranking is a problem. The ranking is all defined in https://github.com/clojars/clojars-web/blob/main/src/clojars/search.clj, and all of it predates my time working on Clojars. I would love to improve this, but haven't taken the time to dig into it and learn how to tune it. If anyone in the community has experience with this and interest in improving it, I would be happy to help with that. We have a couple of other related issues: #719 #721.
Thanks for your reply.
If it is using Lucene, then I think I understand the reason. Lucene computes relevance from how important the term is within the overall document (TF-IDF similarity).
clucy uses a virtual `_content` field to collate all the fields, so JARs with shorter descriptions rank higher, because the matching term makes up a larger share of the whole document.
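To illustrate the length effect, here is a toy sketch in Python. It is not Lucene's scoring code: it only mimics the general shape of classic TF-IDF (a term-frequency factor multiplied by a length normalization of 1/sqrt(number of tokens)), and both sample documents are made up for the example. The IDF factor is left out because it is the same for every document when the query has a single term.

```python
import math

def toy_score(term, doc_tokens, length_norm=True):
    """Toy TF-IDF-style score for one term (illustration only, not Lucene)."""
    # Term-frequency factor: grows with the number of occurrences.
    tf = math.sqrt(doc_tokens.count(term))
    # Length normalization: shorter documents get a larger factor.
    norm = 1.0 / math.sqrt(len(doc_tokens)) if length_norm else 1.0
    return tf * norm

# Hypothetical collated contents of two jars (invented text).
short_doc = "cascalog incanter".split()
long_doc = ("incanter is a clojure based statistical computing "
            "and graphics platform like incanter").split()

with_norm = (toy_score("incanter", short_doc), toy_score("incanter", long_doc))
without_norm = (toy_score("incanter", short_doc, length_norm=False),
                toy_score("incanter", long_doc, length_norm=False))
print(with_norm, without_norm)
```

With normalization the two-token document wins even though the longer one mentions the term twice; with normalization switched off the order flips.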
However, this should be easy to fix (at least in theory). Older Lucene versions allowed boosting fields/attributes both at indexing time and at query time. With newer versions, it is only possible to boost attributes at query time (see query boost examples).
In other words, there is a way to tell Lucene to consider the `jar-name` and `group-name` more relevant/important than the general `description` or the `_content` field.
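As a rough sketch of what query-time boosting does, the Python below weights a match by the field it occurs in. It is a conceptual model only, not Lucene code (Lucene's classic query syntax writes a boost with `^`, as in `incanter^4`). The boost values and the sample descriptions are assumptions chosen for the example; only the field names come from the discussion above.

```python
# Assumed boost factors: a match in the jar name counts most.
BOOSTS = {"jar-name": 4.0, "group-name": 2.0, "description": 1.0}

def boosted_score(term, doc, boosts=BOOSTS):
    """Sum the boost of every field that contains the term as a whole token."""
    score = 0.0
    for field, boost in boosts.items():
        tokens = doc.get(field, "").lower().split()
        if term in tokens:
            score += boost
    return score

# Hypothetical index entries (descriptions are invented).
docs = [
    {"jar-name": "cascalog-incanter", "group-name": "cascalog-incanter",
     "description": "incanter integration for cascalog"},
    {"jar-name": "incanter", "group-name": "incanter",
     "description": "a statistical computing library"},
]

ranked = sorted(docs, key=lambda d: boosted_score("incanter", d), reverse=True)
print([d["jar-name"] for d in ranked])
```

Under these assumed weights the exact name match outranks the jar that only mentions the term in its description.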
That's easier said than done. I haven't worked with Lucene in more than a decade and don't recall all the details of how to do it. I will need to investigate further.
@timothypratley I know you had taken a look at this issue - did you ever get anywhere with it? If not, I can pick it up.
I did start a branch 3 months ago: https://github.com/timothypratley/clojars-web/tree/tim/fix-search but I never really got anywhere with it sadly :(
No worries @timothypratley! I've upgraded Clojars to use Lucene 8.11 directly instead of Clucy (https://github.com/clojars/clojars-web/pull/817), and that is now in production. Just doing that has improved the search results a little bit, but there is still more work to do. I still need to adjust the similarity and change how we split on hyphens.
I've rewritten search and adjusted the similarity to not care about document length. I've also changed indexing to preserve hyphenated words as tokens along with the split tokens (so `clj-time` is tokenized as `clj-time`, `clj`, and `time`). The search results look much more useful now, so I'm closing this.
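The tokenization described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the Clojars analyzer; the `tokenize` function is a hypothetical helper, and lowercasing and whitespace splitting are assumptions.

```python
def tokenize(text):
    """Keep each hyphenated word as a token and also emit its parts."""
    tokens = []
    for word in text.lower().split():
        tokens.append(word)  # the whole word, e.g. "clj-time"
        if "-" in word:
            # Also index the pieces so a search for "time" still matches.
            tokens.extend(part for part in word.split("-") if part)
    return tokens

print(tokenize("clj-time"))
```

Keeping the full hyphenated token lets an exact query such as `clj-time` match as a single term, while the split tokens preserve partial matches.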
nice one! :) :+1:
Hi,
could you please explain how search results are ranked in the Clojars Search API? I often find that exact matches are ranked lower than other variants.
Looking at the following documentation page: https://github.com/clojars/clojars-web/wiki/Data#json-search-results
The sample search provided is: https://clojars.org/search?q=incanter&format=json
However, when fetching the results, `cascalog-incanter/cascalog-incanter` appears as the first result while `incanter/incanter` is somewhere towards the bottom. I would expect the latter to be the top-ranked jar as it is an exact match.
Could you please describe how results are ranked (based on which criteria)?
regards Bruno