cljdoc / cljdoc

📚 A central documentation hub for the Clojure community
https://cljdoc.org
Eclipse Public License 2.0

Search results improvements #308

Closed holyjak closed 2 years ago

holyjak commented 5 years ago

Improve results produced by the new search introduced by #85.

Report problems and bad/suboptimal results here. (Check the Known Problems below first, please!)

Known Problems

martinklepsch commented 5 years ago

Not a big deal but one thing I noticed is that searching for metosin json doesn't return metosin/jsonista.

KingMob commented 5 years ago

I'm not sure if search is set up for non-Clojars sources, but I couldn't find tools.reader using either tools.reader or org.clojure/tools.reader.

holyjak commented 5 years ago

Thanks a lot, @KingMob! It will be fixed in a few minutes. We fetched only the first 10 instead of all 64 artifacts. (Fixed by #314.) @martinklepsch for the record, the problem of metosin json returning no results has been fixed by #310.

gphilipp commented 5 years ago

I'm searching for jackdaw, but the only result that pops up is some version built off a branch https://cljdoc.org/d/fundingcircle/jackdaw/0.6.7-AlexVPopov_patch_1-SNAPSHOT, instead of the latest release: https://cljdoc.org/d/fundingcircle/jackdaw/0.6.6/doc/readme

holyjak commented 4 years ago

I guess the problem is that 0.6.7* > 0.6.6 and the code doesn't really look at the versions to filter out "weird ones". I will look into it, eventually.
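To illustrate the version problem described above, here is a minimal sketch in Python (not cljdoc's actual code). It assumes a "plain release" is a purely numeric dotted version; `is_plain_release`, `version_key` and `preferred_version` are hypothetical names introduced only for this example.

```python
import re

# Assumption: a "plain release" is purely dot-separated numbers,
# e.g. "0.6.6" but not "0.6.7-AlexVPopov_patch_1-SNAPSHOT".
RELEASE_RE = re.compile(r"\d+(\.\d+)*")


def is_plain_release(version):
    return RELEASE_RE.fullmatch(version) is not None


def version_key(version):
    # Compare only the leading numeric part, as integers,
    # so that "0.10.0" sorts above "0.9.0".
    numeric = RELEASE_RE.match(version)
    if numeric is None:
        return ()
    return tuple(int(part) for part in numeric.group(0).split("."))


def preferred_version(versions):
    """Return the highest plain release; fall back to the highest version overall."""
    releases = [v for v in versions if is_plain_release(v)]
    candidates = releases or list(versions)
    return max(candidates, key=version_key)


print(preferred_version(["0.6.6", "0.6.7-AlexVPopov_patch_1-SNAPSHOT"]))
```

With the two jackdaw versions from the report, this picks 0.6.6 even though 0.6.7* compares higher numerically.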

escherize commented 4 years ago

Firstly, cljdoc is a really cool tool that I am fond of!

I think it should be a goal that when searching for e.g. lacinia, the "main library" (aka com.walmartlabs.lacinia 0.34.0) shows up first, but what I am seeing is:

Do you agree that this should be considered a bug?

holyjak commented 4 years ago

Hi @escherize, that would indeed be ideal. But the code can hardly guess which artifact is the "main library". What we want to do is take the download count into account, which should push the more popular artifacts up. I have a branch where I started working on this, but I struggle to configure it so that it actually improves the results. It is not trivial :-(
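One way to picture the download-count idea is the sketch below, in Python for illustration only. The logarithmic formula, the function names and the sample numbers are all assumptions made up for this example; they are not what the branch mentioned above does.

```python
import math


def boosted_score(text_score, downloads):
    # Hypothetical boost: grows with the logarithm of the download count,
    # so a very popular artifact gets a push without drowning out text relevance.
    return text_score * (1.0 + math.log10(1 + downloads))


def rank(results):
    """Sort (artifact, text_score, downloads) tuples by boosted score, best first."""
    return sorted(results, key=lambda r: boosted_score(r[1], r[2]), reverse=True)


# Invented sample data: a slightly better text match with few downloads
# versus a slightly worse text match with many downloads.
sample = [("example/small-lib", 1.0, 10), ("example/main-lib", 0.9, 100000)]
for artifact, text_score, downloads in rank(sample):
    print(artifact, round(boosted_score(text_score, downloads), 2))
```

Tuning the shape of the boost is exactly the hard part: too weak and nothing changes, too strong and popular but irrelevant artifacts win.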

nha commented 4 years ago

A search for "turtle" "clj-client" "com.turtlequeue" does not return https://cljdoc.org/d/com.turtlequeue/clj-client/0.0.7 (I have not yet found a way to see it)

martinklepsch commented 4 years ago

@nha I think the problem is that only specific group IDs (namely org.clojure) on Maven Central are whitelisted to be included in the search. Reasoning being that we don't want to actually search all of Maven Central :) Maybe we can add the turtlequeue stuff...

https://github.com/cljdoc/cljdoc/blob/1eaaac904b275ebe581f1775a3c4b5bb44e43bdd/src/cljdoc/server/search/artifact_indexer.clj#L39-L47

nha commented 4 years ago

@martinklepsch right that seems like a quick fix, something like this:

http://search.maven.org/solrsearch/select?q=g:%22org.clojure%22+OR+g:%22com.turtlequeue%22&rows=200

I can submit a PR if you agree with the above

holyjak commented 4 years ago

As soon as we get to 3-4 different groups in Maven Central, we should move them from a hardcoded string to a config file / DB table. For now it seems there will be few groups and little churn, so hardcoding is OK.

martinklepsch commented 4 years ago

@nha I'm ok with hardcoding for now 👍 Maybe use a function to URI encode instead of just manually doing it, that would at least improve readability.
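A minimal sketch of building that query with an encoding function rather than by hand, in Python for illustration (cljdoc itself is Clojure, so this is not the project's code). The endpoint and the `g:"..."` query shape come from the URL quoted above; `maven_search_url` is a hypothetical name.

```python
from urllib.parse import urlencode

MAVEN_SEARCH_URL = "http://search.maven.org/solrsearch/select"


def maven_search_url(group_ids, rows=200):
    # Build the readable query first: g:"org.clojure" OR g:"com.turtlequeue"
    query = " OR ".join('g:"{}"'.format(group) for group in group_ids)
    # urlencode percent-encodes the quotes and colons and turns spaces into "+".
    return MAVEN_SEARCH_URL + "?" + urlencode({"q": query, "rows": rows})


print(maven_search_url(["org.clojure", "com.turtlequeue"]))
```

The output encodes `:` as `%3A`, which the hand-written URL left literal; both forms decode to the same query. Keeping the group IDs in a list also makes a later move to a config file straightforward.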

cloojure commented 4 years ago

Hi - I am still unable to find org.clojure/tools.reader or the ns clojure.tools.reader.edn via the search box. See slack question: https://clojurians.slack.com/archives/C8V0BQ0M6/p1577855041031500

seancorfield commented 4 years ago

As a follow-up to that: searching for tools.reader does find it, but it's the last option in the list (and what look like worse matches are higher in the list).

holyjak commented 4 years ago

I know, I am sorry. That should be fixed by #359 but boosting search results so that you get the desired results is sadly quite complicated. I will get back to the PR in the coming weeks and try to finish it.

holyjak commented 3 years ago

Searching for com.fulcrologic does not find anything, though a search for fulcro shows com.fulcrologic/fulcro in the list.

holyjak commented 2 years ago

@tobias has done a great job of improving Clojars search including boosting results by download counts. We should copy his work - see https://github.com/clojars/clojars-web/issues/719#issuecomment-1019525194 for details

lread commented 2 years ago

FWIW, the regular releases are prioritized over SNAPSHOTS now. See #551.

lread commented 2 years ago

@holyjak I can take a crack at bringing over the clojars work.

lread commented 2 years ago

Download count timespan

After taking a peek, it seems that clojars ranks by downloads over all time? Cljdoc's currently tracked clojars download stats (which seem to be inactive, by the way) seem to cover the last n days (currently configured to 380). I'm feeling the last year (or so) of downloads might be a more relevant metric than all-time downloads. Perhaps some library was once popular but has since been superseded by another, for example by honeysql v2 or next.jdbc.
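The difference between all-time and recent downloads can be sketched as follows, in Python for illustration. The per-day dictionary shape and the name `recent_downloads` are assumptions for this example; only the 380-day window comes from the text above.

```python
from datetime import date, timedelta


def recent_downloads(daily_counts, today, days=380):
    """Sum downloads from the last `days` days.

    daily_counts is assumed to map a date to the downloads on that day.
    """
    cutoff = today - timedelta(days=days)
    return sum(n for day, n in daily_counts.items() if cutoff < day <= today)


# Invented numbers: a library that was popular years ago but is quiet now.
counts = {date(2020, 1, 1): 100, date(2022, 1, 1): 5}
print(recent_downloads(counts, today=date(2022, 2, 1)))              # recent window only
print(recent_downloads(counts, today=date(2022, 2, 1), days=10000))  # effectively all time
```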

Maven has no download count

Some libs are hosted on Maven Central instead of clojars, notably org.clojure. There are no publicly available download stats for Maven Central. Perhaps we'll weigh org.clojure libs very heavily?

Source-based libs have no download count

Not implemented yet (#459), so we won't worry about these for now.

lread commented 2 years ago

So looking a bit deeper into this. Clojars supports lucene search syntax.

Cljdoc currently tries to find the best match without any lucene syntax. This can be a bit tricky/opaque.

I'm thinking going the clojars route makes more sense. Like clojars, we'd search all fields by default, but if you aren't finding what you are looking for you can get specific. We'd limit ourselves to specific fields: group-id, artifact-id, and pom description (clojars also offers url, at, and licenses).

One difference between clojars and cljdoc search is that cljdoc presents results as you type. I think the auto-suggest approach works well for cljdoc. I'll experiment with the effect of lucene syntax on auto-suggest.

On the topic of description, I find it a bit confusing to free-text search on something that is not presented to the user. I might experiment with showing the description in suggest results.

As always, am happy to hear feedback/questions/concerns.

lread commented 2 years ago

I'm thinking going the clojars route makes more sense.

Well Lee, maybe not (hey no one else is responding, so why not? 🙂)

To present results as you type, we need partial match support. So... maybe we won't entirely be taking the clojars search route.
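To make "partial match support" concrete, here is a minimal prefix-matching sketch in Python. It is an illustration under assumptions, not how cljdoc's Lucene-based search is implemented; `tokens` and `matches_as_you_type` are hypothetical names.

```python
import re


def tokens(text):
    # Split on anything that is not a letter or digit,
    # so "metosin/jsonista" becomes ["metosin", "jsonista"].
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def matches_as_you_type(query, artifact):
    """True if every query term is a prefix of some token of the artifact coordinate."""
    artifact_tokens = tokens(artifact)
    return all(
        any(token.startswith(term) for token in artifact_tokens)
        for term in tokens(query)
    )


print(matches_as_you_type("metosin json", "metosin/jsonista"))
print(matches_as_you_type("tools.rea", "org.clojure/tools.reader"))
```

A half-typed term such as "rea" still matches "reader" here, which an exact-term query would miss.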

I shall ponder and play.

holyjak commented 2 years ago

Thanks a lot for taking over! I had wanted to take up the work again for a long time but never felt I had enough time to really dive in.

I agree with adding extra weight to org.clojure results; that's what I'd do. It is not perfect because some contrib libs are not actively used anymore - e.g. the jdbc one is replaced by next.jdbc - but it is better than nothing :)

If we also search description, then I would give it less weight than matches on group / artifact id.
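The weighting idea could look roughly like this sketch, written in Python for illustration. The weight values, the substring matching and the sample documents are all assumptions invented for the example, not cljdoc's configuration.

```python
# Hypothetical weights: description counts less than group / artifact id.
FIELD_WEIGHTS = {"artifact-id": 3.0, "group-id": 3.0, "description": 1.0}


def weighted_score(query, doc):
    """Add up the weight of every field whose text contains the query."""
    q = query.lower()
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if q in doc.get(field, "").lower()
    )


# Invented documents: one matches on its ids, the other only in its description.
by_id = {"group-id": "example", "artifact-id": "json-lib", "description": "A parser"}
by_description = {"group-id": "example", "artifact-id": "other", "description": "Reads JSON files"}
print(weighted_score("json", by_id), weighted_score("json", by_description))
```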

Yes, partial matches complicate stuff...

lread commented 2 years ago

Thanks @holyjak, I'll do my best!

I think I went a bit overboard with the idea of supporting lucene syntax like clojars does and have since abandoned that notion.

I think this general issue was great for collecting initial feedback. Are you ok with now closing it in favor of creating focused issues like #568?