Search quality - Githubissues

xaqq commented 9 years ago

Hello,

When searching for zmq on the dub package registry (http://code.dlang.org/search?q=zmq) only one result is returned: zmq-d.

There is however another package, zmqd (the one I was looking for) that exists (http://code.dlang.org/packages/zmqd), but the search didn't return it. I believe this to be an issue.

Thanks

MartinNowak commented 9 years ago

We should set up an ElasticSearch instance to solve these issues. This would also allow us to search in the readme, fix spelling, and provide suggestions.

MartinNowak commented 9 years ago

Just found https://github.com/intrica/elasticsearch-d, but didn't find it using the search :(. Maybe @intrica can help us set up an ElasticSearch server for the dub-registry?

s-ludwig commented 9 years ago

In the meantime (and as long as we are as small as now), we could also go for the brute force route and simply scan through all packages linearly. I could quickly add something using levenshteinDistance and see how fast it is.
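A minimal sketch of the brute-force idea, written in Python for illustration rather than taken from the registry's code: every package name is compared against the query with an edit distance, and names within a threshold are returned closest first. The function names, the `max_distance` threshold, and the sample package list are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic full dynamic-programming matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete i characters of a
    for j in range(cols):
        dist[0][j] = j  # insert j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[-1][-1]


def search(query: str, names: list[str], max_distance: int = 2) -> list[str]:
    """Linear scan over all names; max_distance is a hypothetical threshold."""
    query = query.lower()
    scored = [(levenshtein(query, name.lower()), name) for name in names]
    return [name for distance, name in sorted(scored) if distance <= max_distance]


# Hypothetical sample of package names.
packages = ["zmq-d", "zmqd", "vibe-d", "elasticsearch-d"]
print(search("zmq", packages))
```

With this sample, both `zmqd` (distance 1) and `zmq-d` (distance 2) are found for the query `zmq`. A fixed absolute threshold is a crude choice for short queries, so scaling it with the query length may be worth trying.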

s-ludwig commented 9 years ago

A simple version is running now, seems to be fast enough for the moment. It still needs some tweaking, though (e.g. any three letter sequence will match any three letter word).

s-ludwig commented 9 years ago

Should be relatively usable now. It would be nice to also match parts of a word, so that for example "elastic search" matches "elasticsearch".
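One possible way to make "elastic search" match "elasticsearch" is to strip separators from both the query and the package name before comparing. The Python sketch below only illustrates that idea; `normalize` and `matches` are hypothetical helpers, and the ASCII-only character class is an assumption.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def matches(query: str, name: str) -> bool:
    """True if the separator-free query occurs inside the separator-free name."""
    return normalize(query) in normalize(name)


print(matches("elastic search", "elasticsearch-d"))
print(matches("zmq", "zmqd"), matches("zmq", "zmq-d"))
```

The same normalization also makes `zmq` match both `zmq-d` and `zmqd`, the case from the original report.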

MartinNowak commented 9 years ago

> I could quickly add something using levenshteinDistance and see how fast it is.

Guess you don't know this? https://github.com/D-Programming-Language/dub/blob/ad9b43d6a150b53872db73c161e745c000cf8577/source/dub/internal/utils.d#L246 https://github.com/D-Programming-Language/dub/pull/453

MartinNowak commented 9 years ago

> A simple version is running now, seems to be fast enough for the moment. It still needs some tweaking, though (e.g. any three letter sequence will match any three letter word).

And there is an easy opportunity to make the function much faster. Also there is a more space efficient variant of the algorithm. http://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows Issue 13834 – make levenshteinDistance @nogc
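For reference, the two-row variant described in the linked Wikipedia section keeps only the previous and the current row of the distance matrix, so memory use is proportional to the shorter string instead of the product of both lengths. This is a Python sketch of that variant, not the Phobos or dub implementation; the function name is made up for the example.

```python
def levenshtein_two_rows(a: str, b: str) -> int:
    """Edit distance that stores only two rows of the matrix at a time."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so the rows stay short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i] + [0] * len(b)
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current[j] = min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            )
        previous = current
    return previous[-1]


print(levenshtein_two_rows("kitten", "sitting"))
```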

s-ludwig commented 9 years ago

> Guess you don't know this?

Oh, nice surprise ;) No, I just now stumbled over parts of an earlier discussion, but basically missed everything that happened around December.

dmonagle commented 9 years ago

I really should improve my email habits. I only just noticed this.

Elasticsearch is pretty awesome and very easy to set up. The only downside is obviously the infrastructure needed to run the server. They recommend running a cluster of three instances; however, I have many times just run it as a single instance when there is only a small amount of data to index, since rebuilding the whole index from scratch only takes minutes for a small database.

That's the catch, though: you need to deploy an Elasticsearch instance, preferably sandboxed in some way to prevent it from hogging resources. My job projects are currently deployed using CoreOS, so I just run Elasticsearch in a Docker container next to it. Works beautifully.

If you are ever serious about doing this, I'd be happy to help. I think the D site and the dub registry could do with a dose of modernisation :-)

MartinNowak commented 9 years ago

Sounds good, and any help making the websites nice is highly appreciated. @s-ludwig can we start this by setting up an elasticsearch server? Also the download stats require an update of MongoDB to 2.6.

In the long run docker might indeed be interesting to deploy the registry.

MartinNowak commented 9 years ago

Mmh, a quick attempt ran into the mentioned memory issues.

Mar 20 14:38:56 kvm1.dawg.eu systemd[1]: PID file /var/run/elasticsearch/elasticsearch.pid not readable (yet?) after start.
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fd081000...no=13)
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: #
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: # There is insufficient memory for the Java Runtime Environment to continue.
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: # Native memory allocation (malloc) failed to allocate 2555904 bytes for ...emory.
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: # An error report file with more information is saved as:
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: # /tmp/jvm-7065/hs_error.log

Not configured for small KVM servers I guess, and JVMs tend to be a memory hog. What about alternatives like sphinx (http://www.scribd.com/doc/33308510/MongoDB-Full-Text-Search-With-Sphinx)?

MartinNowak commented 9 years ago

Xapian might be an interesting candidate as well, because we can directly integrate it in the app.

MartinNowak commented 8 years ago

The easiest way would be to use MongoDB 3.2's new Text Index feature, which allows a weighted search on all the text columns of our packages.

MartinNowak commented 8 years ago

Can we upgrade the production db @s-ludwig? I could run tests locally (already using MongoDB 3.2.4). I'd be interested in more details about the server setup and a recent (anonymized) db dump anyhow.

s-ludwig commented 8 years ago

The server is running the Ubuntu 14.04 LTS release and MongoDB is currently from the "10gen" repository (2.4.13). I have a commented out line for 3.x in my apt file, but I don't remember why I had to switch back. I've enabled the 3.x branch again now and it seems to run fine with 3.0.12.

Regarding the DB dump, it may work to just take the "packages" and "downloads" collections and not copy the user database at all. I'll have a try later today.

MartinNowak commented 8 years ago

At some point vibe.d had issues w/ MongoDB 3.x, but they were fixed. https://github.com/rejectedsoftware/vibe.d/issues/202 https://github.com/rejectedsoftware/vibe.d/issues/1243

MartinNowak commented 8 years ago

Can we make another small step to 3.2 @s-ludwig? https://docs.mongodb.com/v3.2/tutorial/install-mongodb-on-ubuntu/#install-mongodb-community-edition They improved a few things about token delimiters and case/diacritic insensitivity. https://docs.mongodb.com/v3.2/core/index-text/

s-ludwig commented 8 years ago

Running on 3.2 now!

MartinNowak commented 8 years ago

> Running on 3.2 now!

Great, I'll have a try at scraping the packages and downloads from the running site.

MartinNowak commented 8 years ago

> Great, I'll have a try at scraping the packages and downloads from the running site.

Somewhat tricky b/c the package info is subtly different. Could you help me out by sending me

mongoexport --db vpmreg --collection packages --out packages.json
mongoexport --db vpmreg --collection downloads --out downloads.json

via mail @s-ludwig? I can replace the user references with some dummy users. But I need many packages w/ real search terms, categories, and downloads to improve searching, ranking, et al.

WebFreak001 commented 7 years ago

As there isn't any activity on this, could you send me the dummy data too? I think making the search contain 3 sections (current search results, "just contains" and fuzzy search in that order) would be a good idea. It shows the more relevant packages with the current results first, then the packages that you would find by Ctrl-F on the old version of the website and then some other ones (fuzzy search should probably only start with 3 characters or more).
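A rough Python sketch of how the three sections could be assembled. The existing engine's output is passed in as `current_results`, since how it is produced is not shown here. The fuzzy section uses `difflib.get_close_matches` from the Python standard library as a stand-in for a Levenshtein-based match, and the `0.6` cutoff, the function name, and the sample names are all assumptions.

```python
import difflib


def sectioned_results(query: str, names: list[str], current_results: list[str]) -> dict:
    """Group results as: current engine output, plain substring hits, fuzzy hits.

    Assumes package names are already lowercase.
    """
    q = query.lower()
    seen = set(current_results)

    # Section 2: what Ctrl-F on a full package list would find.
    contains = [name for name in names if q in name and name not in seen]
    seen.update(contains)

    # Section 3: fuzzy matches, only for queries of three or more characters.
    fuzzy = []
    if len(q) >= 3:
        candidates = [name for name in names if name not in seen]
        fuzzy = difflib.get_close_matches(q, candidates, n=10, cutoff=0.6)

    return {"current": list(current_results), "contains": contains, "fuzzy": fuzzy}


# Hypothetical sample data.
names = ["zmq-d", "zmqd", "zeromq", "vibe-d"]
print(sectioned_results("zmq", names, current_results=["zmq-d"]))
```

Each name appears in at most one section, so a package already returned by the current search is not repeated further down.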

s-ludwig commented 7 years ago

You can now directly mirror the database like this:

./dub-registry --mirror https://code.dlang.org

The raw data can be queried at http://code.dlang.org/api/packages/dump (download numbers not included)

I think the basic approach is good, but we might need to add a more complex weighting scheme rather than just strictly ordering by "quality class". Similar to how Google sometimes shows some "good" results that are no exact matches at the front of the list. Although with 1k packages this isn't hugely important yet.

WebFreak001 commented 7 years ago

imo that command should get added to the README

WebFreak001 commented 7 years ago

Oh, also: I need to run that command with http://, because I had to add VibeNoSSL as a version for it to compile. OpenSSL has been broken with vibe for a while, and it still tries to use OpenSSL even if I add Botan.

MartinNowak commented 7 years ago

Reading https://www.sqlite.org/fts3.html and https://www.sqlite.org/fts5.html in more detail we might be able to build a decent search on top of sqlite. Using Contentless Tables to index the READMEs (if they are really that big) and fts4aux/fts5vocab for auto-completion of search terms. Just operating an elasticsearch instance would be another option.

MartinNowak commented 7 years ago

> I think making the search contain 3 sections (current search results, "just contains" and fuzzy search in that order) would be a good idea.

Sounds confusing from a user perspective. Fuzzy search isn't that important; if people make typos, they'll notice.

> I think the basic approach is good, but we might need to add a more complex weighting scheme rather than just strictly ordering by "quality class". Similar to how Google sometimes shows some "good" results that are no exact matches at the front of the list. Although with 1k packages this isn't hugely important yet.

Just some combination of text matching with https://github.com/dlang/dub-registry/issues/159 should work fine. Maybe sth. along the lines of SELECT rowid FROM fts WHERE fts MATCH ? ORDER BY rank + weight; would already work fine enough.
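To make the query above concrete, here is a small self-contained sketch using Python's `sqlite3` module with an FTS5 table. It assumes the bundled SQLite was built with FTS5. The `packages` table, its `score` column, and all sample rows are made up for the example. FTS5's `rank` is lower for better matches, so the sketch subtracts the score instead of adding a weight.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE packages(id INTEGER PRIMARY KEY, name TEXT, description TEXT, score REAL);
    CREATE VIRTUAL TABLE fts USING fts5(name, description);
""")

# Hypothetical rows: descriptions and scores are invented for illustration.
rows = [
    (1, "zmq-d", "ZeroMQ bindings", 0.0),
    (2, "zmqd", "ZeroMQ wrapper", 10.0),
    (3, "vibe-d", "HTTP framework", 50.0),
]
for pkg_id, name, description, score in rows:
    conn.execute("INSERT INTO packages VALUES (?, ?, ?, ?)",
                 (pkg_id, name, description, score))
    conn.execute("INSERT INTO fts(rowid, name, description) VALUES (?, ?, ?)",
                 (pkg_id, name, description))


def search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Full-text match, ordered by text rank combined with the package score."""
    sql = """
        SELECT p.name
        FROM fts JOIN packages AS p ON p.id = fts.rowid
        WHERE fts MATCH ?
        ORDER BY fts.rank - p.score  -- lower is better, so a high score moves a row up
    """
    return [name for (name,) in conn.execute(sql, (query,))]


print(search(conn, "zmq"))   # whole-token match
print(search(conn, "zmq*"))  # prefix match
```

With the default tokenizer, `zmq-d` is split into the tokens `zmq` and `d`, while `zmqd` stays a single token. The plain query `zmq` therefore finds only `zmq-d`, and the prefix query `zmq*` is needed to find both.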

MartinNowak commented 7 years ago

Elasticsearch might also be reusable for dlang.org.

MartinNowak commented 3 years ago

Maybe https://www.algolia.com/

pbackus commented 3 years ago

Frankly even something as simplistic as https://www.google.com/search?q=site:code.dlang.org %s would be an improvement over the status quo.

sarneaud commented 3 years ago

Some thoughts (because I've had to do this a few times now):

- Algolia is powerful, but there's a pretty big downside for a project like this: data dependency on a third-party service (a bigger deal than just an API dependency). If you want the community to contribute improvements to dub-registry, it needs to be easy for people to just download the code and set up their own instances. Let's face it, support for that won't be well maintained because the only people who'll notice it's broken are new contributors (who'll just walk away).
- Elasticsearch is also powerful, but the embedded FTS implementations are much less overhead in terms of day-to-day development because everything's self-contained, and the index being a file on disk integrates better with build pipelines, etc. External servers like Elasticsearch are better when you have replicated frontends dynamically updating a central data source. Dub isn't like that.
- Specialised embedded FTS tools like Xapian are pretty cool.
- The FTS features of DBs like Sqlite and Postgres are really nice if you're already using those DBs (otherwise other tools are more powerful). Moving all data to Sqlite or PG is obviously a whole bigger decision.
- There are only ~2000 packages to search. Even if it were x1000, this wouldn't be a hard problem. There's a trap for CS graduates when they try to implement practical FTS: focussing too much on efficiently solving the exact substring match problem because that's what looks most like a CS textbook problem. That's the easy problem, though, and it's solved to death. What makes an FTS implementation good is stuff like spelling correction, approximate and semantic matching, sensible ranking, etc.

Imperatorn commented 3 years ago

https://github.com/dlang/dub-registry/pull/497

This works. I've tried it and didn't see any problems with performance

dd86k commented 2 years ago

Quality-wise it only seems to be doing searches by words.

e.g., searching for "blak" returns nothing, "blake" again nothing, and "blake2" returns my blake2-d package. I'm unaware how MongoDB works, but at least with a typical SQL database server it's possible to (safely) prepend and append % to extend the search.
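To illustrate the SQL approach mentioned above (this is not how the registry's MongoDB search works), the following Python sketch uses SQLite with a parameterized `LIKE` query. The user's input is escaped so that `%` and `_` typed by the user are matched literally, and the wildcards are added around it afterwards. The table layout and sample rows are assumptions.

```python
import sqlite3


def like_pattern(term: str) -> str:
    """Escape LIKE wildcards in user input, then wrap it in % for substring search."""
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return f"%{escaped}%"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages(name TEXT)")
# Hypothetical sample rows; "100%_pure" exists only to exercise the escaping.
conn.executemany("INSERT INTO packages VALUES (?)",
                 [("blake2-d",), ("zmqd",), ("100%_pure",)])


def search(conn: sqlite3.Connection, term: str) -> list[str]:
    """Substring search with a bound parameter, so the input never reaches the SQL text."""
    sql = "SELECT name FROM packages WHERE name LIKE ? ESCAPE '\\' ORDER BY name"
    return [name for (name,) in conn.execute(sql, (like_pattern(term),))]


print(search(conn, "blak"))
```

Here the partial term `blak` finds `blake2-d`, which a word-based search misses.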

It does seem to have text search but I'm unaware how it's implemented in the vibe-d package and other details.

dlang / dub-registry

Search quality #93