Open xaqq opened 9 years ago
We should set up an ElasticSearch instance to solve these issues. This would also allow us to search in the README, fix spelling, and provide suggestions.
Just found https://github.com/intrica/elasticsearch-d, didn't find it using the search though :(. Maybe @intrica can help us set up an ElasticSearch server for the dub-registry?
In the meantime (and as long as we are as small as now), we could also go for the brute force route and simply scan through all packages linearly. I could quickly add something using levenshteinDistance and see how fast it is.
A simple version is running now, seems to be fast enough for the moment. It still needs some tweaking, though (e.g. any three letter sequence will match any three letter word).
Should be relatively usable now. It would be nice to also match parts of a word, so that for example "elastic search" matches "elasticsearch".
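For illustration, a minimal sketch of one way to make "elastic search" match "elasticsearch": strip everything except letters and digits from both the query and the package name, then do a substring test. This is Python rather than the registry's own code, and the function names `normalize` and `matches` are hypothetical.

```python
import re

def normalize(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def matches(query, package_name):
    """True if the normalized query occurs anywhere in the normalized name."""
    q = normalize(query)
    # An empty query would be a substring of everything, so reject it.
    return bool(q) and q in normalize(package_name)

print(matches("elastic search", "elasticsearch-d"))
print(matches("zmq", "zmqd"))
```

The same normalization also lets a query for part of a name (e.g. "zmq") find packages whose name merely contains it.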
I could quickly add something using levenshteinDistance and see how fast it is.
Guess you don't know this? https://github.com/D-Programming-Language/dub/blob/ad9b43d6a150b53872db73c161e745c000cf8577/source/dub/internal/utils.d#L246 https://github.com/D-Programming-Language/dub/pull/453
A simple version is running now, seems to be fast enough for the moment. It still needs some tweaking, though (e.g. any three letter sequence will match any three letter word).
And there is an easy opportunity to make the function much faster. Also there is a more space efficient variant of the algorithm. http://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows Issue 13834 – make levenshteinDistance @nogc
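A minimal sketch of the two-row variant mentioned above, in Python for illustration (this is not the Phobos `levenshteinDistance` implementation). Only the previous and current rows of the edit matrix are kept, so memory is proportional to the shorter string. The helper `fuzzy_match` and its "one edit per four characters" budget are assumptions, added to show one way to avoid the problem that any three letter sequence matches any three letter word.

```python
def levenshtein(a, b):
    """Edit distance using two rows instead of the full matrix."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_match(query, word):
    """Allow roughly one edit per four query characters (assumed budget)."""
    return levenshtein(query, word) <= len(query) // 4

print(levenshtein("kitten", "sitting"))
print(fuzzy_match("abc", "xyz"))
```

With a budget that scales with the query length, a three letter query only matches exactly, while longer queries tolerate a typo or two.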
Guess you don't know this?
Oh, nice surprise ;) No, I just now stumbled over parts of an earlier discussion, but basically missed everything that happened around December.
I really should improve my email habits. I only just noticed this.
Elasticsearch is pretty awesome and very easy to set up. The only downside is obviously the infrastructure needed to run the server. They recommend a cluster of three instances; however, I have many times just run it as a single instance when there is only a small amount of data to index, as rebuilding the whole index from scratch only takes minutes for a small database.
That's the catch though, you need to deploy an elasticsearch instance, preferably sandboxed in some way to prevent it from hogging resources. My job projects are currently deployed using CoreOS so I just run elasticsearch in a Docker container next to it. Works beautifully.
If you're ever serious about doing this, I'd be happy to help. I think the D site and the dub registry could do with a dose of modernisation :-)
Sounds good, and any help making the websites nice is highly appreciated. @s-ludwig can we start this by setting up an elasticsearch server? Also the download stats require an update of MongoDB to 2.6.
In the long run docker might indeed be interesting to deploy the registry.
Mmh, a quick attempt ran into the mentioned memory issues.
Mar 20 14:38:56 kvm1.dawg.eu systemd[1]: PID file /var/run/elasticsearch/elasticsearch.pid not readable (yet?) after start.
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fd081000...no=13)
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: #
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: # There is insufficient memory for the Java Runtime Environment to continue.
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: # Native memory allocation (malloc) failed to allocate 2555904 bytes for ...emory.
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: # An error report file with more information is saved as:
Mar 20 14:38:56 kvm1.dawg.eu elasticsearch[7059]: # /tmp/jvm-7065/hs_error.log
Not configured for small KVM servers I guess, and JVMs tend to be memory hogs. What about alternatives like Sphinx (http://www.scribd.com/doc/33308510/MongoDB-Full-Text-Search-With-Sphinx)?
Xapian might be an interesting candidate as well, because we can directly integrate it in the app.
The easiest way would be to use MongoDB 3.2's new Text Index feature, which allows us to do a weighted search on all the text columns of our packages.
Can we upgrade the production db @s-ludwig? I could run tests locally (already using MongoDB 3.2.4). I'd be interested in more details about the server setup and a recent (anonymized) db dump anyhow.
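A rough sketch of what a weighted text index and a relevance-sorted query could look like, written in Python against a pymongo-style collection object for illustration (the registry itself talks to MongoDB through vibe.d). The field names `name` and `description`, the weights, and the index name are all assumptions, not the registry's actual schema.

```python
def ensure_text_index(collection):
    """Create one text index over several fields, weighting name matches higher.

    `collection` is expected to behave like a pymongo Collection.
    """
    return collection.create_index(
        [("name", "text"), ("description", "text")],  # hypothetical fields
        weights={"name": 10, "description": 2},        # hypothetical weights
        name="package_text_search",
    )

def text_search(collection, query, limit=20):
    """Run a $text query and sort by the text score MongoDB computes."""
    cursor = collection.find(
        {"$text": {"$search": query}},
        {"score": {"$meta": "textScore"}},
    )
    return cursor.sort([("score", {"$meta": "textScore"})]).limit(limit)
```

Note that `$text` matches whole (stemmed) words, so on its own it would not find partial names.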
The server is running the Ubuntu 14.04 LTS release and MongoDB is currently from the "10gen" repository (2.4.13). I have a commented out line for 3.x in my apt file, but I don't remember why I had to switch back. I've enabled the 3.x branch again now and it seems to run fine with 3.0.12.
Regarding the DB dump, it may work to just take the "packages" and "downloads" collections and not copy the user database at all. I'll have a try later today.
At some point vibe.d had issues w/ MongoDB 3.x, but they were fixed. https://github.com/rejectedsoftware/vibe.d/issues/202 https://github.com/rejectedsoftware/vibe.d/issues/1243
Can we make another small step to 3.2 @s-ludwig? https://docs.mongodb.com/v3.2/tutorial/install-mongodb-on-ubuntu/#install-mongodb-community-edition They improved a few things about token delimiters and case/diacritic insensitivity. https://docs.mongodb.com/v3.2/core/index-text/
Running on 3.2 now!
Great, I'll have a try at scraping the packages and downloads from the running site.
Somewhat tricky b/c the package info is subtly different. Could you help me out by sending me
mongoexport --db vpmreg --collection packages --out packages.json
mongoexport --db vpmreg --collection downloads --out downloads.json
via mail @s-ludwig? I can replace the user references with some dummy users. But I need many packages w/ real search terms, categories, and downloads to improve searching, ranking, et al.
As there isn't any activity on this, could you send me the dummy data too? I think making the search contain 3 sections (current search results, "just contains" and fuzzy search in that order) would be a good idea. It shows the more relevant packages with the current results first, then the packages that you would find by Ctrl-F on the old version of the website and then some other ones (fuzzy search should probably only start with 3 characters or more).
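To make the three-section idea concrete, here is a small Python sketch under stated assumptions: the first section is approximated by whole-word matches (the registry's real "current search" may differ), the second is a plain substring test, and the third uses `difflib.SequenceMatcher` from the standard library as a stand-in fuzzy matcher, only for queries of three or more characters. The function names and the 0.8 cutoff are hypothetical.

```python
import re
from difflib import SequenceMatcher

def tokens(text):
    """Split lowercased text into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tiered_search(packages, query, fuzzy_cutoff=0.8):
    """Return (word matches, substring matches, fuzzy matches) as name lists.

    `packages` maps package name to description. Each package lands in the
    first section it qualifies for.
    """
    q = query.lower().strip()
    exact, contains, fuzzy = [], [], []
    for name, description in sorted(packages.items()):
        text = f"{name} {description}".lower()
        words = tokens(text)
        if q in words:
            exact.append(name)
        elif q in text:
            contains.append(name)
        elif len(q) >= 3 and any(
            SequenceMatcher(None, q, w).ratio() >= fuzzy_cutoff for w in words
        ):
            fuzzy.append(name)
    return exact, contains, fuzzy

packages = {
    "zmq-d": "ZeroMQ bindings",
    "zmqd": "thin ZeroMQ wrapper",
    "elasticsearch-d": "client for Elasticsearch",
}
print(tiered_search(packages, "zmq"))
```

Keeping the sections as separate lists leaves it to the page whether to show them under separate headings or as one ranked list.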
You can now directly mirror the database like this: ./dub-registry --mirror https://code.dlang.org
The raw data can be queried at http://code.dlang.org/api/packages/dump (download numbers not included)
I think the basic approach is good, but we might need to add a more complex weighting scheme rather than just strictly ordering by "quality class". Similar to how Google sometimes shows some "good" results that are no exact matches at the front of the list. Although with 1k packages this isn't hugely important yet.
imo that command should get added to the README
Oh, also I need to run that command with http:// because I needed to add VibeNoSSL as a version for it to compile: OpenSSL has been broken for a while with vibe, and it still tries to use OpenSSL even if I add Botan.
Reading https://www.sqlite.org/fts3.html and https://www.sqlite.org/fts5.html in more detail we might be able to build a decent search on top of sqlite. Using Contentless Tables to index the READMEs (if they are really that big) and fts4aux/fts5vocab for auto-completion of search terms. Just operating an elasticsearch instance would be another option.
I think making the search contain 3 sections (current search results, "just contains" and fuzzy search in that order) would be a good idea.
Sounds confusing from a user perspective. The need for fuzzy search isn't that important, if people make typos they'll notice.
I think the basic approach is good, but we might need to add a more complex weighting scheme rather than just strictly ordering by "quality class". Similar to how Google sometimes shows some "good" results that are no exact matches at the front of the list. Although with 1k packages this isn't hugely important yet.
Just some combination of text matching with https://github.com/dlang/dub-registry/issues/159 should work fine.
Maybe something along the lines of SELECT rowid FROM fts WHERE fts MATCH ? ORDER BY rank + weight; would already work well enough.
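A self-contained sketch of that query shape using Python's bundled `sqlite3` module, assuming it was built with FTS5 enabled. The `packages` table, its `weight` column, and the sample rows are hypothetical. FTS5's `rank` is lower (more negative) for better matches, so this sketch subtracts a "higher is better" weight; with a "lower is better" weight the `rank + weight` form above applies directly.

```python
import sqlite3

def build_index(packages):
    """Create an in-memory DB with a contentless FTS5 index over the packages."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE packages (id INTEGER PRIMARY KEY, name TEXT, weight REAL)")
    # content='' makes the index contentless: only the search index is stored.
    db.execute("CREATE VIRTUAL TABLE fts USING fts5(name, description, readme, content='')")
    for i, p in enumerate(packages, 1):
        db.execute("INSERT INTO packages VALUES (?, ?, ?)", (i, p["name"], p["weight"]))
        db.execute(
            "INSERT INTO fts (rowid, name, description, readme) VALUES (?, ?, ?, ?)",
            (i, p["name"], p["description"], p["readme"]),
        )
    return db

def search(db, query):
    """Order by text relevance combined with a per-package weight."""
    rows = db.execute(
        "SELECT p.name FROM fts JOIN packages p ON p.id = fts.rowid "
        "WHERE fts MATCH ? ORDER BY fts.rank - p.weight",
        (query,),
    )
    return [r[0] for r in rows]

db = build_index([
    {"name": "jsonlite", "description": "json parser", "readme": "", "weight": 0.0},
    {"name": "fastjson", "description": "json parser", "readme": "", "weight": 5.0},
])
print(search(db, "json"))
```

Raw user input would need escaping before being passed to MATCH, since characters such as `-` have meaning in the FTS5 query syntax.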
Elasticsearch might also be reusable for dlang.org.
Maybe https://www.algolia.com/
Frankly even something as simplistic as https://www.google.com/search?q=site:code.dlang.org %s would be an improvement over the status quo.
Some thoughts (because I've had to do this a few times now):
https://github.com/dlang/dub-registry/pull/497
This works. I've tried it and didn't see any problems with performance.
Quality-wise it only seems to match whole words.
e.g., searching for "blak" returns nothing, "blake" again nothing, and "blake2" returns my blake2-d package. I don't know how MongoDB works, but at least with a typical SQL database server it's possible to (safely) prepend and append % to extend the search.
It does seem to have text search but I'm unaware how it's implemented in the vibe-d package and other details.
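For comparison, the closest MongoDB counterpart to SQL's `LIKE '%term%'` is an unanchored `$regex` filter. The sketch below (Python, illustrative only; the field name `name` is an assumption about the schema) escapes the user input so it is treated literally, and includes a local check with Python's `re` module to show which names such a pattern would hit.

```python
import re

def substring_filter(query, field="name"):
    """Build a MongoDB filter equivalent to a case-insensitive LIKE '%query%'."""
    # re.escape keeps characters like '.' or '*' from acting as regex operators.
    return {field: {"$regex": re.escape(query), "$options": "i"}}

def local_matches(query, names):
    """Apply the same escaped, case-insensitive pattern to a list of names."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return [n for n in names if pattern.search(n)]

print(substring_filter("blak"))
print(local_matches("blak", ["blake2-d", "vibe-d"]))
```

An unanchored regex generally cannot use an index efficiently, so this is a scan over all documents; with around a thousand packages that is probably acceptable.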
Hello,
When searching for zmq on the dub package registry (http://code.dlang.org/search?q=zmq) only one result is returned: zmq-d. There is however another package, zmqd (the one I was looking for), that exists (http://code.dlang.org/packages/zmqd), but the search didn't return it. I believe this to be an issue. Thanks