Dalvany commented 1 year ago

This pull request aims to address #19 issue by adding a Tantivy index. While there is still room for improvement, it might be a first step.

Configuration

Add a section [search] with a field directory that contains where Tantivy should store its files.

[search]
directory = "/tmp/alexandrie/tantivy

Search

As it uses QueryParser you can use full Tantivy query language.

By default search use all fields except name.prefix (see below) and suggester search amongst name, name.full and name.prefix (see below).

Implementation

`fts` module

This module contains all full text search related structures.

Tantivy structure handles all boiler plate to setup an index, search and suggest. It also delegate method to index document, commit documents.

TantivyDocument is a structure that represents a crate and can be converted into a Tantivy's Document

Indices

Crate's name are index multiple times to improve both result relevance of suggester and search.

name : a simple tokenized version of crate's name :
- tokenize on non alphanumeric character using SimpleTokenizer
- apply English stop words
- apply lower-casing to make search case insensitive
name.full : not tokenized, only lower-cased. It's main purpose is to increase relevance when the searched text match exactly a crate name
name.prefix : index word prefix to handle suggester.
- tokenize on non alphanumeric characters
- lower case
- apply a custom filter, edge ngram to index word prefixes.

Other fields that are indexed :

categories are index using the same pipeline as name.full as they should be amongst a precise list
keywords are index using the same pipeline as ̀name` as they are free text
description and readme use the same pipeline as name.

Note that at search time, we should not apply apply the edge ngram filter to reduce noise.

How to index

When Alexandrie starts, it index everything.

Things that still need work

[x] ~~Actually running indexer endpoint causes 500 HTTP error when trying to access UI. It comes from a lock on the database since I browse all crates for indexing in a single transaction.~~ Use run method and index at startup instead in an endpoint.
[x] ~~Need to change API search endpoint as I only change frontend search~~
[x] ~~New crates aren't yet indexed~~

Hirevo commented 1 year ago

Hi,

I've seen that you closed the old PR and named this one as draft.
Do you wish for me to convert this PR to an actual draft PR ?

I'm sorry for not having commented on your old PR, I wasn't sure if you were expecting/ready for a review or just general feedback.

As someone who is new to working with search engines, It is very fascinating for me to learn from your code how full text indexing works and how to optimize the relevancy of the results.

I have tried your branch on my local testing setup and the automatic and fast indexing of the crates (~34000 crates) and the very notable improvement of the search results, all without needing much configuration at all blew me away !

Feel free to ping me anytime if you want me to start reviewing some code or other more general feedback !

Dalvany commented 1 year ago

Hi @Hirevo I closed the old PR as there was conflicts and find easier to redo thing on a proper branch instead of the forked master.

If you could, yes it would be nice to make it a proper draft as I still have to work on the API search endpoint but I must confess I'm doing a pause right now. If you wish, make any comments, feedback or ask for any explanation, I'll take care of things when I'll get back working on it :-)

For the performance part, I tested it with crates.io index (only indexing crate's name, no keywords or categories) and it got indexed in 15s (if I remember) when I do it on startup, while when I made a custom endpoint it took something like 10 minutes (I don't know how to explain such a difference, perhaps I didn't ran Alexandrie with release flag and/or it comes from the RWLock on the indexer).

Hirevo commented 1 year ago

I must confess I'm doing a pause right now

Understandable, no need to pressure yourself, there is no rush at all. :)

For the performance part, I tested it with crates.io index (only indexing crate's name, no keywords or categories) and it got indexed in 15s (if I remember) when I do it on startup

That's amazing !

My testing setup with the 34 000 crates is from the DB dump from crates.io and I have the rendered READMEs for all of them, but it is quite old, which is why it isn't up to the 116 000+ crates that crates.io has today.

I should get around to bring my setup more up-to-date to potentially also be able to contribute more substantial performance testing (even though I am not even sure if this kind of scale will ever be needed, and I suspect that people won't be constantly restarting their registry with 100 000+ crates).

Dalvany commented 1 year ago

I suspect that people won't be constantly restarting their registry with 100 000+ crates)

Yeah, you're right, especially since Alexandrie works great ! Good job by the way !

Dalvany commented 1 year ago

I have a remark for e2e testing, I think you should consider looking for integration testing framework. I only used two of them python's behave and cucumber-rs (they work the same way). I don't know if they're really a good fit here but I think it might worth looking into it.

Dalvany commented 1 year ago

Hi @Hirevo I added search to the api search endpoint. I had to introduce a limit to the number of result per page since you can't construct a Tantivy's TopDocs without. I set it to 15 and reflect that change in the book.

I don't index README as I don't think it's really useful: it will serve few use case. But if you want I can work on that.

Dalvany commented 1 year ago

There's a thing you should know, I use a crate I made, tantivy-analysis-contrib, to make the suggester work the way I intended to (doing the "word starts with XXX") as Tantivy provide's a ngram only as tokenizer not as a filter. A tokenizer is responsible of breaking the whole text into tokens (basically "words"), when a filter is responsible to "transform" each token (lowercasing, generating synonyms, ...etc).

Hirevo commented 1 year ago

I have a remark for e2e testing, I think you should consider looking for integration testing framework. I only used two of them python's behave and cucumber-rs (they work the same way). I don't know if they're really a good fit here but I think it might worth looking into it.

There is definitely lots of improvements to be made to the E2E testing setup, even just in terms of performance, as it takes quite some time for them to run to completion (~95 % of it is spent just building the docker image).
What I like with the current setup though is that the scenarios are simple shell scripts in which we can run cargo commands just as a user would do them.

I had to introduce a limit to the number of result per page since you can't construct a Tantivy's TopDocs without. I set it to 15 and reflect that change in the book.

Yeah, thank you for doing that. Doing infinite results by default seems like a poor choice anyway.
One of my old comments in the code you removed seems to indicate that I was already a bit doubtful about this choice back then, not sure what made me go forward with it. :/

I don't index README as I don't think it's really useful: it will serve few use case. But if you want I can work on that.

No worries, my guess is that it can be added later if the need for it comes. You're right that the combination of the crate name, categories and keywords should be enough (making the crate more easily discoverable is their primary purpose already, so crate authors are likely to use them appropriately).

There's a thing you should know, I use a crate I made, tantivy-analysis-contrib, to make the suggester work the way I intended to (doing the "word starts with XXX") as Tantivy provide's a ngram only as tokenizer not as a filter.

That's not an issue, I am completely fine with that, but thanks for the heads up.
Nice work on that crate by the way, it seems like it adds some useful and non-trivial tokenizers.

I'll jump on to reviewing the PR, I'll try to have it completed either tonight or tomorrow.

Dalvany commented 1 year ago

A thing that I think I didn't describe: we can restrict search to a specific field by typing

field:text

For example looking for crates that contains the keyword log :

keywords:log

Hirevo commented 1 year ago

A thing that I think I didn't describe: we can restrict search to a specific field by typing
field:text
For example looking for crates that contains the keyword log :
keywords:log

Oh wow, I didn't notice that, that's really cool !

Dalvany commented 1 year ago

Thanks, I'll try to make the changes this week

Dalvany commented 1 year ago

I thought there was a comment on the zero result on empty search. It should be easy to solve using an AllQuery. I can make the change if you wish.

Dalvany commented 1 year ago

I applied everything, I hope I didn't missed something. I added a commit to handle empty query using an AllQuery. I made a few manual check to see if everything is correct :

typing in the search box to see if suggestion work properly
search for "log" in the frontend
running an empty search. Perhaps we should improve that by sorting alphabetically (but I don't know how to do that with tantivy, perhaps using IndexSortByField in index settings, but that's not done at query time).
call api/v1/search?q=log

Dalvany commented 1 year ago

Perhaps it would be a goof thing to add a few explanation on search in the book :

though it search also through keywords and category, it's mostly designed to search by crate's name
an or is made between words (it Tantivy's default see set conjonction)
explaining how to "filter" by keywords and categories (eg. actix +categories:"web")
perhaps adding a link to QueryParser documentation

Dalvany commented 1 year ago

Hello @Hirevo, I made all the changes

Dalvany commented 1 year ago

Hello, don't worry :-) Don't hesitate to ping me if there's something wrong or anything.

Hirevo / alexandrie

Add tantivy index #161

Configuration

Search

Implementation

`fts` module

Indices

How to index

Things that still need work

Hirevo / alexandrie

Add tantivy index #161

Configuration

Search

Implementation

fts module

Indices

How to index

Things that still need work

`fts` module