This pull request aims to address issue #19 by adding a Tantivy index.
While there is still room for improvement, it should serve as a first step.
Configuration
Add a [search] section with a directory field that specifies where Tantivy should store its files.
[search]
directory = "/tmp/alexandrie/tantivy"
Search
Because it uses QueryParser, you can use the full Tantivy query language.
By default, search uses all fields except name.prefix (see below), and the suggester searches among name, name.full and name.prefix (see below).
Implementation
fts module
This module contains all full-text search related structures.
The Tantivy structure handles all the boilerplate needed to set up an index, search and suggest. It also delegates the methods that index a document and commit documents.
TantivyDocument is a structure that represents a crate and can be converted into a Tantivy Document.
Indices
A crate's name is indexed multiple times to improve the relevance of both suggester and search results.
name: a simple tokenized version of the crate's name:
tokenize on non-alphanumeric characters using SimpleTokenizer
apply English stop words
apply lower-casing to make search case-insensitive
name.full: not tokenized, only lower-cased. Its main purpose is to increase relevance when the searched text exactly matches a crate name.
name.prefix: indexes word prefixes to support the suggester:
tokenize on non-alphanumeric characters
apply lower-casing
apply a custom edge n-gram filter to index word prefixes.
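To make the three name pipelines concrete, here is a minimal standalone sketch in plain Rust. It does not use the Tantivy API: the function names, the tiny stop word list and the min/max gram sizes are all assumptions made for illustration. The sketch lower-cases before removing stop words so that a capitalized "The" is also dropped.

```rust
/// A tiny, hypothetical subset of English stop words (illustration only).
const STOP_WORDS: [&str; 6] = ["a", "an", "and", "of", "the", "to"];

/// Split on every non-alphanumeric character and drop empty pieces.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

/// Sketch of the `name` pipeline: tokenize, lower-case, remove stop words.
fn analyze_name(text: &str) -> Vec<String> {
    tokenize(text)
        .into_iter()
        .map(|t| t.to_lowercase())
        .filter(|t| !STOP_WORDS.contains(&t.as_str()))
        .collect()
}

/// Sketch of the `name.full` pipeline: one lower-cased token, no splitting.
fn analyze_name_full(text: &str) -> Vec<String> {
    vec![text.to_lowercase()]
}

/// Edge n-grams of one token: its prefixes from `min` to `max` characters.
/// `min` is expected to be at least 1.
fn edge_ngrams(token: &str, min: usize, max: usize) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let upper = max.min(chars.len());
    (min..=upper).map(|n| chars[..n].iter().collect()).collect()
}

/// Sketch of the `name.prefix` pipeline at index time:
/// tokenize, lower-case, then expand each token into its prefixes.
fn analyze_name_prefix(text: &str, min: usize, max: usize) -> Vec<String> {
    tokenize(text)
        .into_iter()
        .map(|t| t.to_lowercase())
        .flat_map(|t| edge_ngrams(&t, min, max))
        .collect()
}
```

For example, `analyze_name_prefix("Serde_JSON", 1, 3)` produces the terms `s`, `se`, `ser`, `j`, `js`, `jso`, which is what lets a partially typed name match in the suggester.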
Other fields that are indexed:
categories are indexed using the same pipeline as name.full, as they should come from a precise list
keywords are indexed using the same pipeline as name, as they are free text
description and readme use the same pipeline as name.
Note that at search time, we should not apply the edge n-gram filter, to reduce noise.
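The sketch below illustrates why the edge n-gram filter should be skipped at search time. It is a toy in-memory stand-in for the name.prefix field, not Tantivy code: `PrefixIndex`, its methods and the maximum gram size of 10 are hypothetical names and values chosen for the example.

```rust
use std::collections::{BTreeSet, HashMap};

/// Lower-cased tokens, split on non-alphanumeric characters.
fn tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Prefixes of `token` from 1 up to `max` characters.
fn edge_ngrams(token: &str, max: usize) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    (1..=max.min(chars.len()))
        .map(|n| chars[..n].iter().collect())
        .collect()
}

/// Hypothetical stand-in for the `name.prefix` field:
/// maps each indexed term to the crate names that produced it.
struct PrefixIndex {
    max_gram: usize,
    postings: HashMap<String, BTreeSet<String>>,
}

impl PrefixIndex {
    fn new(max_gram: usize) -> Self {
        PrefixIndex { max_gram, postings: HashMap::new() }
    }

    /// Index time: every token is expanded into its edge n-grams.
    fn add(&mut self, crate_name: &str) {
        for token in tokens(crate_name) {
            for gram in edge_ngrams(&token, self.max_gram) {
                self.postings
                    .entry(gram)
                    .or_default()
                    .insert(crate_name.to_string());
            }
        }
    }

    /// Union of the crates matching any of the given terms.
    fn lookup<I: IntoIterator<Item = String>>(&self, terms: I) -> BTreeSet<String> {
        terms
            .into_iter()
            .filter_map(|term| self.postings.get(&term))
            .flatten()
            .cloned()
            .collect()
    }

    /// Search time as recommended: tokenize and lower-case only.
    fn suggest(&self, query: &str) -> BTreeSet<String> {
        self.lookup(tokens(query))
    }

    /// Search time with the edge n-gram filter also applied to the query
    /// (shown only to demonstrate the noise it introduces).
    fn suggest_with_query_ngrams(&self, query: &str) -> BTreeSet<String> {
        let max = self.max_gram;
        self.lookup(tokens(query).into_iter().flat_map(move |t| edge_ngrams(&t, max)))
    }
}

/// Small sample index used in the usage note.
fn sample_index() -> PrefixIndex {
    let mut index = PrefixIndex::new(10);
    for name in ["serde", "sqlx", "tokio"].iter() {
        index.add(name);
    }
    index
}
```

With the sample index, `suggest("Ser")` returns only `serde`, while `suggest_with_query_ngrams("Ser")` also returns `sqlx`, because the query is expanded to `s`, `se`, `ser` and the single-letter term `s` matches every crate starting with that letter.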
How to index
When Alexandrie starts, it indexes everything.
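As a rough illustration of the startup flow, the sketch below stages every crate and then commits once. It is a plain Rust stand-in under stated assumptions: `CrateDoc`, `Indexer` and `index_all_at_startup` are hypothetical names, not the types in this pull request. It only models the fact that documents added to a Tantivy index become searchable after a commit.

```rust
/// Hypothetical crate record; the real document carries more fields.
struct CrateDoc {
    name: String,
}

/// Hypothetical in-memory stand-in for the index wrapper: documents are
/// staged by `index_document` and only become searchable after `commit`.
#[derive(Default)]
struct Indexer {
    staged: Vec<CrateDoc>,
    searchable: Vec<CrateDoc>,
}

impl Indexer {
    /// Stage one document; it is not visible to searches yet.
    fn index_document(&mut self, doc: CrateDoc) {
        self.staged.push(doc);
    }

    /// Make all staged documents searchable and return how many were committed.
    fn commit(&mut self) -> usize {
        let committed = self.staged.len();
        self.searchable.append(&mut self.staged);
        committed
    }

    /// True if a committed document has exactly this name.
    fn contains(&self, name: &str) -> bool {
        self.searchable.iter().any(|doc| doc.name == name)
    }
}

/// Index every known crate once at startup, then commit a single time.
fn index_all_at_startup<I>(indexer: &mut Indexer, crates: I) -> usize
where
    I: IntoIterator<Item = CrateDoc>,
{
    for doc in crates {
        indexer.index_document(doc);
    }
    indexer.commit()
}
```

Committing once after the loop, rather than after every document, keeps the startup pass to a single visible update of the index.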
Things that still need work
[x] Actually running the indexer endpoint causes a 500 HTTP error when trying to access the UI. It comes from a lock on the database, since I browse all crates for indexing in a single transaction. Use the run method and index at startup instead of in an endpoint.
[ ] Need to change the API search endpoint, as I only changed the frontend search
[ ] New crates aren't indexed yet
[ ] Though the field exists in Tantivy, readmes aren't indexed