blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

Trigram assisted wildcard query and general index efficiency #1794

Open CMajeri opened 1 year ago

CMajeri commented 1 year ago

Hello, I'm new to bleve and am trying to use it for simple substring search, i.e. the equivalent of the SQL predicate LIKE '%<some_word>%'. A typical way to achieve this is with trigrams: first select all entries that contain every trigram of the search term, then apply a second, exact filter to those candidates. I tried to replicate this in the beer-search example and came up with this:

    // Stage 1: require every trigram of "pale" to be present in "name_tri".
    q := query.NewMatchQuery("pale")
    q.SetField("name_tri")
    q.SetOperator(query.MatchQueryOperatorAnd)
    // Stage 2: AND the trigram match with a wildcard query to drop false positives.
    req := bleve.NewSearchRequest(query.NewConjunctionQuery([]query.Query{
        q,
        query.NewWildcardQuery("*pale*"),
    }))
    req.Fields = []string{"name"}
    req.SortBy([]string{"name"})
    req.Size = 1000
    res, err := beerIndex.Search(req)
    if err != nil {
        panic(err)
    }
    // Print every hit whose name does not actually contain "pale" (a false positive).
    for _, r := range res.Hits {
        if !strings.Contains(strings.ToLower(r.Fields["name"].(string)), "pale") {
            fmt.Println(r.Fields["name"])
        }
    }

where "name_tri" is a text mapping that uses a trigram token filter. This works perfectly and prints no rows. For comparison, without the wildcard query, it prints:

Cow Palace Scotch Ale
Cow Palace Scotch Ale 1998
Cow Palace Scotch Ale 2000
Cow Palace Scotch Ale 2001
Lone Palm Ale
Palm Speciale

which is expected.
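To make the two-stage idea concrete, here is a minimal standalone sketch in plain Go. It does not use bleve and is not how bleve implements anything; it only illustrates why names such as "Cow Palace Scotch Ale" pass the trigram stage ("pal" comes from "palace", "ale" from "ale") and are then rejected by the exact check. The helper names (wordTrigrams, isCandidate, substringSearch) and the last two sample names are made up for illustration, and trigrams are taken per whitespace-separated word on the assumption that the analyzer tokenizes before applying the trigram filter.

```go
package main

import (
	"fmt"
	"strings"
)

// wordTrigrams returns the set of 3-character substrings of every
// whitespace-separated, lowercased word in s.
func wordTrigrams(s string) map[string]bool {
	grams := make(map[string]bool)
	for _, w := range strings.Fields(strings.ToLower(s)) {
		r := []rune(w)
		for i := 0; i+3 <= len(r); i++ {
			grams[string(r[i:i+3])] = true
		}
	}
	return grams
}

// isCandidate reports whether name contains every trigram of term.
// This is the cheap first stage and may admit false positives.
func isCandidate(name, term string) bool {
	have := wordTrigrams(name)
	for g := range wordTrigrams(term) {
		if !have[g] {
			return false
		}
	}
	return true
}

// substringSearch runs the trigram stage, then verifies each candidate
// with an exact, case-insensitive substring check.
func substringSearch(names []string, term string) (candidates, matches []string) {
	for _, n := range names {
		if !isCandidate(n, term) {
			continue
		}
		candidates = append(candidates, n)
		if strings.Contains(strings.ToLower(n), strings.ToLower(term)) {
			matches = append(matches, n)
		}
	}
	return candidates, matches
}

func main() {
	names := []string{
		"Cow Palace Scotch Ale",
		"Lone Palm Ale",
		"Palm Speciale",
		"Hypothetical Pale Ale", // made-up name, a true match
		"Some Stout",            // made-up name, no shared trigrams
	}
	candidates, matches := substringSearch(names, "pale")
	fmt.Println("candidates:", candidates)
	fmt.Println("matches:", matches)
}
```

In this sketch the first stage keeps four of the five names and the verification stage keeps only the one that really contains "pale", which mirrors the role the wildcard query plays in the conjunction above.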

However, I'm unfamiliar with how bleve performs its indexing and was wondering how well it combines these filters. Will the presence of the wildcard query completely negate the benefit of filtering through trigrams? More generally, how does bleve combine filters, and is there any documentation on the subject besides the code? I'd also be particularly interested in how sorting works, especially when paginating results (i.e. running the same query many times with different limits and offsets).

Thanks.

abhinavdangeti commented 1 year ago

@CMajeri here are a few details that should help answer your questions ..