blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

Unexpected "No Matches" in FuzzyQuery #2057

Closed Simerax closed 1 month ago

Simerax commented 1 month ago

I'm a little confused about why certain words don't match.

In particular I noticed that the word "Security" is not found in a simple fuzzy query and I don't understand why. I used the benchmark_data.txt as document content.

package main

import (
    "fmt"
    "os"

    _ "embed"

    "github.com/blevesearch/bleve/v2/analysis/lang/en"
    "github.com/blevesearch/bleve/v2"
)

const indexDir = "./myindex"

//go:embed benchmark_data.txt
var docContent string

type Document struct {
    Title   string
    Content string
}

func (d Document) Type() string {
    return "document"
}

func main() {
    os.RemoveAll(indexDir)

    docMapping := bleve.NewDocumentStaticMapping()
    docMapping.AddFieldMappingsAt("Content", bleve.NewTextFieldMapping())
    docMapping.AddFieldMappingsAt("Title", bleve.NewTextFieldMapping())

    mapping := bleve.NewIndexMapping()
    mapping.DefaultAnalyzer = en.AnalyzerName

    mapping.AddDocumentMapping("document", docMapping)
    index, err := bleve.New(indexDir, mapping)
    if err != nil {
        panic(err)
    }

    if err := index.Index("1", Document{
        Title:   "boiling liquid expanding vapour explosion",
        Content: docContent,
    }); err != nil {
        panic(err)
    }

    queries := []string{
        "Security",
        "Securit",
        "securit",
        "securi",
        "Burma",
    }

    for _, query := range queries {
        sr := bleve.NewSearchRequest(bleve.NewFuzzyQuery(query))
        results, err := index.Search(sr)
        if err != nil {
            panic(err)
        }
        fmt.Printf("Fuzzy Query: %s - Hits: %d\n", query, len(results.Hits))
    }

    fmt.Println() //newline
    for _, query := range queries {
        sr := bleve.NewSearchRequest(bleve.NewMatchQuery(query))
        results, err := index.Search(sr)
        if err != nil {
            panic(err)
        }
        fmt.Printf("Match Query: %s - Hits: %d\n", query, len(results.Hits))
    }
}

Output:

Fuzzy Query: Security - Hits: 0
Fuzzy Query: Securit - Hits: 0
Fuzzy Query: securit - Hits: 0
Fuzzy Query: securi - Hits: 1
Fuzzy Query: Burma - Hits: 1

Match Query: Security - Hits: 1
Match Query: Securit - Hits: 0
Match Query: securit - Hits: 0
Match Query: securi - Hits: 0
Match Query: Burma - Hits: 1

As you can see, when I use a fuzzy query, the exact term Security is not matched. Even very close "fuzzy" words (such as Securit) do not match. However, securi does match. Another random word, Burma, is matched exactly.

When running a match query it matches Security and Burma - as expected.

I don't quite understand why the fuzzy query only matches securi and not the "more exact" words.

Is this a bug, or am I doing something wrong?

abhinavdangeti commented 1 month ago

@Simerax Multiple things are in play here:

  1. If you use the en analyzer over your text, it applies the to_lower token filter, and the stemmer_en_snowball filter stems words down to their root forms before indexing. For Security, that means its root word secur is what gets indexed. As for your query: fuzzy is what we call a "non-analytic" query, meaning no analysis is applied to it, so what you search for is looked up as is. Case is preserved, and there is no stemming.
  2. You're not setting the field for the fuzzy query, so we'll look for the search term in the composite _all field, which I believe your index mapping already sets up automatically, so all good here.
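To make the numbers in the output concrete, here is a minimal sketch that computes the plain Levenshtein edit distance between each raw query string and the term that would be indexed. It is not bleve's implementation. It assumes a fuzziness of 1 (which is, to my knowledge, what NewFuzzyQuery defaults to), takes secur as the indexed form of Security as described above, and assumes Burma is indexed as the lowercased burma. The helper names levenshtein and min3 are made up for this illustration.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the number of single-character insertions,
// deletions and substitutions needed to turn a into b.
// The comparison is case-sensitive, like a non-analyzed query.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

func main() {
	const fuzziness = 1 // assumed default for a fuzzy query

	// Raw query string paired with the term assumed to be in the index.
	cases := []struct{ query, indexed string }{
		{"Security", "secur"},
		{"Securit", "secur"},
		{"securit", "secur"},
		{"securi", "secur"},
		{"Burma", "burma"},
	}
	for _, c := range cases {
		d := levenshtein(c.query, c.indexed)
		fmt.Printf("%-8s vs %-5s distance=%d withinFuzziness=%v\n",
			c.query, c.indexed, d, d <= fuzziness)
	}
}
```

Under these assumptions the distances come out as 4, 3, 2, 1 and 1. Only securi and Burma fall within one edit of an indexed term, which lines up with the fuzzy query hits reported above.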

So here're some recommendations you can choose from -

Simerax commented 1 month ago

Thank you for the thorough reply!

My plan is to build a simple documentation search based on bleve. I thought the "fuzzy" query would be the best fit for end users in a search prompt.

I think I did not understand the implications of the fuzzy query on a stemmed index. The MatchQuery seems to be a much better fit here.

I will play around with different combinations of analyzers and query types to see what works best.