blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0
10.01k stars 678 forks source link

Non deterministic order of results when requesting larger number of retrieved documents #1677

Open nopper opened 2 years ago

nopper commented 2 years ago

Hi there,

I am experiencing an issue where the list of hits returned is non deterministic and it's breaking some of our unit tests.

If I specify a size=10 in NewSearchRequestOptions everything seems to work fine, but as soon I increase the size to something like 100 or even 30 the returned results are not in a deterministic order anymore.

This happens also when using a standard file backed index, although the testcase below uses the in-memory index. Here's a simple golang test to reproduce the issue:

package whatever

import (
    "crypto/md5"
    "encoding/hex"
    "fmt"
    "math/rand"
    "strings"
    "testing"

    "github.com/blevesearch/bleve/v2"
)

func init() {
    rand.Seed(42)
}

type BleveDocument struct {
    Title string
}

type Babbler struct {
    Count     int
    Separator string
    Words     []string
}

func NewBabbler() (b Babbler) {
    b.Count = 10
    b.Separator = " "
        // Random set of 100 words
    b.Words = []string{
        "instincts",
        "psyche",
        "circlets",
        "motleys",
        "meanness",
        "frankly",
        "countryman",
        "crunchier",
        "banned",
        "charley",
        "tortuously",
        "sherbets",
        "amusements",
        "yearn",
        "barrens",
        "stylistically",
        "furl",
        "chengdu",
        "discharging",
        "crests",
        "lampreys",
        "parapsychology",
        "imbecilities",
        "preshrinks",
        "dolorous",
        "sera",
        "pregnant",
        "olduvai",
        "lightninged",
        "prophylaxis",
        "gnashing",
        "slipping",
        "medleys",
        "ghana",
        "leaner",
        "hilary",
        "steinem",
        "pronghorn",
        "groom",
        "erato",
        "wahhabi",
        "calumniated",
        "cancelation",
        "indeterminate",
        "bedbugs",
        "ailed",
        "inchon",
        "chauffeurs",
        "miaow",
        "lazybones",
        "lapps",
        "detours",
        "pinpointed",
        "guardsman",
        "yarmulke",
        "honied",
        "display",
        "fur",
        "kiddos",
        "overstayed",
        "farts",
        "sweatshirts",
        "issuance",
        "stewardess",
        "smuggled",
        "jimmy",
        "unprincipled",
        "beethoven",
        "irrepressible",
        "reminiscences",
        "sojourns",
        "viciously",
        "coauthored",
        "chid",
        "soothed",
        "patienter",
        "anticlimaxes",
        "commemoration",
        "nanobot",
        "attractiveness",
        "curates",
        "uninformed",
        "scrutinize",
        "delightful",
        "caterers",
        "louts",
        "incubi",
        "shaw",
        "dazzles",
        "comestibles",
        "overdrive",
        "daring",
        "dtp",
        "nonpayments",
        "bandied",
        "carnivore",
        "bird",
        "cinerama",
        "profiles",
        "albuquerque",
    }

    return
}

func (this Babbler) Babble() string {
    pieces := []string{}
    for i := 0; i < this.Count; i++ {
        pieces = append(pieces, this.Words[rand.Int()%len(this.Words)])
    }

    return strings.Join(pieces, this.Separator)
}

func GetMD5Hash(text string) string {
    hash := md5.Sum([]byte(text))
    return hex.EncodeToString(hash[:])
}

func TestBleveDeterministic(t *testing.T) {
    t.Parallel()

    var hashSet = make(map[string]bool)

    for iteration := 0; iteration < 10; iteration += 1 {
        rand.Seed(42)
        babbler := NewBabbler()
        indexMapping := bleve.NewIndexMapping()

        documentMapping := bleve.NewDocumentMapping()
        indexMapping.AddDocumentMapping("Document", documentMapping)

        titleFieldMapping := bleve.NewTextFieldMapping()
        titleFieldMapping.Store = true
        documentMapping.AddFieldMappingsAt("Title", titleFieldMapping)
        index, err := bleve.NewUsing("", indexMapping, bleve.Config.DefaultIndexType, bleve.Config.DefaultMemKVStore, nil)
        if err != nil {
            t.Errorf("Error")
            return
        }
        batch := index.NewBatch()

        for i := 0; i < 50000; i++ {
            docId := fmt.Sprintf("doc-%d", i)
            err = batch.Index(docId, BleveDocument{
                Title: babbler.Babble(),
            })
            if err != nil {
                t.Errorf("Error")
                return
            }
        }
        err = index.Batch(batch)
        if err != nil {
            t.Errorf("Error")
            return
        }

        query := bleve.NewMatchQuery(babbler.Babble())
        searchRequest := bleve.NewSearchRequestOptions(query, int(100), 0, true)
        searchRequest.SortBy([]string{"-_score", "Title"})
        searchRequest.Fields = append(searchRequest.Fields, "Title")
        searchResults, err := index.Search(searchRequest)
        if err != nil {
            t.Errorf("Error")
            return
        }

        results := []string{}
        for _, hit := range searchResults.Hits {
            results = append(results, hit.Fields["Title"].(string))
        }

        md5hash := GetMD5Hash(strings.Join(results, "\n"))
        t.Logf("MD5 for iteration %d is = %s\n", iteration, md5hash)

        hashSet[md5hash] = true
    }

    if len(hashSet) > 1 {
        t.Error("Non deterministic results")
    }
}
Thejas-bhat commented 2 years ago

This issue is because for a larger size of results retrieved, there are more groups of documents retrieved with the same score. So, given the same score, the order between these documents can vary in random fashion, affecting the strings.Join(results, "\n") value, thereby the md5hash value.

For example, size = 35 and towards the end of the hit list on the 10th iteration the order looks like this

... 2022/05/05 10:52:20 score 0.8405445768932949, doc-970 2022/05/05 10:52:20 score 0.8405445768932949 doc-9262 2022/05/05 10:52:20 score 0.8405445768932949 doc-42708 ...

and at the around same relative position, on the 9th iteration the relative order between these hits look like

... 2022/05/05 10:52:18 score 0.8405445768932949 doc-42708 2022/05/05 10:52:18 score 0.8405445768932949 doc-970 2022/05/05 10:52:18 score 0.8405445768932949 doc-9262 ...

Solution for this would be to change the sortBy argument to something that's unique like _id i.e., searchRequest.SortBy([]string{"-_score", "_id"})

This ensures that the order is unique and deterministic. Please let us know if this works out for you.

nopper commented 2 years ago

That's strange. In my setup the Title should be unique for each document, and was expecting that this should be enough to break ties (this condition is not enforced in the test example I provided, but it should be easy to add and verify that this is not enough to pass the test). Why this isn't the case?

Unfortunately I can't use _id. In our test setup we are building the index on the fly at startup and a streaming order of documents is not guaranteed, so enforcing a sort by the internal docid doesn't make sense for us.

Do you have any other suggestion? I was thinking to use some sort of _random ordering with a fixed seed as proposed in https://github.com/blevesearch/bleve/issues/1299 , although the PR was never merged.

nopper commented 2 years ago

I also tried with the following sorting options but the issue remains:

searchRequest.SortByCustom([]search.SearchSort{
    &search.SortField{Field: "-_score"},
    &search.SortField{Field: "Title", Type: search.SortFieldAsString},
})
nopper commented 2 years ago

@sreekanth-cb since you are the most active maintainer at this point. Is this expected?

abhinavdangeti commented 2 years ago

@nopper what you're expecting with the sort on two keys - ["score", "Title"] is perfectly reasonable. You're right - from your code that Title should be unique to help break ties in equal scores.

The problem I see in your code is from the Match Query over a 10-whitespace-separated-string-of-strings. Let's take an example for how a search term in your match query could look ..

"instincts psyche circlets motleys meanness frankly countryman crunchier banned charley"

You're not using any special analyzer, so the index and query defaults to using the standard analyzer - which tokenizes on unicode.

Now, if you look at the structure of the match query ..

type MatchQuery struct {
    Match     string             `json:"match"`
    FieldVal  string             `json:"field,omitempty"`
    Analyzer  string             `json:"analyzer,omitempty"`
    BoostVal  *Boost             `json:"boost,omitempty"`
    Prefix    int                `json:"prefix_length"`
    Fuzziness int                `json:"fuzziness"`
    Operator  MatchQueryOperator `json:"operator,omitempty"`
}

There is an Operator attribute - which defaults to "or". Now let me explain how the internals of the match query work.

I believe you're not seeing deterministic results because a disjunction query over all tokens generated from your match query is matching partial results that you wouldn't like matched.

Here's how I'd change your code to obtain deterministic results ..

                ...
        query := bleve.NewMatchQuery(babbler.Babble())
                 query.SetOperator(1)                // Set match operator to AND
        searchRequest := bleve.NewSearchRequestOptions(query, int(100), 0, true)
                ...

Optionally, you could look into using the keyword analyzer for your index which generates a single token for your text - this will be applicable to the index and the match query's search criteria.

nopper commented 2 years ago

Thanks for the detailed reply. Just for context an OR query is what I want here as I'd like to optimize for recall. Ideally I would like to have a SHOULD match, instead of MUST match, borrowing lucene terminology. Not sure if it's possible in bleve.

Anyway I still don't get why a union instead of an intersection should provide a stable sorting. I get the fact that you get fewer results as it's a more restrictive search. The degenerate case would be all documents contain a single word and you query for it. If you search for that token you are expecting to get the dataset sorted only by the set of constrains you specify in a deterministic way, right?

I also noticed in another test, that the explanation string and the final score might not be deterministic. This seems to be due to the fact that matching terms are in different order sometimes and the final score computation is affected by the ordering of the float operations.

abhinavdangeti commented 2 years ago

Ideally I would like to have a SHOULD match, instead of MUST match, borrowing lucene terminology. Not sure if it's possible in bleve.

Yes bleve does support the boolean query which comprises of the MUST, SHOULD and MUST_NOT clauses. https://github.com/blevesearch/bleve/blob/v2.3.2/search/query/boolean.go#L27...L33

Think I understand your use case better now. I see how the AND operation and even keyword analyzer is not what you're looking for. Either of these options would probably not even match any results.

Currently there isn't a way for a user to control the sort key chosen for a field if the field were an array of strings. So our advise at this point is to choose one such sort field which has a single element - be it numeric or text but unique to obtain deterministic results.

nopper commented 2 years ago

Hi @abhinavdangeti ,

In the end application I am currently using a query string as documented in http://blevesearch.com/docs/Query-String-Query/ . So the frontend app is actually generating a query like field1:term1 field1:term2 field2:term3 ... Is this equivalent to creating a conjunctive query with SHOULD constraints? or do I have to do something else? I don't see a way to specify AND/OR operations in the query string paradigm.

abhinavdangeti commented 2 years ago

A query string query essentially translates to a BOOLEAN query underneath the hood that I highlighted earlier.

A boolean query is a compound query that contains a MUST (conjunctions) clause, a SHOULD (disjunctions) clause and a MUST_NOT (negate of disjunctions) clause.

Hope this table will explain how conjunctions and disjunctions are interpreted for a query ..

Query string query Boolean query object
f1:term1 f2:term2 {"should": {"disjuncts": [{"field": "f1", "match": "term1"}, {"field": "f2", "match": "term2"}]}}
+f1:term1 +f2:term2 {"must": {"conjuncts": [{"field": "f1", "match": "term1"}, {"field": "f2", "match": "term2"}]}}
+f1:term1 f2:term2 {"must": {"conjuncts": [{"field": "f1", "match": "term1"}]}, "should": {"disjuncts": [{"field": "f2", "match": "term2"}]}}