Open nopper opened 2 years ago
This issue is because for a larger size of results retrieved, there are more groups of documents retrieved with the same score. So, given the same score, the order between these documents can vary in random fashion, affecting the strings.Join(results, "\n")
value, thereby the md5hash value.
For example, size = 35 and towards the end of the hit list on the 10th iteration the order looks like this
... 2022/05/05 10:52:20 score 0.8405445768932949, doc-970 2022/05/05 10:52:20 score 0.8405445768932949 doc-9262 2022/05/05 10:52:20 score 0.8405445768932949 doc-42708 ...
and at the around same relative position, on the 9th iteration the relative order between these hits look like
... 2022/05/05 10:52:18 score 0.8405445768932949 doc-42708 2022/05/05 10:52:18 score 0.8405445768932949 doc-970 2022/05/05 10:52:18 score 0.8405445768932949 doc-9262 ...
Solution for this would be to change the sortBy argument to something that's unique like _id
i.e., searchRequest.SortBy([]string{"-_score", "_id"})
This ensures that the order is unique and deterministic. Please let us know if this works out for you.
That's strange. In my setup the Title
should be unique for each document, and was expecting that this should be enough to break ties (this condition is not enforced in the test example I provided, but it should be easy to add and verify that this is not enough to pass the test). Why this isn't the case?
Unfortunately I can't use _id
. In our test setup we are building the index on the fly at startup and a streaming order of documents is not guaranteed, so enforcing a sort by the internal docid doesn't make sense for us.
Do you have any other suggestion? I was thinking to use some sort of _random
ordering with a fixed seed as proposed in https://github.com/blevesearch/bleve/issues/1299 , although the PR was never merged.
I also tried with the following sorting options but the issue remains:
searchRequest.SortByCustom([]search.SearchSort{
&search.SortField{Field: "-_score"},
&search.SortField{Field: "Title", Type: search.SortFieldAsString},
})
@sreekanth-cb since you are the most active maintainer at this point. Is this expected?
@nopper what you're expecting with the sort on two keys - ["score", "Title"] is perfectly reasonable. You're right - from your code that Title should be unique to help break ties in equal scores.
The problem I see in your code is from the Match Query over a 10-whitespace-separated-string-of-strings. Let's take an example for how a search term in your match query could look ..
"instincts psyche circlets motleys meanness frankly countryman crunchier banned charley"
You're not using any special analyzer, so the index and query defaults to using the standard
analyzer - which tokenizes on unicode.
Now, if you look at the structure of the match query ..
type MatchQuery struct {
Match string `json:"match"`
FieldVal string `json:"field,omitempty"`
Analyzer string `json:"analyzer,omitempty"`
BoostVal *Boost `json:"boost,omitempty"`
Prefix int `json:"prefix_length"`
Fuzziness int `json:"fuzziness"`
Operator MatchQueryOperator `json:"operator,omitempty"`
}
There is an Operator attribute - which defaults to "or". Now let me explain how the internals of the match query work.
standard
analyzer. The standard analyzer tokenizes the search criteria on unicode - which means it generates 10 tokens (separating on whitespace).or
, a disjunction ensues on the tokens generated. The operator can be overridden to and
in which case a conjunction ensues on the tokens generated.I believe you're not seeing deterministic results because a disjunction query over all tokens generated from your match query is matching partial results that you wouldn't like matched.
Here's how I'd change your code to obtain deterministic results ..
...
query := bleve.NewMatchQuery(babbler.Babble())
query.SetOperator(1) // Set match operator to AND
searchRequest := bleve.NewSearchRequestOptions(query, int(100), 0, true)
...
Optionally, you could look into using the keyword
analyzer for your index which generates a single token for your text - this will be applicable to the index and the match query's search criteria.
Thanks for the detailed reply. Just for context an OR query is what I want here as I'd like to optimize for recall. Ideally I would like to have a SHOULD match, instead of MUST match, borrowing lucene terminology. Not sure if it's possible in bleve.
Anyway I still don't get why a union instead of an intersection should provide a stable sorting. I get the fact that you get fewer results as it's a more restrictive search. The degenerate case would be all documents contain a single word and you query for it. If you search for that token you are expecting to get the dataset sorted only by the set of constrains you specify in a deterministic way, right?
I also noticed in another test, that the explanation string and the final score might not be deterministic. This seems to be due to the fact that matching terms are in different order sometimes and the final score computation is affected by the ordering of the float operations.
Ideally I would like to have a SHOULD match, instead of MUST match, borrowing lucene terminology. Not sure if it's possible in bleve.
Yes bleve does support the boolean query which comprises of the MUST, SHOULD and MUST_NOT clauses. https://github.com/blevesearch/bleve/blob/v2.3.2/search/query/boolean.go#L27...L33
Think I understand your use case better now. I see how the AND operation and even keyword analyzer is not what you're looking for. Either of these options would probably not even match any results.
Currently there isn't a way for a user to control the sort key chosen for a field if the field were an array of strings. So our advise at this point is to choose one such sort field which has a single element - be it numeric or text but unique to obtain deterministic results.
Hi @abhinavdangeti ,
In the end application I am currently using a query string as documented in http://blevesearch.com/docs/Query-String-Query/ . So the frontend app is actually generating a query like field1:term1 field1:term2 field2:term3 ..
. Is this equivalent to creating a conjunctive query with SHOULD constraints? or do I have to do something else? I don't see a way to specify AND/OR operations in the query string paradigm.
A query string query essentially translates to a BOOLEAN query underneath the hood that I highlighted earlier.
A boolean query is a compound query that contains a MUST (conjunctions) clause, a SHOULD (disjunctions) clause and a MUST_NOT (negate of disjunctions) clause.
Hope this table will explain how conjunctions and disjunctions are interpreted for a query ..
Query string query | Boolean query object |
---|---|
f1:term1 f2:term2 |
{"should": {"disjuncts": [{"field": "f1", "match": "term1"}, {"field": "f2", "match": "term2"}]}} |
+f1:term1 +f2:term2 |
{"must": {"conjuncts": [{"field": "f1", "match": "term1"}, {"field": "f2", "match": "term2"}]}} |
+f1:term1 f2:term2 |
{"must": {"conjuncts": [{"field": "f1", "match": "term1"}]}, "should": {"disjuncts": [{"field": "f2", "match": "term2"}]}} |
Hi there,
I am experiencing an issue where the list of hits returned is non deterministic and it's breaking some of our unit tests.
If I specify a
size=10
in NewSearchRequestOptions everything seems to work fine, but as soon I increase the size to something like100
or even30
the returned results are not in a deterministic order anymore.This happens also when using a standard file backed index, although the testcase below uses the in-memory index. Here's a simple golang test to reproduce the issue: