Open cbuescher opened 5 years ago
Pinging @elastic/es-search
As a simple recreation, consider this example:
PUT test_index
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"name_completion": {
"type": "completion",
"analyzer": "my_analyzer"
}
}
},
"settings": {
"index": {
"number_of_shards": "2",
"analysis": {
"filter": {
"my_synonym": {
"type": "synonym",
"synonyms": [
"meyer => meyer, m",
"mueller => mueller, m",
"mann => mann, m",
"meier => meier, m",
"murnau => murnau, m",
"munch => munch, m",
"myerz => myerz, m",
"mohn => mohn, m",
"mahler => mahler, m"
]
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"my_synonym"
],
"tokenizer": "whitespace"
}
}
}
}
}
}
PUT /_bulk
{ "index" : { "_index" : "test_index", "_id" : "1" } }
{"name": "anna meyer", "name_completion": {"input": "anna meyer", "weight": 1}}
{ "index" : { "_index" : "test_index", "_id" : "2" } }
{"name": "anna mueller", "name_completion": {"input": "anna mueller", "weight": 2}}
{ "index" : { "_index" : "test_index", "_id" : "3" } }
{"name": "anna mann", "name_completion": {"input": "anna mann", "weight": 3}}
{ "index" : { "_index" : "test_index", "_id" : "4" } }
{"name": "anna murnau", "name_completion": {"input": "anna murnau", "weight": 4}}
{ "index" : { "_index" : "test_index", "_id" : "5" } }
{"name": "anna munch", "name_completion": {"input": "anna munch", "weight": 5}}
{ "index" : { "_index" : "test_index", "_id" : "6" } }
{"name": "anna myerz", "name_completion": {"input": "anna myerz", "weight": 6}}
{ "index" : { "_index" : "test_index", "_id" : "7" } }
{"name": "anna mohn", "name_completion": {"input": "anna mohn", "weight": 7}}
{ "index" : { "_index" : "test_index", "_id" : "8" } }
{"name": "anna mahler", "name_completion": {"input": "anna mahler", "weight": 8}}
{ "index" : { "_index" : "test_index", "_id" : "9" } }
{"name": "anna meier", "name_completion": {"input": "anna meier", "weight": 9}}
On at least 7.3, getting the top 5 suggestions for "anna":
POST /test_index/_search
{
"suggest": {
"test-suggest" : {
"prefix" : "anna",
"completion" : {
"field" : "name_completion",
"size": 5
}
}
}
}
returns the following five suggestions by _id: 9, 8, 7, 6, 4, in descending order of weight. However, doc 5 with weight 5 should appear before doc 4. Increasing "size" to 10 returns document ids 9 to 2 in the correct order but is still missing doc 1, which only appears when querying for more than 12 suggestions.
I opened https://github.com/apache/lucene-solr/pull/913 to fix some of the underlying issues in Lucene. We'd also need to change our own TopSuggestGroupDocsCollector#collect method to correctly signal document rejections after that change.
Hi team, can we have this fix backported to 6.x (e.g. 6.8)?
Pinging @elastic/es-search (Team:Search)
Pinging @elastic/es-search-relevance (Team:Search Relevance)
It was observed that when restricting completion suggestions to a certain size n, the top suggestions returned can miss suggestions that do appear in the top n results when querying with a larger return window (by increasing size). This was observed in particular when there was more than one shard and the analyzer on the suggest field produced multiple tokens at the same position. This can create multiple paths in the Lucene suggester data structure that lead to the same doc id.
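To make the effect concrete, here is a toy Python model of the behaviour described above. It is only a sketch, not Lucene's actual traversal. It assumes every document is reachable via two paths of equal weight (for example "anna meyer" and "anna m"), and that a naive per-shard collector counts each path against size before duplicates are dropped. The shard assignment and the names shard_top_n, suggest and reject_duplicates are hypothetical, chosen for illustration. The real routing of the nine documents is not known from this issue.

```python
def shard_top_n(docs, size, paths_per_doc=2, reject_duplicates=False):
    """Collect up to `size` suggestions on one shard.

    docs is a list of (doc_id, weight). Each doc is reachable through
    `paths_per_doc` paths with identical weight (toy assumption).
    """
    paths = sorted(
        [(weight, doc_id) for doc_id, weight in docs] * paths_per_doc,
        reverse=True,
    )
    hits, seen, slots_used = [], set(), 0
    for weight, doc_id in paths:
        if slots_used == size:
            break
        if doc_id in seen:
            if reject_duplicates:
                continue  # rejected path does not count towards size
            slots_used += 1  # naive: the duplicate path still eats a slot
            continue
        seen.add(doc_id)
        hits.append((weight, doc_id))
        slots_used += 1
    return hits


def suggest(shards, size, reject_duplicates=False):
    """Merge per-shard results and keep the global top `size` doc ids."""
    merged = []
    for docs in shards:
        merged.extend(
            shard_top_n(docs, size, reject_duplicates=reject_duplicates)
        )
    merged.sort(reverse=True)
    return [doc_id for _, doc_id in merged[:size]]


# Hypothetical routing of the nine example docs (weight == _id) to 2 shards.
shard_a = [(9, 9), (8, 8), (7, 7), (5, 5)]
shard_b = [(6, 6), (4, 4), (3, 3), (2, 2), (1, 1)]

naive = suggest([shard_a, shard_b], size=5)
fixed = suggest([shard_a, shard_b], size=5, reject_duplicates=True)
print(naive)  # [9, 8, 7, 6, 4]
print(fixed)  # [9, 8, 7, 6, 5]
```

Under these assumptions the naive collector drops doc 5 because duplicate paths for docs 9 and 8 use up slots on its shard, which matches the size 5 result in the recreation. The model is not meant to reproduce the size 10 behaviour. The reject_duplicates switch only loosely mirrors the idea of signalling rejected documents to the collector.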