elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Skip_duplicate in Autosuggest by custom-field #40397

Closed ravi-kumar-yadav closed 5 years ago

ravi-kumar-yadav commented 5 years ago

Elasticsearch version: Version: 6.5.0, Build: default/tar/816e6f6/2018-11-09T18:58:36.352602Z, JVM: 1.8.0_192

JVM version: openjdk version "11.0.2" 2019-01-15 OpenJDK Runtime Environment 18.9 (build 11.0.2+9) OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)

OS version: Darwin Ravi-2.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64

Category: Autosuggest

Scenario: The completion suggester's skip_duplicates option filters out duplicate suggestions. By default it de-duplicates on the suggestion text, but we need it to de-duplicate on a custom field present in the document, such as local_id in the example below.

Steps to reproduce: Insert the following docs, assuming the field suggest has type completion. All of the docs have the same input, i.e. Nevermind:

Create mapping:

curl -X PUT -H "Content-type: application/json" 'localhost:9200/music' -d '{ "mappings": { "_doc" : { "properties" : { "suggest" : { "type" : "completion" }, "local_id" : { "type": "long" } } } } }'

Insert docs:

curl -X PUT -H "Content-type: application/json" 'localhost:9200/music/_doc/1?refresh' -d '{ "local_id": 12, "suggest" : [ { "input": "Nevermind", "weight" : 10 } ] }'

curl -X PUT -H "Content-type: application/json" 'localhost:9200/music/_doc/2?refresh' -d '{ "local_id": 12, "suggest" : [ { "input": "Nevermind", "weight" : 9 } ] }'

curl -X PUT -H "Content-type: application/json" 'localhost:9200/music/_doc/3?refresh' -d '{ "local_id": 13, "suggest" : [ { "input": "Nevermind", "weight" : 6 } ] }'

Run the following query with skip_duplicates: true:

curl -X GET -H "Content-type: application/json" 'localhost:9200/music/_doc/_search?pretty' -d '{ "suggest": { "song-suggest": { "prefix": "never", "completion": { "field": "suggest", "skip_duplicates": true } } } }'

Actual/current behaviour: The response contains a single option. Because the filtering criterion is the suggestion text, it would be the document with the highest weight, i.e. the document with doc_id 1 (as it has weight 10):

{ "took" : 11, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : 0.0, "hits" : [ ] }, "suggest" : { "song-suggest" : [ { "text" : "never", "offset" : 0, "length" : 5, "options" : [ { "text" : "Nevermind", "_index" : "music", "_type" : "_doc", "_id" : "2", "_score" : 9.0, "_source" : { "local_id" : 12, "suggest" : [ { "input" : "Nevermind", "weight" : 9 } ] } } ] } ] } }

Expected behaviour: We would like to be able to de-duplicate by a user-requested (custom) field. For example, de-duplicating by the field local_id would return two documents: one per local_id group, each being the highest-weight document within its group.

{ "took" : 11, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : 0.0, "hits" : [ ] }, "suggest" : { "song-suggest" : [ { "text" : "never", "offset" : 0, "length" : 5, "options" : [ { "text" : "Nevermind", "_index" : "music", "_type" : "_doc", "_id" : "2", "_score" : 9.0, "_source" : { "local_id" : 12, "suggest" : [ { "input" : "Nevermind", "weight" : 9 } ] } } ] }, { "text" : "never", "offset" : 0, "length" : 5, "options" : [ { "text" : "Nevermind", "_index" : "music", "_type" : "_doc", "_id" : "3", "_score" : 6.0, "_source" : { "local_id" : 13, "suggest" : [ { "input" : "Nevermind", "weight" : 6 } ] } } ] } ] } }

Let me know if we could do it with a small tweak in existing ES-6.5 source code. Alternatively, since we need this functionality urgently, we could assume that the field to use is always named local_id and hard-code a condition on that field name: if local_id is present in the doc it is used, otherwise the matched text is used as it is by default.

Thanks in advance.

ravi-kumar-yadav commented 5 years ago

@jimczi, @javanna: I saw a similar PR that you worked on. Could you please help me with this issue?

elasticmachine commented 5 years ago

Pinging @elastic/es-search

jimczi commented 5 years ago

Let me know if we could do it with a small tweak in existing ES-6.5 source code

No, we can't: the completion suggester uses a data structure independent of the inverted lists, so it is not possible to use fields outside of the suggester. One workaround I can think of would be to add the local_id as a suffix in your suggestion input. So you could replace Nevermind with Nevermind_user12, for instance, and then retrieve the original value in the client by removing the suffix from the responses. The deduplication would work on the entire input (Nevermind_user12), so you would be able to filter suggestions that share the same surface form and local_id, at the cost of a small rewriting step at ingest time and when handling the response.
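A minimal sketch of that rewriting step, in Python. It is only an illustration: the helper names (encode_input, decode_option, build_doc) and the "_user" separator are assumptions made for this example, not part of Elasticsearch, and the separator has to be something that never occurs in a real suggestion text.

```python
from typing import Any, Dict, Tuple

# Hypothetical separator; it must never appear inside a real suggestion text.
SEPARATOR = "_user"


def encode_input(title: str, local_id: int) -> str:
    """Build the completion input stored at ingest time, e.g. 'Nevermind_user12'."""
    if SEPARATOR in title:
        raise ValueError("title already contains the separator")
    return f"{title}{SEPARATOR}{local_id}"


def decode_option(text: str) -> Tuple[str, int]:
    """Split a returned option text back into (original title, local_id)."""
    title, sep, raw_id = text.rpartition(SEPARATOR)
    if not sep:
        raise ValueError("option text carries no local_id suffix")
    return title, int(raw_id)


def build_doc(title: str, local_id: int, weight: int) -> Dict[str, Any]:
    """Document body in the same shape as the docs indexed in this issue."""
    return {
        "local_id": local_id,
        "suggest": [{"input": encode_input(title, local_id), "weight": weight}],
    }
```

With inputs built this way, the prefix never still matches, and skip_duplicates: true collapses docs 1 and 2 (both Nevermind_user12) while keeping doc 3 (Nevermind_user13) as a separate option. The client then calls decode_option on each option's text before displaying it.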

ravi-kumar-yadav commented 5 years ago

Thanks @jimczi.

But I noticed that the class Option (org.elasticsearch.search.suggest.completion) takes a docID in its constructor, and it is always set to 0. If we could populate this with our local_id, or any fixed doc field, when the Option object is created, and use it during de-duplication (instead of the suggestion input, which is the default criterion), would that help?

jimczi commented 5 years ago

The Option class is used to merge the results of different shards, so this would not solve the original issue. As said above, the completion suggester uses a different data structure where documents are not sorted by docID, so it is not possible to use another field when we perform lookups. The deduplication based on the input can be fast because suggestions are indexed in a tree-like structure where we can prune the paths that we already visited. The solution here is to append the local_id field to your suggestion inputs, as advised above. I hope you don't mind if I close this issue, but the completion suggester is already quite complex and this feature can be implemented client-side with little effort.
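For comparison, here is a rough sketch of a purely client-side variant. It is a different technique from the suffix workaround advised above: it assumes the query is sent without skip_duplicates and with a size large enough to return every candidate, and that each option carries _source and _score as in the responses shown in this issue. The function name dedupe_options is made up for the example.

```python
from typing import Any, Dict, List


def dedupe_options(options: List[Dict[str, Any]], field: str = "local_id") -> List[Dict[str, Any]]:
    """Keep only the highest-scoring option for each distinct value of `field`."""
    best: Dict[Any, Dict[str, Any]] = {}
    for option in options:
        key = option["_source"][field]
        # Replace the stored option only when this one scores strictly higher.
        if key not in best or option["_score"] > best[key]["_score"]:
            best[key] = option
    # Return the survivors ordered by score, highest first.
    return sorted(best.values(), key=lambda o: o["_score"], reverse=True)
```

Applied to the three docs from this issue (scores 10, 9 and 6), this keeps doc 1 for local_id 12 and doc 3 for local_id 13. The trade-off is that duplicates are only removed after they have been fetched, so a small size may be filled entirely by duplicates of one group.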

ravi-kumar-yadav commented 5 years ago

@jimczi: Thanks for the quick reply, I will try the approach shared above.