elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Completion Suggester V2 #10746

Closed: areek closed this issue 9 years ago

areek commented 9 years ago

Completion Suggester V2

The completion suggester provides auto-complete/search-as-you-type functionality. This is a navigational feature to guide users to relevant results as they are typing, improving search precision. It is not meant for spell correction or did-you-mean functionality like the term or phrase suggesters.

The completions are indexed as a weighted FST (finite state transducer) to provide fast Top N prefix-based searches suitable for serving relevant results as a user types.

Notable Features:

Completion Suggester V2 is based on LUCENE-6339 and LUCENE-6459, the first iteration of Lucene's new suggest API.

Mapping

The completion fields are indexed in a special way, hence a field mapping has to be defined. The following shows a field mapping for a completion field named title_suggest:

PUT {INDEX_NAME}
{
 "mappings": {
  {TYPE_NAME}: {
   "properties": {
    "title_suggest": {
     "type": "completion"
    }
   }
  }
 }
}

You can choose the index-time and search-time analyzers for the completion field by adding the analyzer and search_analyzer options.
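For example, a minimal sketch of a mapping that sets both analyzers (the simple analyzer is used purely for illustration; any registered analyzer works):

PUT {INDEX_NAME}
{
 "mappings": {
  {TYPE_NAME}: {
   "properties": {
    "title_suggest": {
     "type": "completion",
     "analyzer": "simple",
     "search_analyzer": "simple"
    }
   }
  }
 }
}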

Context Mappings

Adding a contexts option in the field mapping defines a context-enabled completion field. You may want a context-enabled completion field if you require filtering or boosting suggestions by criteria other than just their prefix. Note that adding high-cardinality context values will increase the size of the in-memory index significantly.

Two context types are supported: category and geo.

Category Context Mapping

Category contexts are indexed as prefixes to the completion field value.

The following adds a category context named genre:

...
"contexts": [
 {
   "name": "genre",
   "type": "category"
 }
]

You can also pull context values from another field in the document by using a path option specifying the field name.
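For example, a sketch of a category context that also reads its values from a document field (the genre path is illustrative):

...
"contexts": [
 {
   "name": "genre",
   "type": "category",
   "path": "genre"
 }
]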

Geo Context Mapping

Geo points are encoded as geohash strings and prefixed to the completion field value. The following adds a geo context named location:

...
"contexts": [
 {
   "name": "location",
   "type": "geo"
 }
]

You can also set the precision option to choose the geohash length, and the path option to pull context values from another field in the document.
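For example, a sketch combining both options (the precision value and the pin field name are illustrative):

...
"contexts": [
 {
   "name": "location",
   "type": "geo",
   "precision": 4,
   "path": "pin"
 }
]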

Indexing

Just like any other field, you can add multiple completion fields to a document. You can also index multiple completions for a completion field per document. Each completion value is tied to its document and can be assigned an index-time weight, which determines its relative rank among other completion values that share a common prefix.

The following indexes a completion value and its weight for the title_suggest completion field:

POST {INDEX_NAME}/{TYPE_NAME}
{
 "title_suggest": {
  "input": "title1",
  "weight": 7
 }
}

You can use the short form if you prefer not to attach a weight to the completion:

POST {INDEX_NAME}/{TYPE_NAME}
{
 "title_suggest": "title1"
}

Arrays are also supported for indexing multiple values. The following indexes multiple completion entries (input and weight) for a single document:

POST {INDEX_NAME}/{TYPE_NAME}
{
 "title_suggest": [
  {
   "input": "title1",
   "weight": 14
  },
  {
   "input": "alternate_title",
   "weight": 7
  }
 ]
}

Indexing context-enabled fields

You can use the path option previously mentioned to pull context values from another field in the document, or add a contexts option to the completion entry while indexing.

The following explicitly indexes context values along with completions:

POST {INDEX_NAME}/{TYPE_NAME}
{
 "genre_title_suggest": {
  "input": "title1",
  "contexts": {
   "genre": ["genre1", "genre2"]
  },
  "weight": 7
 }
}

You can also configure the path option in the context mapping to pull values from another field, as follows (assuming path for the genre context has been set to the genre field):

POST {INDEX_NAME}/{TYPE_NAME}
{
 "genre_title_suggest": "title1",
 "genre": ["genre1", "genre2"]
}

Query Interface

The point of indexing values as completions is to be able to run fast prefix-based searches on them. You can run prefix, fuzzy, and regex queries on all completion fields. In the case of a context-enabled completion field, providing no context means all contexts will be considered, but you cannot run a context query on a completion field with no contexts. When a query is run on a context-enabled field, the contexts for a completion are returned with the suggestion.

Prefix Query

The following suggests completions from the field title_suggest that start with titl:

POST {INDEX}/_suggest
{
 "suggest-namespace" : {
  "prefix" : "titl",
  "completion" : {
   "field" : "title_suggest"
  }
 }
}

The suggestions are sorted by their index-time weight.

Fuzzy Prefix Query

A fuzzy prefix query can serve typo-tolerant suggestions. Suggestions closer (by edit distance) to the provided prefix are scored higher, regardless of their weight.

POST {INDEX}/_suggest
{
 "suggest-namespace" : {
  "prefix" : "sug",
  "completion" : {
   "field" : "suggest",
   "fuzzy" : {        (1)
    "fuzziness" : 2
   }
  }
 }
}

Specify fuzzy as shown in (1) to use the typo-tolerant suggester. The fuzzy option accepts further parameters (transpositions, min_length, prefix_length, and unicode_aware, discussed later in this thread), as shown in the sketch below.
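A sketch of a fuzzy query with all of these options spelled out (the values shown are illustrative, not prescriptive):

POST {INDEX}/_suggest
{
 "suggest-namespace" : {
  "prefix" : "sug",
  "completion" : {
   "field" : "suggest",
   "fuzzy" : {
    "fuzziness" : 2,
    "transpositions" : true,
    "min_length" : 3,
    "prefix_length" : 1,
    "unicode_aware" : false
   }
  }
 }
}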

Regex Prefix Query

A regex prefix query matches all term prefixes that match a regular expression. The regex is anchored at the beginning but not at the end. The suggestions are sorted by their index-time weight.

POST {INDEX}/_suggest
{
 "suggest-namespace" : {
  "regex" : "s[u|a]g",    (1)
  "completion" : {
   "field" : "suggest"
  }
 }
}

Specify regex as shown in (1), instead of prefix, to query with a regular expression (Lucene regular expression syntax is supported).

Context Query

Adding the contexts option (1) to the query enables filtering and/or boosting suggestions based on their context values. This query scores suggestions by multiplying the query-time boost with the suggestion weight.

POST {INDEX}/_suggest
{
 "suggest-namespace" : {
  "prefix" : "sug",
  "completion" : {
   "field" : "genre_title_suggest",
   "contexts": {           (1)
    "genre": [
     {
      "value" : "rock", 
      "boost" : 3
     },
     {
      "value" : "indie",
      "boost" : 2
     }
    ]
   }
  }
 }
}

The contexts can also be specified without any boost:

  ...
  "contexts": {
    "genre" : ["rock", "indie"]
  }

Geo Context Query:

The results are scored such that suggestions are first sorted by the distance between the corresponding geo context and the provided geo location, and then by the suggestions' weights.

  ...
  "contexts" : {
    "location" : {
      "context" : {
        "lat" : ..,
        "lon" : ..
      },
      "precision" : ..
    }
  }
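A filled-in sketch of such a query (the field name, coordinates, and precision are illustrative):

POST {INDEX}/_suggest
{
 "suggest-namespace" : {
  "prefix" : "sug",
  "completion" : {
   "field" : "location_title_suggest",
   "contexts" : {
    "location" : {
     "context" : {
      "lat" : 43.662,
      "lon" : -79.380
     },
     "precision" : 2
    }
   }
  }
 }
}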

Example

The following performs a Fuzzy Prefix Query combined with a Context Query on a context-enabled completion field named genre_song_suggest:

POST {INDEX}/_suggest
{
 "suggest-namespace" : {
  "prefix" : "like a roling st",
  "completion" : {
   "field" : "genre_song_suggest",
   "fuzzy" : {
    "fuzziness" : 2
   },
   "contexts" : {
    "genre": [
     {
      "context" : "rock", 
      "boost" : 3
     },
     {
      "context" : "indie",
      "boost" : 2
     }
    ]
   }
  }
 }
}

This query will return all song names for the genres rock and indie that are within an edit distance of 2 from the prefix like a roling st. The song names with the genre rock will be boosted higher than those with indie. The completion field values that share the longest prefix with like a roling st will additionally be boosted higher.

Payload

You can retrieve any document field values along with the completions using the payload option. The following returns the url field with each suggestion entry:

POST {INDEX}/_suggest
{
 "suggest-namespace" : {
  "prefix" : "titl",
  "completion" : {
   "field" : "title_suggest",
   "payload" : ["url"]
  }
 }
}

The response format is as follows:

{
 ...
 "suggest-namespace" : [ 
  {
   "prefix" : "sugg",
   "offset" : 0,
   "length" : 4,
   "options" : [ 
    {
     "text" : "suggestion",
     "score" : 34.0, 
     "payload": {
       "url" : [ "url_1" ]
     }
    }
   ]
  } 
 ]
}
clintongormley commented 9 years ago

Hi @areek

I like the look of this a lot. Btw, at the top you say we can use an arbitrary filter, but then later you say only a prefix/regex/etc filter. Which is it?

Also, under "Context Query" you say "Prefix, Regex and Fuzzy queries can be used with this query." Can you provide an example of this?

clintongormley commented 9 years ago

@s1monw please could you review this as well.

s1monw commented 9 years ago

I like this a lot. The only thing I find confusing is that we use text as the property name when it's a prefix / fuzzy prefix query; should we just name it prefix, since we also use regex? I wonder if we finally plan to return the entire document here?

areek commented 9 years ago

@clintongormley & @s1monw thanks for taking a look!

I have updated the Query Interface section to clarify how multiple queries can be used together (includes example).

text has been renamed to prefix for prefix queries and regex for regex prefix queries. At the moment, _source and fields can be used to retrieve field values for all resulting documents (like the _search API).

Also updated the issue with link to LUCENE-6459, query interface for suggest API.

Let me know if there is still any confusion with the API.

clintongormley commented 9 years ago

Looking awesome. A few thoughts:

areek commented 9 years ago

Thanks @clintongormley for the feedback!

The search_analyzer should default to the analyzer setting - just checking that this is the case.

That is the case :)

We should probably remove the fuzzy wrapper and just allow specifying fuzziness directly?

If we remove the fuzzy wrapper, there would be no way for the user to specify transpositions, min_length, prefix_length and unicode_aware. I think it is useful to expose at least the min_length and prefix_length to the user?

For geo-context searching, you sort on distance and THEN on weight? Feels like we should be able to combine these two factors in a user configurable way.

We can allow this to be configurable (distance then weight or weight then distance) in the future. I think sorting the suggestions by distance and then weight for the first iteration is good?

clintongormley commented 9 years ago

If we remove the fuzzy wrapper, there would be no way for the user to specify transpositions, min_length, prefix_length and unicode_aware. I think it is useful to expose at least the min_length and prefix_length to the user?

OK, that's fine. Btw, do we really need unicode_aware? We use unicode everywhere, not sure why we would ever disable that?

We can allow this to be configurable (distance then weight or weight then distance) in the future. I think sorting the suggestions by distance and then weight for the first iteration is good?

OK

abhijitiitr commented 9 years ago

@areek Scoring for fuzzy queries still doesn't appear to be correct. Results that exactly match the input string of a suggest query do not score better than fuzzy matches. I guess you have noted it in the code itself (code link). Do you have any suggestions for handling fuzziness correctly in the proposed Completion Suggester?

ivpusic commented 9 years ago

what about duplicate_output option? It was mentioned here https://github.com/elastic/elasticsearch/issues/8909

areek commented 9 years ago

@abhijitiitr fuzziness scoring cannot distinguish between a perfect match and a match with only the last character being a typo. You could sort the results you get back in your application to correctly score the fuzziness; unfortunately, this is what I would suggest atm. BTW, see https://github.com/elastic/elasticsearch/pull/13659 for https://github.com/elastic/elasticsearch/pull/11740#issuecomment-123989933: now you can return values of document fields along with each suggestion.

@ivpusic the duplicate_output option has been removed; now each suggestion is associated with a unique document instead. If you have two identical inputs for two documents, you will get suggestions back for both documents with the same input.

abhijitiitr commented 9 years ago

@areek :thumbsup: for the payload feature.

For fuzzy queries, we have to fetch (> N) results to get N correct results. In some cases I tested, I had to set the size option to 5N. That's a bummer I guess, but it's a Lucene issue and I think it will be solved in the future.

djschny commented 9 years ago

I have some concerns over performance with the change to remove the payload functionality. Let me try to elaborate and please let me know if I stated anything incorrectly.

With the current 1.x completion suggester, the entire FST (including payloads) is held on heap (but persisted to disk). By leveraging payloads we were able to achieve extreme performance, as we never had to do a fetch of the associated docs for field values, parse the JSON, etc. This made for very fast response times.

My concern is with clients who want to continue making indices with completion suggester mappings that leverage payloads for performance reasons. I believe it would be really nice to default to the new approach, but still let one opt in to using payloads. While payloads can be problematic if abused, many clients are smart about it and/or have small enough suggester FSTs that the extra memory associated with payloads is a non-issue compared to the performance expectation they have.

Hopefully there has been testing or other work that I missed reading this issue and the associated PRs that proves this a non-issue and apologies if I missed it. Just want to make sure we don't have surprises and regression for customers.

kabcampbell commented 9 years ago

Hi all, I am running up against the hard-coded limit on the number of context categories that a suggestion can have, as referenced in #9466. Is that limit still on track to be removed? Is this version of the completion suggester to be included in Elasticsearch 2.1? Thanks!

mikemccand commented 9 years ago

By leveraging payloads we were able to strive extreme performance as we never had to do a FETCH of the associated docs for field values, parse the json, etc. This made for very fast response times.

Hi @djschny, you are correct that payloads are moving out of the FST to disk.

However, with this PR, we 1) pull the payloads from doc values, which is much more efficient than re-parsing _source or loading stored fields, and 2) do so eagerly in a single pass (no second fetch phase), though this may change in the future if we merge suggest into the search implementation.

Also, the number of values we are pulling is typically tiny (the size passed in the suggest request, i.e. number of suggestions, default is 5). We pull them only once we have the top suggestions for that shard.

So while there will be some perf hit, I expect it won't matter in most use cases.

djschny commented 9 years ago

So while there will be some perf hit, I expect it won't matter in most use cases.

I agree as well. Where my concern lies is in the 20%:

Generally speaking I tend to follow the "golden rule" that says to "always put the user in control": allow our users to choose which approach works best for them, based on a tradeoff between speed with increased memory versus large scale with off-heap memory usage at the cost of some slight performance.

Hope I'm not being too noisy, but I would hate to see this be an unpleasant surprise or an upgrade blocker for that 20% of our user base. One of the things that has made Elasticsearch so wonderfully accepted by everyone is that users had options to configure and use it for what made sense for their particular requirements/needs (since everyone is different), and I just don't want to lose that.

djschny commented 9 years ago

Was this prematurely closed? I'm still pretty concerned about removing the payloads ability that users have now, without concrete numbers showing the speed is the same.

areek commented 9 years ago

@djschny This was closed because the feature is merged to master and will be available in v2.2, but it is still open for discussion. I will benchmark the feature and update this issue with the results.

areek commented 9 years ago

@djschny I have benchmarked the completion suggester in a single-shard environment using the geonames dataset of ~9.3 million city names. The FST size for the dataset was ~201.5 MB. The following compares the search performance (in KQPS) of the completion suggester (with and without the newly added query-time doc-values-based payload) and an equivalent prefix query over increasing prefix lengths:

[benchmark graph: completion suggester vs. prefix query, master]

The completion suggester was at most +19 KQPS and at least +9 KQPS faster than its prefix query counterpart for prefix lengths of 1 to 6. Retrieving a non-analyzed doc-values-based string payload at query time was at most -6 KQPS and at least -2 KQPS slower than completion suggestions without a payload.

The same benchmark was run on the pre-2.2 completion suggester. As payloads were an index-time setting, the dataset produced an FST size of ~201.5 MB without payloads and ~352.5 MB with payloads. The alternate names of the cities were used for the payload values:

[benchmark graph: completion suggester vs. prefix query, pre-2.2]

The pre-2.2 completion suggester was at most +19 KQPS and at least +11 KQPS faster than its prefix query counterpart, and completions with index-time payloads were at most -3 KQPS and at least -1 KQPS slower than completions with no payloads.

Note that the overall trends of the benchmarks are more interesting than the absolute KQPS values, as the latter seem to vary between runs, unlike the overall trends seen in the graphs.

Overall, the query-time doc-values-based payloads are slower than the previous index-time payloads that were stored as part of the FST. This is as expected and IMO justified by ensuring that the FST size is bounded only by its inputs instead of by arbitrary index-time payload values (which could be anything from a small id to a full-blown JSON object), and by the flexibility of specifying one or more query-time payloads. The current infrastructure also makes it possible to use the FST index directly in the query DSL and for aggregations in the future. IMO, if needed, a future optimization could expose the index-time payloads option as a saner _cache option, where users can specify a single field to be stored in the FST for speeding up completions with payloads.

simplechris commented 9 years ago

Generally speaking I tend to always follow the "golden rule" that says to "always put the user in control". So therefore allowing our users to choose which approach works best for them based upon a tradeoff of speed for increased memory, vs. large scale and off heap memory usage at the cost of some slight performance - djschny

:+1: Yeah, I would like to have this option. I'm willing to use more memory to have my payloads stored within the FST.

... where users can specify a single field to be stored in the FST for speeding up completions with payloads. - areek

This sounds like a good compromise. Without it, honestly, I'd be very apprehensive about upgrading.

The rest of the ticket and progress sound so exciting. I believe we should document this trade-off and allow users to make an informed decision on if/how to store their payloads. Let's keep the user in ultimate control, even if the default behaviour changes to best serve the 80% use case. :)

djschny commented 9 years ago

@areek Thanks for all the numbers, much appreciated. In your examples, do you have the average response time associated with performing a completion suggestion at the various prefix lengths? That is the more important measurement that I was expecting to see, as it determines the end-user experience.

jimczi commented 9 years ago

Overall, the query-time doc-values-based payloads are slower than the previous index-time payloads that were stored as part of the FST. This is as expected and IMO justified by ensuring that the FST size is bounded only by its inputs instead of by arbitrary index-time payload values (which could be anything from a small id to a full-blown JSON object), and by the flexibility of specifying one or more query-time payloads. - areek

Have you considered replacing the BytesRef in the FST with a pointer to a file where the payloads would be written? Something like Pair<Long, Long>, where the first value would still be the weight and the second a pointer to the start of the payload for this entry. This would ensure that the FST size is only bounded by its inputs and would also speed up performance compared to the doc-values case (though it would rely on mmap reads in both cases).

s1monw commented 9 years ago

Have you considered replacing the BytesRef in the FST with a pointer to a file where the payloads would be written? Something like Pair<Long, Long>, where the first value would still be the weight and the second a pointer to the start of the payload for this entry. This would ensure that the FST size is only bounded by its inputs and would also speed up performance compared to the doc-values case (though it would rely on mmap reads in both cases).

isn't this what we do now by putting the payloads in docvalues? I mean, we use the doc ID to reference a payload on disk?

jimczi commented 9 years ago

@s1monw right, but my point is that we can make the FST size bounded only by its input without the docvalues. Having the docvalues in the process adds an indirection (docid => file_ptr => payload) that a direct pointer inside the FST could save.

s1monw commented 9 years ago

@s1monw right, but my point is that we can make the FST size bounded only by its input without the docvalues. Having the docvalues in the process adds an indirection (docid => file_ptr => payload) that a direct pointer inside the FST could save.

this breaks all feature isolation with no benefit IMO. We are using a compressed integer (the docID) to reference the payload; any optimization on top of this is premature and IMO of no benefit. We would need to invent and maintain yet another file format, which is not sustainable in our environment. IMO the doc-values indirection is already as specialized as it gets here. And remember, we only fetch this for the top N.

areek commented 9 years ago

In your examples, do you have the average response time associated with performing a completion suggestion with the various prefix lengths? That is the more important measurement that I was expecting to see, as that results in the end user experience.

@djschny I think average response time across various prefix lengths is misleading when we are concerned with the completion suggester's latency. If latency is a concern for completions, IMO the right question to ask is "What is the highest latency I can tolerate?" instead. From the results, we can see that latency decreases as the prefix length increases (as expected); hence the QPS for a prefix length of 1 is the most important measure in the benchmark, and the rest just shows the relationship between latency and increasing prefix lengths. So for this dataset, it happens to be in the range of 18 to 20 KQPS. In contrast, if we measured the average response time, it would land somewhere in the low 30s KQPS, which does not really help when you always see higher latency for smaller prefixes, which is the main use case for completions to begin with.

djschny commented 8 years ago

@areek That is exactly my point: some users may not be able to tolerate the highest latency. For example, if folks see their average completion suggest response time go from 5 ms to 20 ms due to the disk hit to fetch from _source, then they would ideally want the original behavior of storing a payload in the FST.

Most importantly, today you can build an FST that has no relationship to source documents stored in ES. With the new change you cannot build a completion suggester with payloads while keeping your actual documents in a different system. Instead you would need to create fake/dummy documents that just contain the same fields you would ultimately want in the FST. My apologies if I'm missing something there.

JeremyBYU commented 8 years ago

I'm having a difficult time understanding how to use the new payloads option. I see that we are supposed to "store" our payload on the actual document now and access these properties through the payload: [property] syntax. However, my payload is a nested JSON document with multiple fields, many of which are optional. So I map this JSON document to a property, but I am forced to access the payload in this awkward way:

        "prefix": "coupon mi",
        "completion": {
            "payload": ["search.prop1.nested.prop2", "search.prop2.nestedprop3"], 
            "field": "suggest",
            "fuzzy":{},
            "size": 25
        }

Is there any way to just tell the suggestion engine to return a payload that is simply a nested JSON object? Or, maybe more clearly, is there any way to not index a JSON object on the property of a document, so that when the suggestion engine 'looks up' a property it just returns the JSON?

areek commented 8 years ago

Hey @JeremyBYU

Is there any way to just tell the suggestion engine to just return back a payload that is simply a nested JSON object?

Currently no, you have to specify the payload fields as you are currently doing. But we plan to return the entire document a suggestion hit is associated with. When that is supported, you can use source filtering to exclude/include fields from the document.

In general I would suggest using a dedicated index for suggestions, so that the associated suggest documents contain only the fields relevant at suggest time. This might simplify the document structure and avoid the need to specify nested JSON fields as payloads altogether.

JeremyBYU commented 8 years ago

Thanks for the explanation. I was indeed using a dedicated index. The structure of the object is important, so if I flatten it into one document, I will have to regenerate it for the client. Thanks

Antonhansel commented 8 years ago

Just dug up this issue; I'm trying to get duplicate documents. I'm not sure about this answer from @areek: does it mean it should return duplicate documents?

jacktuck commented 7 years ago

There's a lot of noise following an issue (I won't make more noise by linking) where deleted documents were not getting filtered out.

I have been having a similar issue where updates to payloads were not propagating, and I'm thinking it is the same issue. Could someone let me know what the state of that is in the current suggester?

Are deletions taken into consideration? Are updates to payloads taken into consideration?

@areek

jacktuck commented 7 years ago

Never mind, I found https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_50_suggester.html

tigra commented 6 years ago

@areek

fuzziness scoring cannot distinguish between a perfect match and a match with only the last character being a typo

Can you provide an example of where this happens? I'm not sure I'm understanding the phrase correctly.

P.S.: I would very much like edit distance to influence the score.