areek closed this issue 9 years ago
Hi @areek
I like the look of this a lot. Btw, at the top you say we can use an arbitrary filter, but then later you say only a prefix/regex/etc filter. Which is it?
Also, under "Context Query" you say "Prefix, Regex and Fuzzy queries can be used with this query." Can you provide an example of this?
@s1monw please could you review this as well.
I like this a lot. The only thing that I think is confusing is that we use `text` as the property name; if it's a prefix / fuzzy prefix, should we just name it `prefix`, since we also use `regex`? I wonder if we finally plan to return the entire document here?
@clintongormley & @s1monw thanks for taking a look!
I have updated the Query Interface section to clarify how multiple queries can be used together (includes example).
`text` has been renamed to `prefix` for prefix queries and `regex` for regex prefix queries. At the moment, `_source` and `fields` can be used to retrieve field values for all resulting documents (like the `_search` API).
Also updated the issue with a link to LUCENE-6459, the query interface for the suggest API.
Let me know if there is still any confusion with the API.
Looking awesome. A few thoughts:

- The `search_analyzer` should default to the `analyzer` setting - just checking that this is the case.
- We should probably remove the `fuzzy` wrapper and just allow specifying `fuzziness` directly?

Thanks @clintongormley for the feedback!
> The `search_analyzer` should default to the `analyzer` setting - just checking that this is the case.
That is the case :)
> We should probably remove the `fuzzy` wrapper and just allow specifying `fuzziness` directly?
If we remove the fuzzy wrapper, there would be no way for the user to specify `transpositions`, `min_length`, `prefix_length` and `unicode_aware`. I think it is useful to expose at least the `min_length` and `prefix_length` to the user?
For geo-context searching, you sort on distance and THEN on weight? Feels like we should be able to combine these two factors in a user configurable way.
We can allow this to be configurable (distance then weight or weight then distance) in the future. I think sorting the suggestions by distance and then weight for the first iteration is good?
> If we remove the fuzzy wrapper, there would be no way for the user to specify `transpositions`, `min_length`, `prefix_length` and `unicode_aware`. I think it is useful to expose at least the `min_length` and `prefix_length` to the user?
OK, that's fine. Btw, do we really need `unicode_aware`? We use unicode everywhere, not sure why we would ever disable that?
> We can allow this to be configurable (distance then weight or weight then distance) in the future. I think sorting the suggestions by distance and then weight for the first iteration is good?
OK
@areek Scoring for fuzzy queries still does not appear to be correct. Results that exactly match the input string are not scoring higher than fuzzy matches for suggest queries. I guess you have noted this in the code itself (code link). Do you have any suggestions for handling fuzziness correctly in the proposed Completion Suggester?
What about the `duplicate_output` option? It was mentioned here: https://github.com/elastic/elasticsearch/issues/8909
@abhijitiitr fuzziness scoring can not distinguish between a perfect match and a match with only the last character being a typo. You could sort the results you get back in your application to correctly score the fuzziness. Unfortunately, this is what I would suggest atm. BTW, see https://github.com/elastic/elasticsearch/pull/13659 for https://github.com/elastic/elasticsearch/pull/11740#issuecomment-123989933, now you can return values of document fields, along with each suggestion.
@ivpusic the `duplicate_output` option has been removed; now each suggestion is associated with a unique document instead. If you have two identical `input`s for two documents, you will get back suggestions for both documents with the same input.
@areek :thumbsup: for the payload feature.
For fuzzy queries, we have to fetch more than N results to get N correct results. In some cases I tested, I had to set the `size` option to 5N. That's a bummer I guess, but it's a Lucene issue and I think it will be solved in the future.
I have some concerns over performance with the change to remove the `payload` functionality. Let me try to elaborate, and please let me know if I have stated anything incorrectly.
With the current 1.x completion suggester, the entire FST (including payloads) is held on heap (but persisted to disk). By leveraging payloads we were able to achieve extreme performance, as we never had to do a FETCH of the associated docs for field values, parse the JSON, etc. This made for very fast response times.
My concern is for clients who want to continue creating indices with completion suggester mappings that leverage payloads for performance reasons. Instead, I believe it would be really nice to default to the new approach, but still let one opt in to using payloads. While payloads can be problematic if abused, many clients are smart about it and/or have small enough suggester FSTs that the extra memory associated with payloads is a non-issue compared to the performance expectation they have.
Hopefully there has been testing or other work that I missed reading this issue and the associated PRs that proves this a non-issue and apologies if I missed it. Just want to make sure we don't have surprises and regression for customers.
Hi all, I am running up against the hard-coded limit to the number of context categories that a suggestion can have, as referenced in #9466. Is that limit still on track to be removed? Is this version of the completion suggester to be included in elastic 2.1? Thanks!
> By leveraging payloads we were able to achieve extreme performance as we never had to do a FETCH of the associated docs for field values, parse the JSON, etc. This made for very fast response times.
Hi @djschny, you are correct that payloads are moving out of the FST to disk.
However, with this PR, we 1) pull the payloads from doc values, which is much more efficient than re-parsing `_source` or loading stored fields, and 2) do so eagerly in a single pass (no second fetch phase), though this may change in the future if we merge suggest into the search implementation.
Also, the number of values we are pulling is typically tiny (the `size` passed in the suggest request, i.e. the number of suggestions; the default is 5). We pull them only once we have the top suggestions for that shard.
So while there will be some perf hit, I expect it won't matter in most use cases.
> So while there will be some perf hit, I expect it won't matter in most use cases.
I agree as well. Where my concern lies is in the 20%:
Generally speaking, I tend to always follow the "golden rule" that says to "always put the user in control". Therefore, we should allow our users to choose which approach works best for them, based on the tradeoff: speed at the cost of increased memory, vs. large scale and off-heap memory usage at the cost of some slight performance.
Hope I'm not being too noisy, but I would hate to see this be an unpleasant surprise or an upgrade blocker for that 20% of our user base. One of the things that has made Elasticsearch so wonderfully accepted by everyone is that users had options to configure and use it for what made sense for their particular requirements/needs (since everyone is different), and I just don't want to lose that.
Was this prematurely closed? I'm still pretty concerned about removing the payloads ability that users have now, without concrete numbers showing that the speed is the same.
@djschny This was closed as the feature is merged to master and will be available for v2.2, but is still open for discussion. I will benchmark the feature and update this issue with the benchmark results.
@djschny I have benchmarked the completion suggester in a single-shard environment using the geonames dataset of ~9.3 million city names. The FST size for the dataset was ~201.5mb.
The following shows the comparison of search performance (in KQPS) of the completion suggester (with and without the newly added query-time doc-values based `payload`) and an equivalent `prefix` query, with increasing prefix length:
The completion suggester was at most +19 KQPS and at least +9 KQPS faster compared with its prefix query counterpart for prefix lengths of 1 to 6. Retrieving a non-analyzed doc-values based string `payload` at query time was at most -6 KQPS and at least -2 KQPS slower compared to completion suggestions without a `payload`.
The same benchmark was run on the pre-2.2 completion suggester. As `payloads` was an index-time setting, the dataset produced an FST size of ~201.5mb without payloads and ~352.5mb with payloads. The alternate names of the cities were used as the `payload` values:
The pre-2.2 completion suggester was at most +19 KQPS and at least +11 KQPS faster compared with its prefix query counterpart, and completions with index-time payloads were at most -3 KQPS and at least -1 KQPS slower than completions with no payloads.
Note that the overall trends of the benchmarks are more interesting than the absolute KQPS values, as the values seem to vary between runs, unlike the overall trends seen in the graphs.
Overall, the query-time doc-values based payloads are slower than the previous index-time payloads that were stored as part of the FST. This is expected and IMO justified: it ensures that the FST size is bounded only by its inputs instead of by arbitrary index-time payload values (which could be anything from a small id to a full-blown JSON object), and it provides the flexibility of specifying one or more query-time payloads. The current infrastructure also makes it possible to use the FST index directly in the query DSL and for aggregations in the future. IMO, if needed, a future optimization could expose the index-time `payloads` option as a saner `_cache` option, where users can specify a single field to be stored in the FST to speed up completions with payloads.
> Generally speaking I tend to always follow the "golden rule" that says to "always put the user in control". So therefore allowing our users to choose which approach works best for them based upon a tradeoff of speed for increased memory, vs. large scale and off heap memory usage at the cost of some slight performance. - djschny
:+1: Yeah, I would like to have this option. I'm willing to use more memory to have my payloads stored within the FST.
> ... where users can specify a single field to be stored in the FST for speeding up completions with payloads. - areek
This sounds like a good compromise. Without it, honestly, I'd be very apprehensive about upgrading.
The rest of the ticket and progress sounds so exciting. I believe we should document this trade-off and allow users to make an informed decision on if/how to store their payloads. Let's keep the user in ultimate control, even if the default behaviour changes to best serve the 80% use case. :)
@areek Thanks for all the numbers, much appreciated. In your examples, do you have the average response time associated with performing a completion suggestion with the various prefix lengths? That is the more important measurement that I was expecting to see, as that determines the end user experience.
> Overall the query-time doc-values based payloads are slower than the previous index-time payloads that were stored as part of the FST, this is as expected and IMO justified by ensuring that the FST size is only bounded by its inputs instead of arbitrary index-time payload values (that could be anything from a small id to a full-blown json object) and the provided flexibility of specifying one or more query-time payloads. - areek
Have you considered replacing the BytesRef in the FST with a pointer to a file where the payloads would be written? Something like `Pair<Long, Long>`, where the first value would still be the weight and the second value would be a pointer to the start of the payload for this entry. This would ensure that the FST size is only bounded by its inputs and would also speed up performance compared to the doc-values case (though it would rely on mmap reads in both cases).
> Have you considered replacing the BytesRef in the FST with a pointer to a file where the payloads would be written? Something like `Pair<Long, Long>`, where the first value would still be the weight and the second value would be a pointer to the start of the payload for this entry. This would ensure that the FST size is only bounded by its inputs and would also speed up performance compared to the doc-values case (though it would rely on mmap reads in both cases)
Isn't this what we do now by putting the payloads in doc values? I mean, we use the doc ID to reference a payload on disk?
@s1monw right, but my point is that we can make the FST size only bounded by its input without the doc values. Having the doc values in the process adds an indirection (docid => file_ptr => payload) that a direct pointer inside the FST could save.
> @s1monw right but my point is that we can make the FST size only bounded by its input without the docvalues. Having the docvalues in the process adds an indirection (docid => file_ptr => payload) that the direct pointer inside the FST could save.
This breaks all feature isolation with no benefit, IMO. We are using a compressed integer (the doc ID) to reference the payload; any optimization on top of this is premature and IMO of no benefit. We would need to invent and maintain yet another file format, which is not sustainable in our environment. IMO the doc-values indirection is already as specialized as it gets here. And remember, we only fetch this for the top N.
> In your examples, do you have the average response time associated with performing a completion suggestion with the various prefix lengths? That is the more important measurement that I was expecting to see, as that results in the end user experience.
@djschny I think average response time for various prefix lengths is misleading when we are concerned with the completion suggester's latency. If latency is a concern for completions, IMO the right question to ask is "What is the highest latency I can tolerate?" instead. From the results, we can see that latency decreases as the prefix length increases (as expected); hence the QPS for a prefix length of 1 is the most important measure in the benchmark, and the rest just shows the relationship between latency and increasing prefix lengths. So for this dataset, it happens to be in a range of 20 to 18 KQPS. In contrast, if we measure the average response time, it would be somewhere in the low 30s in KQPS, which does not really help when you always see higher latency for smaller prefixes, which is the main use case for using completions to begin with.
@areek That is exactly my point, some users may not be able to tolerate the highest latency. For example if folks see their average completion suggest response time go from 5 millis to 20 millis due to the hit on disk to fetch from _source, then they would ideally want the original behavior of storing a payload in the FST.
Most importantly, today you can make an FST and not actually have any relationship to source documents stored in ES. With the new change you cannot make a completion suggester with payloads but have your actual documents stored in a different system. Instead you would need to make fake/dummy documents that just contained the same fields that you would ultimately want in the FST. My apologies if I'm missing something there.
I'm having a difficult time understanding how to use the new payloads option. I see that we are supposed to "store" our payload on the actual document now and access these properties through the `payload: ["property"]` syntax.
However, my payload is a nested JSON document that has multiple fields, many of which are optional. So I map this JSON document to a property, but I am forced to access the payload in this awkward way:
"prefix": "coupon mi",
"completion": {
"payload": ["search.prop1.nested.prop2", "search.prop2.nestedprop3"],
"field": "suggest",
"fuzzy":{},
"size": 25
}
Is there any way to just tell the suggestion engine to return a payload that is simply a nested JSON object? Or, maybe more clearly, is there any way to not index a JSON object on a property of a document, so that when the suggestion engine looks up the property it just returns the JSON?
Hey @JeremyBYU
> Is there any way to just tell the suggestion engine to just return back a payload that is simply a nested JSON object?
Currently no, you have to specify the payload fields as you are currently doing. But we plan to return the entire document a suggestion hit is associated with. When that is supported, you can use source filtering to exclude/include fields from the document.
In general I would suggest using a dedicated index for suggestions, so the associated suggest documents contain fields only relevant for retrieving at suggest time. This might simplify the document structure and avoid the need to specify nested json fields as payloads altogether.
Thanks for the explanation. I was indeed using a dedicated index. The structure of the object is important, so if I flatten it into one document, I will have to regenerate it for the client. Thanks
Just dug up this issue; I'm trying to get duplicate documents. I'm not sure about this answer from @areek — does it mean it should return duplicate documents?
There's a lot of noise following an issue (I won't make more noise by linking) where deleted documents were not getting filtered out.
I have been having a similar issue where updates to payloads were not propagating, and I'm thinking that is the same issue. Could someone let me know what the state of that is in the current suggester?
Are deletions taken into consideration? Are updates to payloads taken into consideration?
@areek
> fuzziness scoring can not distinguish between a perfect match and a match with only the last character being a typo
Can you provide an example where this happens? Not sure that I am understanding the phrase correctly.
P.S.: I would very much like to have edit distance influence the score.
Completion Suggester V2
The completion suggester provides auto-complete/search-as-you-type functionality. This is a navigational feature to guide users to relevant results as they are typing, improving search precision. It is not meant for spell correction or did-you-mean functionality like the term or phrase suggesters.
The completions are indexed as a weighted FST (finite state transducer) to provide fast Top N prefix-based searches suitable for serving relevant results as a user types.
Notable Features:

- Suggestions are associated with indexed documents; document field values can be returned with each suggestion as a `payload`.
- Prefix queries can be expressed as regular expressions via `regex`.
- Typo-tolerant prefix queries via `fuzzy`.

Completion Suggester V2 is based on LUCENE-6339 and LUCENE-6459, the first iteration of Lucene's new suggest API.
Mapping
Completion fields are indexed in a special way, hence a field mapping has to be defined. The following shows a field mapping for a completion field named `title_suggest`:
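The original mapping snippet did not survive; a minimal sketch of such a mapping, assuming the 2.x-era mapping syntax with an illustrative index name (`songs`) and type name (`song`):

```json
PUT songs
{
  "mappings": {
    "song": {
      "properties": {
        "title_suggest": {
          "type": "completion",
          "analyzer": "simple",
          "search_analyzer": "simple"
        }
      }
    }
  }
}
```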
You can choose the index-time and search-time analyzers for the completion field by adding the `analyzer` and `search_analyzer` options.

Context Mappings
Adding a `contexts` option in the field mapping defines a context-enabled completion field. You may want a context-enabled completion field if you require filtering or boosting suggestions by criteria other than just their prefix. Note that adding high-cardinality context values will increase the size of the in-memory index significantly.

There are two supported context types: `category` and `geo`.

Category Context Mapping
Category contexts are indexed as prefixes to the completion field value.
The following adds a category context named `genre`:
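The snippet is missing from the text; a sketch of what the category context mapping for the completion field might look like, assuming the `contexts` mapping syntax this proposal describes (field and context names follow the text):

```json
"title_suggest": {
  "type": "completion",
  "contexts": [
    {
      "name": "genre",
      "type": "category"
    }
  ]
}
```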
You can also pull context values from another field in a document by using the `path` option to specify the field name.

Geo Context Mapping
Geo points are encoded as geohash strings and prefixed to the completion field value. The following adds a geo context named `location`:
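The snippet is missing; a sketch of a geo context mapping, assuming the same `contexts` syntax as for category contexts (the `precision` value is illustrative):

```json
"title_suggest": {
  "type": "completion",
  "contexts": [
    {
      "name": "location",
      "type": "geo",
      "precision": 6
    }
  ]
}
```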
You can also set the `precision` option to choose the geohash length, and `path` to pull context values from another field in the document.

Indexing
Just like any other field, you can add multiple completion fields to a document. You can also index multiple completions per completion field per document. Each completion value is tied to its document and can be assigned an index-time weight, which determines its relative rank among other completion values that share a common prefix.
The following indexes a completion value and its weight for the `title_suggest` completion field:
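The original snippet is missing; a sketch of indexing a weighted completion, with illustrative index/type names (`songs`, `song`) and values:

```json
PUT songs/song/1
{
  "title_suggest": {
    "input": "Nothing Else Matters",
    "weight": 34
  }
}
```

The short form without a `weight` would presumably be just `"title_suggest": "Nothing Else Matters"`.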
You can use the short form if you prefer not to add a `weight` to the completions.

Arrays are also supported to index multiple values. The following indexes multiple completion entries (`input` and `weight`) for a single document:

Indexing context-enabled fields
You can use the `path` option previously mentioned to pull context values from another field in the document, or add a `contexts` option to the completion entry while indexing.

The following explicitly indexes context values along with completions:
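The snippet is missing; a sketch of indexing explicit context values, assuming a `contexts` object keyed by context name (index/type names and values are illustrative):

```json
PUT songs/song/2
{
  "title_suggest": {
    "input": "Like a Rolling Stone",
    "weight": 10,
    "contexts": {
      "genre": ["rock"]
    }
  }
}
```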
You can also configure the `path` option in the context mapping to pull values from another field as follows (assuming `path` for the genre context has been set to the genre field):

Query Interface
The point of indexing values as completions is to be able to run fast prefix-based searches on them. You can run Prefix, Fuzzy and Regex queries on all completion fields. In the case of a context-enabled completion field, providing no context means all contexts will be considered, but you cannot run a Context query on a completion field with no contexts. When a query is run on a context-enabled field, the contexts for a completion are returned with the suggestion.
Prefix Query
The following suggests completions from the field `title_suggest` that start with `titl`:
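The original request snippet is missing; a sketch against the 2.x-era `_suggest` endpoint, with illustrative index and suggestion names (`songs`, `song-suggest`):

```json
POST songs/_suggest
{
  "song-suggest": {
    "prefix": "titl",
    "completion": {
      "field": "title_suggest"
    }
  }
}
```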
The suggestions are sorted by their index-time `weight`.

Fuzzy Prefix Query
A fuzzy prefix query can serve typo-tolerant suggestions. It scores suggestions closer (based on their edit distance) to the provided prefix higher, regardless of their weight.
Specify `fuzzy`, as shown in (1), to use the typo-tolerant suggester (see the full options for `fuzzy`).
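A sketch of a fuzzy prefix request, where the `fuzzy` object corresponds to the (1) callout above; index and suggestion names and the `fuzziness` value are illustrative:

```json
POST songs/_suggest
{
  "song-suggest": {
    "prefix": "titl",
    "completion": {
      "field": "title_suggest",
      "fuzzy": {
        "fuzziness": 2
      }
    }
  }
}
```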
Regex Prefix Query
A regex prefix query matches all term prefixes that match a regular expression. The regex is anchored at the beginning but not at the end. The suggestions are sorted by their index-time `weight`.

Specify `regex`, as shown in (1), instead of `prefix` to use regular expressions (see the supported regular expression syntax).
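A sketch of a regex request, where `regex` replaces `prefix` as in the (1) callout; names and the pattern are illustrative:

```json
POST songs/_suggest
{
  "song-suggest": {
    "regex": "t[a-z]*",
    "completion": {
      "field": "title_suggest"
    }
  }
}
```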
Context Query
Adding the `contexts` option (1) to the query enables filtering and/or boosting suggestions based on their context values. This query scores suggestions by multiplying the query-time boost with the suggestion weight.

The contexts can also be specified without any boost:
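A sketch of a context query with one boosted and one unboosted context value, assuming a `contexts` object keyed by context name (all names and values illustrative):

```json
POST songs/_suggest
{
  "song-suggest": {
    "prefix": "titl",
    "completion": {
      "field": "title_suggest",
      "contexts": {
        "genre": [
          { "context": "rock", "boost": 2 },
          { "context": "indie" }
        ]
      }
    }
  }
}
```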
Geo Context Query:
The result will be scored such that the suggestions are first sorted by the distance between the corresponding geo context and the provided geo location, and then by the weight of the suggestions.
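A sketch of a geo context query, assuming a geo context named `location` in the mapping and passing a point as the context value (all names and coordinates illustrative):

```json
POST songs/_suggest
{
  "song-suggest": {
    "prefix": "titl",
    "completion": {
      "field": "title_suggest",
      "contexts": {
        "location": [
          {
            "context": {
              "lat": 43.6624,
              "lon": -79.3863
            }
          }
        ]
      }
    }
  }
}
```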
Example
The following performs a Fuzzy Prefix Query combined with a Context Query on a context-enabled completion field named `genre_song_suggest`.
This query will return all song names for the genres `rock` and `indie` that are within an edit distance of 2 from the prefix `like a roling st`. The song names with a genre of `rock` will be boosted higher than those of `indie`. The completion field values that share the longest prefix with `like a roling st` will be boosted higher still.
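The combined request described above can be sketched as follows, merging the `fuzzy` and `contexts` options shown earlier (index and suggestion names, the `fuzziness` value and the context syntax are assumptions):

```json
POST songs/_suggest
{
  "song-suggest": {
    "prefix": "like a roling st",
    "completion": {
      "field": "genre_song_suggest",
      "fuzzy": {
        "fuzziness": 2
      },
      "contexts": {
        "genre": [
          { "context": "rock", "boost": 2 },
          { "context": "indie" }
        ]
      }
    }
  }
}
```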
Payload
You can retrieve any document field values along with the completions using the `payload` option. The following returns the `url` field with each suggestion entry:

The response format is as follows:
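The original response snippet did not survive; a sketch of the shape of a suggest response carrying a `payload`, with purely illustrative values (suggestion name, text, score and `url`):

```json
{
  "song-suggest": [
    {
      "text": "titl",
      "offset": 0,
      "length": 4,
      "options": [
        {
          "text": "Title of the Record",
          "score": 34.0,
          "payload": {
            "url": [ "/songs/1" ]
          }
        }
      ]
    }
  ]
}
```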