Proposed simplifications to _suggest

anneveling commented 11 years ago

Recent changes to configure field suggestions (see http://www.elasticsearch.org/blog/you-complete-me/) are a step forward, but may be improved as follows.

I would suggest a default configuration that would work on any field (i.e. no special configuration needed in the mapping file).

Calling a suggest for a field would look like this:

curl -X POST localhost:9200/hotels/_suggest -d '{ "text": "m", "field": "title" }'

This would work on using all emitted tokens for the "title" field, whatever is specified there (analyzers, whether or not to lowercase, split) and autosuggest words from it. It's likely this would be too slow and indeed FSTs need to be stored at index time, and may need to be in the mapping file. If so, then not as a different type, but as an add-on flag:

{ "mappings": { "hotel": { "properties": { "title": { "type": "string", "store_suggest_fsts": true} } } } }

Note that the suggester would use the same field (query) analyzer on the input, so to also split the query!

This means that typing in "state un" would be analyzed in 2 tokens: "state" and "un" (because the title field is a default "string"). For both words the suggester could suggest alternatives, but by default it could do it only for the last word (assuming the user is still typing there). This could then match the word "union" if there is a document with "State of the Union" as the title. (of course also many other "un*" words).

This is different than the current implementation, because there the suggester only works for full matches from the beginning of the field: if we would have a "title_suggest": {"type":"completion"} field, then typing in "state un" would find NOTHING, because there is no title that matches "state un*"

This is better because it could work with 0 or very limited configuration, and also fits the use case of actual suggestions better (where we are not merely matching against a simple prefixable field, but against freetext).

It would be nice if the suggester could then also boost the "union" completion because there is a colocated "state" match, but that is left as an exercise for the ElasticSearch gurus ;-)

By deferring the definition of the atoms to match against to the existing, flexible analyzers, this gives this much more control. We could even add fuzziness or stemming to the match targets...

WDYT

s1monw commented 11 years ago

Anne, thanks for opening this issue. First of all I think I need to elaborate on how the actual suggestions work vs. a fulltext search. In a fulltext search you basically have the following steps:

tokenize
build a query (in your example "state" and "un*")
rewrite the query (in your case must:(state), should("unfold", "united", "unified", "unreal", "undefined", "under", ...) )
run the query and fined the union
return the result (stored fields)

in a suggestion you do things slighlty different

tokenize
build all possible paths through the tokenized phrase ("state un" in your case)
check if this is a prefix in the suggester
return the payload and the output form (this is important if you have different output forms than input forms like in asian languages or in certain domains)

when you build you datastructure you can use different tokenstreams like stopword tokensfiler to make "state of the union" work. If you use a naive stopword filter like on fulltext search you can easily bring in a bad user experience like somebody types state o and you return state of the union but once somebody types the next char you get nothing since you of is a stopword. So this already requires specialized token streams for suggest vs. search.

now let me go further and tell you what the most important property of suggestions is... speed!! You have 200ms at most! from the keystroke until the suggestion needs to be rendered! That means including latency from the client (that is even more important if you are on mobil devices), all the processing in your infra etc.

Ok seems doable but on a reasonable index a prefix query that is completely unbounded essentially is translated into a boolean query for all the expansions.... lets think of the state un case again or better state u were we run a prefix query for u ok that seems like fun.... and here our 200ms are gone already without latency taken into account. Now you can use edgeNGrams etc. but then you still need to score and load stored fields etc. and it requires certain configuration. Anyway.... it seems like there is a chance that we can make this work on an ordinary index, what could break it? I think what most users don't think about is that their infrastructure they run this on is build for N Queries Per Second but once you add suggest to it you suddenly get hit by N + N * AverageQueryLength and if you do that off a normal index even a single disk access can kill your entire user experience. With something like a pre-build dedicated in-memory structure you can easily survive this and then your suggestion / completion is actually useful.

if you want to make infix suggestions work, we are working on a suggest implementation that does that but it has several implications as well since it uses index sorting and has weight notions etc.

Aside of all the speed stuff, the biggest problem with running something on top of an ordinary index is that you don't have any notion of weight when you look at a term unfolded is as good as unreal or unified (note frequency is not a good indicator!) this means you literally need to look at all of them in all your segments...

By deferring the definition of the atoms to match against to the existing, flexible analyzers, this gives this much more control. We could even add fuzziness or stemming to the match targets...

you can do that with the completion suggester already! you can use stemmers and return the unstemmed version without loading stored fields. You can also use fuzzy matching (there is a nuts fast fuzzy prefix suggester already pushed to 0.90.4 you have all the flexibility any other analyzer would give you.

It would be nice if the suggester could then also boost the "union" completion because there is a colocated "state" match, but that is left as an exercise for the ElasticSearch gurus ;-)

I really appreciate that you left the hard part to us :) ...nevermind I can see that this sounds like the most likely thing to do but it's often in practice not feasible. if you have a small corpus this might be the way to go but in the ES case it's unlikely that your corpus is very small.

anneveling commented 11 years ago

Sure, I understand (on the high level) what the new suggester does and how it is different from search.

In my example "state un" would match on "union" even without a stopword filter for the "of the" in "State of the Union", that was my whole point.

Sure, it is a challenge. Sure, it needs to be fast. But that's just saying: yes this would be nice, but we haven't found a way to make it lightning-fast like we require all ES functions to work. That all the implementations we can think of are not fast enough, does not mean it's not possible by doing something smart.

My main point is, that the suggester (old and new) does not work for multiword suggestions, only like a standard prefix query. And all my real life use cases have always requested to work on multiple words separately as well. I have done some work in query analyzers to make that work, and was hoping that ES would start to come out of the box with a similar solution to work for autocompletion of words in a field that usually contains entire sentences (like a title), not just for controlled-vocabulary fields.

And I know the best way to do this would be to implement this myself and create a pull request. Was just wondering whether you would feel a suggester that works as suggested [sic] would be valuable if it would be fast enough

s1monw commented 11 years ago

In my example "state un" would match on "union" even without a stopword filter for the "of the" in "State of the Union", that was my whole point. I can see your point - I just want to manage expectations here. This suggest stuff has many aspects and stopwords play an even greater role here. What do you do if somebody types a stopword? Do you scarify 1. performance and 2. relevance? It's very hard though.

And I know the best way to do this would be to implement this myself and create a pull request. Was just wondering whether you would feel a suggester that works as suggested [sic] would be valuable if it would be fast enough

sure go ahead I am already looking forward to review it.

My main point is, that the suggester (old and new) does not work for multiword suggestions, only like a standard prefix query.

actually I don't think this is near to the truth. This is much more flexible than a prefix query, it has weighting, it's blazing fast, it has analyzer support and it works perfectly for titles.

clintongormley commented 10 years ago

@areek Implementing the AnalyzingInfixSuggester could be the way to go here, although it wouldn't be config free?

clintongormley commented 9 years ago

Closing in favour of #13692

elastic / elasticsearch

Proposed simplifications to _suggest #3565