apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Suggesters: highlighting (explicit markup of user-typed portions vs. generated portions in a suggestion) [LUCENE-4518] #5584

Open · asfimport opened this issue 12 years ago

asfimport commented 12 years ago

As a user, I would like the lookup result of the suggestion engine to contain information which allows me to distinguish the user-entered portion from the autocompleted portion of a suggestion. That information can then be used for e.g. highlighting.

Notes:

It's trivial if the suggestion engine only applies simple prefix search, as then the user-typed prefix is always a true prefix of the completion. However, it's non-trivial as soon as you use an AnalyzingSuggester, where the completion may (in extreme cases) be quite different from the user-provided input. As soon as case/diacritics folding, script adaptation (kanji/hiragana) come into play, the completion is no longer guaranteed to be an extension of the query. Since the caller of the suggestion engine (UI) generally does not know the implementation details, the required information needs to be passed in the LookupResult.
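To make the contrast concrete, here is a tiny sketch (plain Java, not Lucene API; it borrows the "praefi"/"präfix" example discussed below) of why the UI alone can only handle the trivial prefix case:

```java
public class NaiveHighlightDemo {
  public static void main(String[] args) {
    String typed = "praefi";
    String suggestion = "präfix";          // what an AnalyzingSuggester might return
    if (suggestion.startsWith(typed)) {
      // trivial case (plain prefix suggester): highlight the generated tail
      System.out.println(typed + "<b>" + suggestion.substring(typed.length()) + "</b>");
    } else {
      // folded match ("ae" vs. "ä"): the UI cannot tell where the typed part ends
      System.out.println(suggestion);
    }
  }
}
```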

Discussion on java-user:

> I haven't found a simple solution for the highlighting yet,
> particularly when using AnalyzingSuggester (where it's non-trivial).

Mike McCandless:

Ahh I see ... it is challenging in that case. Hmm. Maybe open an issue for this as well, so we can discuss/iterate?


Migrated from LUCENE-4518 by Oliver Christ, updated Jan 16 2013. Attachments: LUCENE-4518.patch

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't really see highlighting fitting into suggesters. Suggesters are already too complicated: this is a UI problem that can be handled outside of Lucene.

asfimport commented 12 years ago

Oliver Christ (migrated from JIRA)

I don't see how this can easily be addressed in the UI.

Go to http://www.google.de and enter "praefi" as the query. The top two completions are:

- präfi*x*
- präfi*nal*

Note that in both cases the prefix "präfi" is recognized although the query is "praefi".

To handle this in the UI, the UI layer would have to duplicate the Analyzer's logic about case and diacritics folding.

I agree that in general, this may not be possible at all, but in simpler cases (case folding, diacritics insensitivity) I should think it's feasible (but hard on the UI level).
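As a rough illustration of what "duplicating the Analyzer's logic" in the UI would mean, here is a toy sketch (not Lucene code; the folding rules and names are made up) that folds both strings the way a hypothetical suggester might and maps the folded match back onto the surface form:

```java
public class UiSideHighlight {

  // Hypothetical German-style folding: lower-case and expand umlauts/ß.
  static String foldChar(char c) {
    switch (Character.toLowerCase(c)) {
      case 'ä': return "ae";
      case 'ö': return "oe";
      case 'ü': return "ue";
      case 'ß': return "ss";
      default:  return String.valueOf(Character.toLowerCase(c));
    }
  }

  static String fold(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      sb.append(foldChar(s.charAt(i)));
    }
    return sb.toString();
  }

  /** How many characters of the surface-form suggestion are covered by the typed text, or -1. */
  static int coveredSurfaceChars(String typed, String suggestion) {
    String foldedTyped = fold(typed);
    StringBuilder foldedSoFar = new StringBuilder();
    for (int i = 0; i < suggestion.length(); i++) {
      foldedSoFar.append(foldChar(suggestion.charAt(i)));
      if (foldedSoFar.length() >= foldedTyped.length()) {
        // consumed enough folded output to compare against the folded typed text
        return foldedSoFar.toString().startsWith(foldedTyped) ? i + 1 : -1;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    String typed = "praefi";
    for (String suggestion : new String[] {"präfix", "präfinal"}) {
      int n = coveredSurfaceChars(typed, suggestion);
      // prints präfi<b>x</b> and präfi<b>nal</b>
      System.out.println(suggestion.substring(0, n) + "<b>" + suggestion.substring(n) + "</b>");
    }
  }
}
```

Even this toy version has to know the exact folding rules the suggester applies, which is precisely the duplication being objected to here.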

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I tend to agree with Oliver. I have done similar things because the frontend lacks information here. I agree it's non-trivial, but we should provide it if we can. For absolute correctness you'd likely need to reanalyze and intersect with the automaton :/

asfimport commented 12 years ago

Oliver Christ (migrated from JIRA)

In a classic FST, once you've consumed the input and start looking for completions, you know how many input symbols you've consumed, how many output symbols you've collected so far, and how many symbols the TopNSearcher appends (i.e. how long, in characters, the completed portion of the string is). That information should be sufficient to explicitly distinguish the two parts. As long as completions don't "surround" the user-entered portions (google: "sox ticket purc"), and the prefix doesn't for some reason end in the "middle" of a UTF-8 byte sequence, this may be enough to cover basic use cases: just put the length of the covered prefix (or completed suffix) into each LookupResult. I'm assuming that the input and output symbols are "reasonably aligned" in the transition labels, which may not be the case in the current implementation (I haven't gotten to that level of detail yet :-( ).

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It'd be quite easy to back trace on each topN path to get the point at which it "started" (= the end of where we matched based on the user's input).

The challenge is ... that's the analyzed form, not the surface form; in general we need the reverse mapping from analyzed form offsets back to surface form offsets ... and the OffsetAttribute gives us that, but unfortunately only gives us start/end of each token.

Another challenge is we convert the analyzed form into a graph (TokenStreamToAutomaton), so we'd somehow need to get the surface form offsets through there too.
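For reference, the per-token start/end offsets mentioned above are exposed by OffsetAttribute while consuming a TokenStream; a minimal sketch (the field name "f" is arbitrary):

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class OffsetDump {
  /** Prints each analyzed token with its start/end offsets into the surface text. */
  static void dumpOffsets(Analyzer analyzer, String surface) throws IOException {
    try (TokenStream ts = analyzer.tokenStream("f", surface)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // analyzed token -> [startOffset, endOffset) in the original text
        System.out.println(term + " -> [" + offsets.startOffset() + ", " + offsets.endOffset() + ")");
      }
      ts.end();
    }
  }
}
```

As noted above, this only gives start/end per token, not a character-by-character mapping.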

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Actually, the start/end offset of each token is probably sufficient here? Because the analyzed form of the partial token the user typed will show the end offset we want, I think?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm... if we simply record the partial output (surface form) we've accumulated so far when we add a start path into the TopNSearcher, that could make a good hilite candidate.

The FST will always output "eagerly", meaning on seeing a given partial input, it will output as much as is unambiguously possible. So I suspect the equivalent in Lucene of the "praefi" example would just work.

The only problem I can think of where this won't work is if the completion is [somewhat] deterministic. E.g. if you had only added "electron" and "electronics" to your suggester, and the user has typed only 'e' so far, the output on traversing only 'e' would be "electron", which is way too much to hilite. But in a "real" app, where there are tons and tons of suggestions, I suspect this would become a vanishingly minor issue.
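A toy way to see that caveat outside of any FST (plain Java; names are made up): the eagerly emitted output for a partial input is the longest prefix shared by all of its completions, so with only two entries it can run far past what the user typed.

```java
import java.util.List;

public class EagerOutputDemo {
  // Longest prefix common to all dictionary entries starting with `typed`.
  static String eagerOutput(String typed, List<String> dictionary) {
    String common = null;
    for (String s : dictionary) {
      if (!s.startsWith(typed)) continue;
      if (common == null) { common = s; continue; }
      int i = 0;
      while (i < Math.min(common.length(), s.length()) && common.charAt(i) == s.charAt(i)) i++;
      common = common.substring(0, i);
    }
    return common == null ? "" : common;
  }

  public static void main(String[] args) {
    // Prints "electron": far more than the single character the user typed.
    System.out.println(eagerOutput("e", List.of("electron", "electronics")));
  }
}
```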

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Initial patch that just records the starting output from when the path was added to the queue, and uses that to set the new prefixLength in LookupResult.

Currently I only fixed WFST, Analyzing and Fuzzy suggesters to set this. WFST will always be correct (it's trivial), while Analyzing/Fuzzy can sometimes be too long (if there is a common prefix to all possible completions from that starting point).
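For context, a caller-side sketch of how the proposed field might be consumed (prefixLength exists only in the attached LUCENE-4518.patch, not in released Lucene; the suggester setup is omitted):

```java
// `suggester` is assumed to be some Lookup implementation patched as in
// LUCENE-4518.patch; prefixLength below is the field the patch proposes.
for (Lookup.LookupResult result : suggester.lookup("praefi", false, 5)) {
  String text = result.key.toString();
  int n = result.prefixLength;   // surface characters covered by the user's input (per the patch)
  System.out.println(text.substring(0, n) + "<b>" + text.substring(n) + "</b>");
}
```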

asfimport commented 11 years ago

Oliver Christ (migrated from JIRA)

I’ve played around with Mike’s patches, but for the AnalyzingSuggester the results have been mixed. Since the transition symbols in the automaton are not closely aligned between the surface and the analyzed form, LookupResult.prefixLength (which attempts to represent the length of the surface string which corresponds to the lookup string) is off quite a bit, leading to very confusing highlighting in non-trivial cases.

I think this is ultimately due to the way the FST is constructed, but that seems to be non-trivial to change.

In addition, just returning the (surface) prefix length which corresponds to the lookup string is not sufficient for more complex suggesters, such as "infix suggesters" where the user-provided string is not a prefix of the full surface term (google.com: type in "sox rumor"). What the suggesters ultimately would have to return is a list of text chunks where each chunk has a flag indicating whether it's based on the lookup string or has been auto-completed.

So at this point we are back at trying to identify the matched string portions by other means, which isn’t perfect either, but acceptable in most cases. :(
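For reference, the chunk-based result shape described above is not something Lucene provides, but a sketch of it might look roughly like this (all names are hypothetical):

```java
import java.util.List;

// Hypothetical result shape: each suggestion is a sequence of chunks, each
// flagged as coming from the lookup string or from auto-completion, so infix
// matches can be represented as well.
record Chunk(String text, boolean fromLookupString) {}

record HighlightedSuggestion(List<Chunk> chunks) {
  String toHtml() {
    StringBuilder sb = new StringBuilder();
    for (Chunk c : chunks) {
      sb.append(c.fromLookupString() ? c.text() : "<b>" + c.text() + "</b>");
    }
    return sb.toString();
  }
}
```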

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, thanks for testing, Oliver ... sounds like the approach here is a no-go in practice.