apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Analyzing Suggester [LUCENE-3842] #4915

asfimport closed this issue 12 years ago

asfimport commented 12 years ago

Since we added shortest-path wFSA search in #4788, and generified the comparator in #4874, I think we should look at implementing suggesters that have more capabilities than just basic prefix matching.

In particular I think the most flexible approach is to integrate with Analyzer at both build and query time, such that we build a wFST with:

  input: analyzed text such as ghost0christmas0past <-- byte 0 here is an optional token separator
  output: surface form such as "the ghost of christmas past"
  weight: the weight of the suggestion

We make an FST with PairOutputs<weight,output>, but only do the shortest-path operation on the weight side (like the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion.
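
As a rough sketch of that structure (not the patch itself: the Entry holder, the cost inversion, and the exact FST builder signatures are assumptions here and vary by Lucene version), building such an FST might look like:

// Pair a weight output (searched via shortest path) with a byte-sequence output
// (the surface form that becomes the suggestion text).
PairOutputs<Long,BytesRef> outputs = new PairOutputs<>(
    PositiveIntOutputs.getSingleton(),     // weight side
    ByteSequenceOutputs.getSingleton());   // surface-form side
Builder<PairOutputs.Pair<Long,BytesRef>> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);

IntsRefBuilder scratch = new IntsRefBuilder();
for (Entry e : entriesSortedByAnalyzedBytes) { // hypothetical Entry holder; FST inputs must be added in sorted order
  long cost = Long.MAX_VALUE - e.weight;       // invert so the "shortest" path is the highest-weighted suggestion
  builder.add(Util.toIntsRef(e.analyzedBytes, scratch),
              outputs.newPair(cost, e.surfaceForm));
}
FST<PairOutputs.Pair<Long,BytesRef>> fst = builder.finish();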

This allows a lot of flexibility:

According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~100,000 QPS), and the FST size does not explode (it's short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc.).


Migrated from LUCENE-3842 by Robert Muir (@rmuir), 2 votes, resolved Oct 04 2012
Attachments: LUCENE-3842.patch (versions: 18), LUCENE-3842-TokenStream_to_Automaton.patch
Linked issues:

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Messy, dirty, drinking-beer-on-a-Saturday-afternoon style patch with tons of nocommits.

particularly:

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Also, just so I don't forget: we should allow separate specification of "index-time" and "query-time" analyzers... because in cases where you are adding synonyms/WDF/etc. there is a tradeoff (bigger FST versus slower query-time performance).

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Linking to SOLR-2479 since I think this is the most obvious way to support infix suggestions... we already have the analyzer splitting the terms into "words", so it's just a matter of indexing token stream suffixes if the user wants.

However, I think anything that supports infixing should look at this as word-level edit distance... we should apply some cost penalty to these since they are somewhat of a "reach" for a query suggestion.

I think we want to allow several parameters for that though (a rough sketch follows the list):

  1. maximum infix length (we should only infix suffixes up to some N)
  2. infix penalty (I think when indexing infixed suffixes we should simply bake a penalty into the FST)
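
As a rough illustration of (1) and (2), reading "maximum infix length" as the maximum number of leading words that may be skipped, the suffix entries could be generated like this at build time (the separator byte, the cost convention, and all names below are assumptions, not the patch):

import java.util.ArrayList;
import java.util.List;

// Illustrative only: emit the full analyzed form plus word-level suffixes (up to
// maxInfixLength leading words skipped), baking a fixed penalty into each suffix's cost so
// that infix matches rank below true prefix matches under a "lower cost is better" search.
class InfixEntries {
  static final String SEP = "\u0000";  // the optional token separator from the issue description

  static List<Entry> expand(List<String> words, long cost, int maxInfixLength, long infixPenalty) {
    List<Entry> out = new ArrayList<>();
    for (int skip = 0; skip < words.size() && skip <= maxInfixLength; skip++) {
      String analyzed = String.join(SEP, words.subList(skip, words.size()));
      out.add(new Entry(analyzed, skip == 0 ? cost : cost + infixPenalty));
    }
    return out;
  }

  static class Entry {
    final String analyzedForm;
    final long cost;
    Entry(String analyzedForm, long cost) { this.analyzedForm = analyzedForm; this.cost = cost; }
  }
}
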
asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

VERY cool :)

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Once posLength is in, I think a very simple way to handle multiple paths at query time is to open up the TopNSearcher class in oal.fst.Util.

Currently the API only allows you to pass in a single starting FST node, but we can easily improve this by adding e.g. an addStartNode(FST.Arc<T>, int startingCost) method instead. This way the app could create a TopNSearcher, add any number of start nodes with their initial path costs, then call .search() to get the best completions.

The only limitation of this is that all differences must be pre-computed as an initial path cost that's "consistent" with how the path costs are accumulated (with the Outputs.add) during searching; I'm not sure if that'd be overly restrictive here?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch with a static utility method to translate a TokenStream to a byte-by-byte automaton.

It respects posInc and posLength but there are lots of nocommits still!

I think we can use this at index time and maybe at query time to handle graph analyzers... but we still need the "enumerate all paths from this automaton" method...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Wow, that's awesome Mike... it's so small too!

I think we can integrate these two patches into one; with your patch and your addStartNode idea, the AnalyzingSuggester should be able to support both index-time and query-time synonyms/WDF/whatever crazy stuff we throw at it :)

but we still need the "enumerate all paths from this automaton" method...

Brics has some code for this (puts all the accepted strings into a set). We should be able to do something similar, to create a set of BytesRef?

But really, I'm not sure we need a 'general' method for this. I think we should just have an enumerator for finite automata (e.g. a token stream), as we can probably make this a 'real' enumerator rather than creating a massive list/set. We don't need the set deduplication at all either, because it's finite.
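
For reference, a naive recursive enumeration over the Brics API might look like the sketch below; it materializes a List rather than being the lazy enumerator argued for above, and it assumes the language is finite (an acyclic automaton), otherwise it would never terminate:

import java.util.ArrayList;
import java.util.List;

import dk.brics.automaton.Automaton;
import dk.brics.automaton.State;
import dk.brics.automaton.Transition;

public class FiniteStrings {

  /** Collects every string accepted by a finite-language (acyclic) Brics automaton. */
  public static List<String> getFiniteStrings(Automaton a) {
    List<String> out = new ArrayList<>();
    collect(a.getInitialState(), new StringBuilder(), out);
    return out;
  }

  private static void collect(State s, StringBuilder prefix, List<String> out) {
    if (s.isAccept()) {
      out.add(prefix.toString());
    }
    for (Transition t : s.getTransitions()) {
      for (char c = t.getMin(); ; c++) {       // Brics transitions are char ranges
        prefix.append(c);
        collect(t.getDest(), prefix, out);
        prefix.deleteCharAt(prefix.length() - 1);
        if (c == t.getMax()) {                 // checked here to avoid char overflow at 0xFFFF
          break;
        }
      }
    }
  }
}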

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Patch with a static utility method to translate a TokenStream to a byte-by-byte automaton.

I looked at the patch but I don't fully get what it does. Looks like a combination of state sequence unions, am I right?

Brics has some code for this (puts all the accepted strings into a set).

It's probably a naive walk with an acceptor. I've always wanted to see what Brics returns from that method for an automaton equivalent to .* :)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I've always wanted to see what Brics returns from that method for an automaton equivalent to .*

Oh, so it is only for finite automata, so it returns null in that case:

http://www.brics.dk/automaton/doc/dk/brics/automaton/SpecialOperations.html#getFiniteStrings%28dk.brics.automaton.Automaton%29

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I looked at the patch but I don't fully get what it does. Looks like a combination of state sequence unions, am I right?

Well, the conversion should ultimately be more useful for the suggester to intersect with the FST than a token stream, because a token stream is like a word-level automaton: if dogs is a synonym for dog, then we have smelly dog|dogs(positionIncrement=0).

So for our intersection, we would prefer it to be deterministic at the 'character' (byte) level instead. So the conversion should produce an automaton of smelly dog(s?) in regex notation... this is easier to work with.

At index time it's useful too, because in the FST we only care about all the possible byte strings, so this should be easier to enumerate than a token stream (especially if you consider multi-word synonyms, decompounded terms, etc., where some tokens span many positions).
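
For example, with the TokenStreamToAutomaton utility from Mike's patch (a sketch: CannedTokenStream and Token come from the test framework, and the exact entry point, a static utility in the patch vs. the TokenStreamToAutomaton class in later Lucene, may differ):

// "dogs" is stacked on "dog" with positionIncrement=0, as a SynonymFilter would emit it:
Token dog  = new Token("dog",  7, 10);
Token dogs = new Token("dogs", 7, 10);
dogs.setPositionIncrement(0);
TokenStream ts = new CannedTokenStream(new Token("smelly", 0, 6), dog, dogs);

// The resulting byte-level automaton accepts both "smelly<SEP>dog" and "smelly<SEP>dogs",
// i.e. roughly "smelly dog(s?)" in the regex notation above.
Automaton a = new TokenStreamToAutomaton().toAutomaton(ts);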

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Ok, got it, thanks. I wonder if it's always possible, but I bet you can write a random test to ensure that :)

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

It also occurred to me that it would be interesting to have a naive minimization technique for Brics automata which would (for automata with a finite language):

While it may seem like an idea with crazy overhead, it may actually be a viable alternative to the minimization algorithms in Brics for very large automata. Interesting.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Dawid: hmm, well, the transitions in Brics are ranges in sorted order, so, if it's finite, couldn't we just enumerate the language in sorted order while building the minimal automaton incrementally in parallel?

Or am I missing something... it's Sunday... :)

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Yep, sure you could (I admit I haven't looked at Brics in a long time so I don't remember the details, but I do remember the overhead was significant on optimization; this was a while ago - maybe it's improved).

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch, tying in Mike's patch too.

Currently my silly test fails because it trips Mike's assert: it starts with a stopword :)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I also don't think we really need this generic getFiniteStrings; it's just to get it off the ground. We can just write out the possibilities on the fly, I think, and it will be simpler...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Merged the patch up to trunk. But it still trips the assert in tokenStreamToAutomaton because the test begins with a stopword.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I see the problem. It actually happens on the second term (we have ghost/2 christmas/2). The problem is it tries to find the last state to connect the new node to, but it uses a hashmap based on position for that... so if there are holes this returns null.

I think for this code we would add nodes for holes (text=POS_SEP) to simplify the logic?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch: I fixed the bug in tokenStreamToAutomaton (just use lastEndPos instead).

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch, fixing TS2A to insert holes ... this is causing the AnalyzingCompletionTest.testStandard to fail... we have to fix its query-time to insert holes too...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

testStandard is also bogus: it has 2 asserts. The first one should pass, but the second one should really only work if you disable position increments in the (mock) StopFilter.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, getting us closer ... I opened up Util.shortestPaths so that you can make a TopNSearcher and seed multiple start nodes into its queue... and I created an intersectPaths method to intersect an automaton with an FST, gathering the end nodes and the accumulated outputs. Then I fixed lookup to use these two to enumerate and complete the paths.
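
Put together, the lookup flow is roughly the sketch below (names follow this discussion and the signatures quoted later in the thread; queryAnalyzer, key, topN, maxQueueDepth and weightComparator are illustrative stand-ins, and the committed signatures may differ):

// Analyze the query text into a byte-level automaton (holes, synonyms, etc. included).
Automaton queryAutomaton =
    new TokenStreamToAutomaton().toAutomaton(queryAnalyzer.tokenStream("", new StringReader(key)));

// 1) Intersect the query automaton with the suggest FST, gathering every FST node reachable
//    by an accepted prefix along with the output accumulated on the way there.
List<FSTUtil.Path<PairOutputs.Pair<Long,BytesRef>>> prefixPaths =
    FSTUtil.intersectPrefixPaths(queryAutomaton, fst);

// 2) Seed the TopNSearcher with each such node and its accumulated output, then run the
//    shortest-path search to complete the paths into full suggestions.
Util.TopNSearcher<PairOutputs.Pair<Long,BytesRef>> searcher =
    new Util.TopNSearcher<>(fst, topN, maxQueueDepth, weightComparator);
for (FSTUtil.Path<PairOutputs.Pair<Long,BytesRef>> path : prefixPaths) {
  searcher.addStartPaths(path.fstNode, path.output, true);
}
// searcher.search() then yields the completed, lowest-cost paths; the BytesRef half of each
// output is the surface form returned as the suggestion.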

The first test in testStandard now passes, but not the 2nd one (I haven't tried disabling posincs in the StopFilter yet).

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, when I pass false for enablePositionIncrements to MockAnalyzer in testStandard then both cases pass... and I added a 3rd case testing for "ghost chris" and it also passes... cool!

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, fixing some nocommits, and a separate bug where WFSTSuggester can return a dup suggestion. Still more to do...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Nice catch on the exactFirst dup problem!

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

When running the benchmark (LookupBenchmarkTest) I noticed the FST size has increased since the original patch. I wonder why this is? The benchmark uses KeywordAnalyzer... it could be (likely even) that the original patch had a bug and now it's correct, but maybe it's worth investigating...

asfimport commented 12 years ago

Sudarshan Gaikaiwari (migrated from JIRA)

I was not able to apply the latest patch cleanly

smg@dev21:/lucene_trunk$ patch -p0 < /LUCENE-3842.patch
patching file lucene/test-framework/src/java/org/apache/lucene/util/RollingBuffer.java
patching file lucene/core/src/java/org/apache/lucene/util/automaton/SpecialOperations.java
patching file lucene/core/src/java/org/apache/lucene/util/RollingBuffer.java
Hunk #1 FAILED at 109.
1 out of 1 hunk FAILED -- saving rejects to file lucene/core/src/java/org/apache/lucene/util/RollingBuffer.java.rej
patching file lucene/core/src/java/org/apache/lucene/util/fst/Util.java
patching file lucene/core/src/java/org/apache/lucene/analysis/TokenStreamToAutomaton.java
patching file lucene/core/src/test/org/apache/lucene/analysis/TestGraphTokenizers.java
patching file lucene/core/src/test/org/apache/lucene/util/automaton/TestSpecialOperations.java
patching file lucene/suggest/src/test/org/apache/lucene/search/suggest/analyzing/AnalyzingCompletionTest.java
patching file lucene/suggest/src/test/org/apache/lucene/search/suggest/LookupBenchmarkTest.java
patching file lucene/suggest/src/java/org/apache/lucene/search/suggest/analyzing/AnalyzingCompletionLookup.java
patching file lucene/suggest/src/java/org/apache/lucene/search/suggest/analyzing/FSTUtil.java

I needed to copy RollingBuffer.java from test-framework to core for the patch to apply cleanly.

cp lucene/test-framework/src/java/org/apache/lucene/util/RollingBuffer.java lucene/core/src/java/org/apache/lucene/util/

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hi Sudarshan, sorry that was my bad: I had svn mv'd RollingBuffer but when I created the patch I failed to pass --show-copies-as-adds to svn ... so you have to do that mv yourself before applying the patch...

asfimport commented 12 years ago

Sudarshan Gaikaiwari (migrated from JIRA)

Hi Michael,

Thanks a lot for opening up Util.shortestPaths. Now that I can seed the queue with the initial nodes using addStartPaths, the performance of the GeoSpatialSuggest that I presented at Lucene Revolution has improved by 2x.

While migrating my code to use this patch, I noticed that I would hit the following assertion in addIfCompetitive.

          path.input.length--;
          assert cmp != 0;
          if (cmp < 0) {

This assert fires when it is not possible to differentiate between the path that we are trying to add to the queue and the bottom of the queue. This happens because the different paths that lead to FST nodes during the automaton/FST intersection are not stored. So the input path used to differentiate paths contains only the characters that have been consumed from one of the initial FST nodes.

From your comments on the addStartPaths method I think that you have foreseen this problem.

    // nocommit this should also take the starting
    // weight...?

    /** Adds all leaving arcs, including 'finished' arc, if
     *  the node is final, from this node into the queue.  */
    public void addStartPaths(FST.Arc<T> node, T startOutput, boolean allowEmptyString) throws IOException {

Here is a unit test that causes the assert to be triggered.

  public void testInputPathRequired() throws Exception {
    TermFreq keys[] = new TermFreq[] {
        new TermFreq("fast ghost", 50),
        new TermFreq("quick gazelle", 50),
        new TermFreq("fast ghoul", 50),
        new TermFreq("fast gizzard", 50),
    };

    SynonymMap.Builder b = new SynonymMap.Builder(false);
    b.add(new CharsRef("fast"), new CharsRef("quick"), true);
    final SynonymMap map = b.build();

    final Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.SIMPLE, true);
        TokenStream stream = new SynonymFilter(tokenizer, map, true);
        return new TokenStreamComponents(tokenizer, new RemoveDuplicatesTokenFilter(stream));
      }
    };
    AnalyzingCompletionLookup suggester = new AnalyzingCompletionLookup(analyzer);
    suggester.build(new TermFreqArrayIterator(keys));
    List<LookupResult> results = suggester.lookup("fast g", false, 2);
  }

Please let me know if the above analysis looks correct to you, and I will start trying to fix this by storing paths during the FST/automaton intersection.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hi Sudarshan, thanks for raising this ... I'll have a look...

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Please let me know if the above analysis looks correct to you

That sounds right to me: when we do the initial intersection of the suggest FST with A (from the analyzer), it looks like we must keep the full input path (BytesRef) and carry that over when we pass the node to addStartPaths.

Either that, or ... we change how we break ties in the scores? E.g. maybe we can tie-break instead by the surface form?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

No changes: just syncing up with trunk.

asfimport commented 12 years ago

Sudarshan Gaikaiwari (migrated from JIRA)

maybe we can tie-break instead by the surface form?

The FST construction guarantees that the input paths leading to different nodes are unique, while I don't think we have such a guarantee about the surface form.

I have attached a patch that modifies the intersectPrefixPaths method to keep track of the input paths as the automaton and FST are intersected. Please let me know if that looks OK to you?
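
Conceptually the change is to carry the consumed input along with each partial path, something like the minimal sketch below (field and class names are illustrative, not the attached patch's exact code):

static class PrefixPath<T> {
  final FST.Arc<T> fstNode;  // where the intersection stopped in the suggest FST
  final T output;            // output (e.g. cost) accumulated along the way
  final IntsRef input;       // the exact labels consumed to get here, passed on to addStartPaths
                             // so that ties in cost can later be broken deterministically
  PrefixPath(FST.Arc<T> fstNode, T output, IntsRef input) {
    this.fstNode = fstNode;
    this.output = output;
    this.input = input;
  }
}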

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Sudarshan, those changes look great! You now also record the input for every Path coming back from intersectPrefixPaths, and pass that to addStartPaths. Sorry it took so long to have a look!

I started from your patch, got it up to trunk again (there was one compilation error, I think), added some comments, and downgraded a nocommit.

I think we are close here ... I'll make a branch so we can iterate.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I created branch: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3842

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks for resurrecting this from the dead! I had forgotten just how fun this issue was :)

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch, addressing a few nocommits and getting PRESERVE_HOLES and PRESERVE_SEPS working. I did this by adding options to AnalyzingCompletionLookup, and then post-processing the returned automaton from TokenStreamToAutomaton.

I also added a couple new nocommits ....

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, removing "preserve holes" option from AnalyzingCompletionLookup: you can simply tell your StopFilter whether or not holes are meaningful.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1 for that. Let's keep this as simple as possible and leave the responsibility with the analyzer as much as possible.

My main concern for PRESERVE_SEPS was the Japanese use case: we don't much care what the actual tokenization of the Japanese words was, only about the concatenated reading string. If the tokenization is a little off but the concatenation of all the readings is still correct, then we are OK. So it makes it more robust against tokenization differences, especially considering it's partial inputs going into this thing (not whole words).

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch, getting "duplicates" (different surface forms analyzing to same analyzed bytes) fully working, with exactFirst.

I did this by having the FST handle multiple outputs for a single input (this is something it can do in general but we don't often use it)...

But ... I don't really like the approach: it makes the path/results handling more complex because now a single "path" can correspond to multiple results.

I think it would be better/cleaner to append unique (disambiguating) bytes to the end of the analyzed bytes (this was Robert's original idea): then each path is a single result. The only downside I can think of is we will have to reserve a byte (0xFF?), i.e. we'd append 0xFF 0x00, then 0xFF 0x01 to the next duplicate, ... but since these input BytesRefs are "typically" UTF-8 ... this seems not so bad? They can of course in general be arbitrary bytes since they are produced by the analysis process...

asfimport commented 12 years ago

Bill Bell (migrated from JIRA)

A common suggester use case is to boost the results by closeness (auto-suggest across the whole USA but boost the results in the suggester by geo-distance). Would love to get a faster response with that. At Lucene Revolution 2012 in Boston a speaker did discuss using WFST to do this, but I have yet to figure out how to do it.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Bill, can you start a different thread for that? It's unrelated to this issue.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think it would be better/cleaner to append unique (disambiguating) bytes to the end of the analyzed bytes (this was Robert's original idea): then each path is a single result. The only downside I can think of is we will have to reserve a byte (0xFF?), i.e. we'd append 0xFF 0x00, then 0xFF 0x01 to the next duplicate, ... but since these input BytesRefs are "typically" UTF-8 ... this seems not so bad? They can of course in general be arbitrary bytes since they are produced by the analysis process...

I don't understand why we have to reserve any bytes. We can append arbitrary bytes of any sort to the end of the input side; this will have no effect on the actual surface form that we suggest.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

And as far as exactFirst: let's just keep it simple and have it be a surface form comparison?

This is really what I think most people will expect anyway: in the case of "duplicates" and exactFirst, nobody really cares about nor sees the underlying analyzed form.

So I don't think we should have multiple outputs for the FST here.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Sorry Mike, I'm still catching up and sadly just creating a lot of noise and thinking out loud.

I think you actually have a point with reserving a byte when there are duplicates :)

But at the same time I still think the surface form is valuable for this option too... when we reserve a byte, can we also sort the duplicate outputs up front, in such a way that we can start traversing the output side to look for an exactly matching surface form?

So it's like within that 'subtree' of the FST (the duplicates for the exact input), we can binary search?

Otherwise exactFirst would be inefficient in some cases (as we have to do a special 'walk' here). Christian Moen showed me some scary stuff on his Japanese phone as far as readings->kanji forms ... I think if we can, it might be good to be cautious and keep this fast...

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, going back to deduping on the input side... it's not done yet: I think we need to escape the bytes we steal.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, with escaping working, and exactFirst now means an exact surface-form match. I also made maxSurfaceFormsPerAnalyzedForm and maxGraphExpansions ctor parameters.
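
For anyone following along, the escaping/dedup scheme being discussed works roughly like this (an illustration of the idea only; the byte values and the helper below are assumptions, not the committed code):

import java.io.ByteArrayOutputStream;

class DedupBytes {
  /**
   * Escape 0xFE/0xFF inside the analyzed bytes so that a raw 0xFF can never occur there, then
   * append 0xFF plus an ordinal so each duplicate analyzed form becomes a unique FST input.
   * The surface form (the FST output) is untouched, so the suggestion text is unaffected.
   */
  static byte[] disambiguate(byte[] analyzed, int dedupOrdinal) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte b : analyzed) {
      if (b == (byte) 0xFE)      { out.write(0xFE); out.write(0x00); }  // escape the escape byte
      else if (b == (byte) 0xFF) { out.write(0xFE); out.write(0x01); }  // escape the reserved byte
      else                       { out.write(b); }
    }
    out.write(0xFF);           // reserved separator: cannot appear in the escaped bytes above
    out.write(dedupOrdinal);   // 0x00 for the first surface form, 0x01 for the next duplicate, ...
    return out.toByteArray();
  }
}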

I did the simple linear scan for now ... I think that's fine to commit ("N" is bounded). We may be able to optimize later with something similar to getByOutput...

I think this patch is ready to commit to the branch ... and I think we are close overall to landing ... still a number of nocommits to resolve though.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Applyable patch from the branch (you just have to remove the 0-byte RollingBuffer.java in test-framework after applying it).

I think it's ready ...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

in TStoA:

if (pos == -1 && posInc == 0) {
  // TODO: hmm are TS's still allowed to do this...?
  posInc = 1;
}

NO they are not! :)

As far as the limitations, I feel like if the last token's endOffset != length of input, that might be pretty safe in general (e.g. StandardTokenizer) because of how Unicode works... I have to think about it.

Strange that the FST size increased so much? If I run the benchmark:

[junit4:junit4]   2> -- RAM consumption
[junit4:junit4]   2> JaspellLookup   size[B]:    9,815,152
[junit4:junit4]   2> TSTLookup       size[B]:    9,858,792
[junit4:junit4]   2> FSTCompletionLookup size[B]:      466,520
[junit4:junit4]   2> WFSTCompletionLookup size[B]:      507,640
[junit4:junit4]   2> AnalyzingCompletionLookup size[B]:    1,832,952

I don't know if we should worry about that, but it seems somewhat large for just using KeywordTokenizer.

 * <b>NOTE</b>: Although the {@link TermFreqIterator} API specifies
 * floating point weights

That's obsolete. See WFSTSuggester in trunk, where I fixed this already.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Rob, good feedback ... I'll post a new patch changing that posInc check to an assert and removing that obsolete NOTE.

As far as the limitations, I feel like if the last token's endOffset != length of input, that might be pretty safe in general (e.g. StandardTokenizer) because of how Unicode works... I have to think about it.

I think we should try that! This way the suggester can "guess" whether the input text is still inside the last token.

But this won't help the StopFilter case, i.e. if the user types 'a' then StopFilter will still delete it even though the token isn't "done" (i.e. maybe the user intends to type 'apple').

Still it's progress so I think we should try it ...

I'm not sure why the FST is so much larger ... the outputs should share very well with KeywordTokenizer ... hmm, what weights do we use for the benchmark?