apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

add suggester that uses shortest path/wFST instead of buckets [LUCENE-3714] #4788

Closed — asfimport closed this issue 12 years ago

asfimport commented 12 years ago

Currently the FST suggester (really an FSA) quantizes weights into buckets (e.g. single byte) and puts them in front of the word. This makes it fast, but you lose granularity in your suggestions.

Lately the question was raised: if you build Lucene's FST with PositiveIntOutputs, does it behave the same as a tropical-semiring wFST?

In other words, after completing the word, we instead traverse min(output) at each node to find the 'shortest path' to the best suggestion (with the highest score).

This means we wouldn't need to quantize weights at all, and it might make some operations (e.g. adding fuzzy matching) a lot easier.
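The min-output traversal can be sketched on a toy trie (an illustrative model only, not Lucene's FST classes): each node caches the smallest weight reachable below it, standing in for min(output), and the best suggestion is found by always descending into the child that preserves that minimum.

```java
import java.util.*;

// Toy model of weighted completion: a trie where each complete term
// stores a long weight (lower = better, as in a tropical min/+ semiring).
// bestCompletion() walks down always taking the subtree with the smallest
// minimal weight -- the analogue of following min(output) at each FST node.
class ToyTrie {
    final TreeMap<Character, ToyTrie> children = new TreeMap<>();
    long termWeight = Long.MAX_VALUE; // weight if a term ends at this node
    long minBelow = Long.MAX_VALUE;   // smallest weight in this subtree

    void add(String term, long weight) {
        ToyTrie node = this;
        node.minBelow = Math.min(node.minBelow, weight);
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new ToyTrie());
            node.minBelow = Math.min(node.minBelow, weight);
        }
        node.termWeight = weight;
    }

    // Best (lowest-weight) completion of the prefix, or null if absent.
    String bestCompletion(String prefix) {
        ToyTrie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        StringBuilder out = new StringBuilder(prefix);
        while (node.termWeight != node.minBelow) {
            for (Map.Entry<Character, ToyTrie> e : node.children.entrySet()) {
                if (e.getValue().minBelow == node.minBelow) {
                    out.append(e.getKey());
                    node = e.getValue();
                    break;
                }
            }
        }
        return out.toString();
    }
}
```

Because the weights are exact longs rather than bucketed bytes, no granularity is lost in the result.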

(attached image: out.png)


Migrated from LUCENE-3714 by Robert Muir (@rmuir), resolved Feb 19 2012. Attachments: LUCENE-3714.patch (versions: 6), out.png, TestMe.java

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

patch that Mike and I came up with that finds the minimal output from an arc, and a random test showing it works.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

A problematic example where the root arcs, when traversed in min-to-max output order, each yield only a single competitive suggestion (while possibly lots of other suggestions under the same arc should be skipped). This is created by the following input:

aa|N ab|1 ba|N bb|2 ca|N cb|3 ..

Collecting the K-th suggestion with the smallest score will, pessimistically, require scanning all of the arcs. Note that you can put arbitrarily large subtrees on the _a|N nodes, like:

aaa|N aab|N aac|N

etc.
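A sketch of a generator for this bad case (the actual generator is in the attached TestMe.java; this version and its names are mine): each root letter gets one arbitrarily bad subtree (weight N) plus a single competitive term, so a greedy scan cannot rule out any root arc.

```java
import java.util.*;

// Hypothetical generator for the adversarial input described above.
// Produces "term|weight" pairs: for each root letter x, "xa|N" is the
// arbitrarily bad subtree and "xb|i" is the one competitive term.
class BadCaseGen {
    static List<String> generate(int arcs, long n) {
        List<String> entries = new ArrayList<>();
        for (int i = 0; i < arcs; i++) {
            char c = (char) ('a' + i);
            entries.add(c + "a|" + n);       // the bad subtree under this arc
            entries.add(c + "b|" + (i + 1)); // the single good term
        }
        return entries;
    }
}
```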

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

The patch works because it finds the first (topmost) suggestion, but collecting the top-N (by max or min) will require a priority queue so that one knows which arc to follow next (and this will also require storing partially collected paths as pointers into the fst/queue)?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Not sure it requires one; http://www.cs.nyu.edu/~mohri/pub/nbest.ps has some solutions.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I'm sure there are solutions to the problem if you change the algebra ops – the pq is a naive solution that would work on top of positive outputs.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yeah I think we should try that first, and see how it performs.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Dawid, by problematic example, you mean you think this approach is functionally correct but may not perform very well...?

That is definitely the worst-case performance (for either top-1 or top-K on a wFST with a simple PQ), but this (the number of non-competitive arcs you have to scan and discard) is a constant factor on the overall complexity, right?

I think we should at least test the simple PQ on a PositiveIntOutputs wFST and see how it performs in practice. If indeed having everything "in one bucket" is too slow, we could combine the two approaches: still use buckets, but within each bucket we have a wFST (ie, use the "true" score), so we don't actually do any quantizing in the end results. Then bucketing is purely an optimization...

Or, maybe, we could keep one bucket but sort each node's arcs by their output instead of by label. This'd mean the initial lookup-by-prefix gets slower (linear scan instead of a bin search, assuming those nodes had array'd arcs), but then producing the top-N is very fast (no wasted arcs need to be scanned). Maybe we could keep the by-label sort for nodes within depth N, and then sort by output beyond that...

Or we could change the outputs algebra so that more "lookahead" is stored in each output so we have more guidance on which arcs are worth pursuing...

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I thought you had a solution that collects the top-N, but your patch selects only the single best matching solution. I don't know how you planned to go about selecting the top-N, but in my understanding (at that moment) top-N selection is not going to work via a recursive scan, because an output at a given level doesn't tell you much about which arcs to follow.

I can see how this can be solved by picking the arc/direction with the "next smallest/largest" output among all arcs traversed so far but this will be more complex and I cannot provide any bounds on how large the queue can be or what the worst case lookup then is. I do have a feeling a degenerate example can be devised, but then I also have a feeling these are uncommon in practice.

Sorting arcs by score doesn't help if you use the pq – you need to add all of them to the pq and then pick the smallest path. In a way it is like what you did, but the pq maintains fast access to the next-smallest-cost path. Another feeling is that the PQ can be bounded to a maximum size of N? Every arc leads to at least one leaf, so while traversing you'd drop those arcs that would definitely have fallen out of the first N smallest/largest weights... Yes, this could work. I'd still try to devise a degenerate example to see what the cost of maintaining the PQ can be.
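The PQ idea can be sketched like this (a toy trie again, not Lucene's FST API): each queue entry is a partial path, ordered by the smallest weight still reachable below it, so popping the queue always expands the globally most promising path and results come out in exact score order with no bucketing.

```java
import java.util.*;

// Best-first top-N over a toy weighted trie (illustrative model only).
// An Entry with node == null marks a finished term; otherwise its bound
// is the smallest weight reachable below the node, so the queue head is
// always the globally best unexplored path.
class TopNSearch {
    static final class Node {
        final TreeMap<Character, Node> kids = new TreeMap<>();
        long weight = Long.MAX_VALUE; // weight if a term ends here
        long best = Long.MAX_VALUE;   // smallest weight anywhere below

        void add(String term, long w) {
            Node n = this;
            n.best = Math.min(n.best, w);
            for (char c : term.toCharArray()) {
                n = n.kids.computeIfAbsent(c, k -> new Node());
                n.best = Math.min(n.best, w);
            }
            n.weight = w;
        }
    }

    record Entry(String prefix, long bound, Node node) {}

    static List<String> topN(Node root, int n) {
        PriorityQueue<Entry> pq =
            new PriorityQueue<>(Comparator.comparingLong(Entry::bound));
        pq.add(new Entry("", root.best, root));
        List<String> out = new ArrayList<>();
        while (!pq.isEmpty() && out.size() < n) {
            Entry e = pq.poll();
            if (e.node() == null) { out.add(e.prefix()); continue; }
            if (e.node().weight != Long.MAX_VALUE) // a term ends here
                pq.add(new Entry(e.prefix(), e.node().weight, null));
            for (Map.Entry<Character, Node> kid : e.node().kids.entrySet())
                pq.add(new Entry(e.prefix() + kid.getKey(),
                                 kid.getValue().best, kid.getValue()));
        }
        return out;
    }
}
```

On the degenerate input above (xa|N, xb|i), this version does push every root arc onto the queue, which is exactly the scanning cost being discussed.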

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

If I seem inconsistent above then it's because I don't have ready-to-use answers and I'm sort of thinking out loud :)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

we could combine the two approaches: still use buckets, but within each bucket we have a wFST (ie, use the "true" score), so we don't actually do any quantizing in the end results. Then bucketing is purely an optimization...

I like this idea!

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

If my feeling is right and the PQ can be kept constant-size then it won't matter much at runtime I think. With realistic data distributions the number of elements to be inserted into the PQ before you reach the top-N will be pretty much the same (?). And the benefit would be a much cleaner traversal (no need to deal with buckets, early termination, etc.).

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch, generalizing to a topN search from the wFST (not just top 1). I fixed the random test to randomly pick a topN and it's passing!

I think the cost is something like O(N * L) + O(N * log(N)), where N is the top N and L is the length of each completion. The N * log(N) is because I use a TreeSet (a PQ wasn't enough: I needed pollFirst and pollLast... for 3.x that'll have to be 2 method calls...), but I suspect in practice it won't dominate, since N is typically smaller than L and the constant in front of it is tiny...

Each path requires a traversal through the FST looking for the arcs with NO_OUTPUT, so if the FST has many non-competitive arcs that will definitely slow it down (by a [possibly large] constant factor). We need to test on a real data set how slow it is....
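The bounded-queue variant can be sketched like this (a toy model, not the actual patch): a TreeSet plays the role of the PQ, pollFirst() pops the most promising partial path, and pollLast() evicts the worst one whenever the set grows past N. Eviction is safe for the reason Dawid gives above: every kept path leads to at least one term whose weight equals the path's bound, so the N best-bounded paths already guarantee N results at least as good as anything evicted.

```java
import java.util.*;

// Bounded best-first top-N over a toy weighted trie (illustrative only).
// TreeSet.pollFirst() pops the best partial path; pollLast() caps the
// queue at N entries, mirroring the pollFirst/pollLast trick above.
class BoundedTopN {
    static final class Node {
        final TreeMap<Character, Node> kids = new TreeMap<>();
        long weight = Long.MAX_VALUE; // weight if a term ends here
        long best = Long.MAX_VALUE;   // smallest weight anywhere below

        void add(String term, long w) {
            Node n = this;
            n.best = Math.min(n.best, w);
            for (char c : term.toCharArray()) {
                n = n.kids.computeIfAbsent(c, k -> new Node());
                n.best = Math.min(n.best, w);
            }
            n.weight = w;
        }
    }

    // node == null marks a finished term.
    record Entry(String prefix, long bound, Node node) {}

    static List<String> topN(Node root, int n) {
        Comparator<Entry> order = Comparator.comparingLong(Entry::bound)
            .thenComparing(Entry::prefix)
            .thenComparingInt(e -> e.node() == null ? 0 : 1);
        TreeSet<Entry> queue = new TreeSet<>(order);
        queue.add(new Entry("", root.best, root));
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty() && out.size() < n) {
            Entry e = queue.pollFirst();            // most promising path
            if (e.node() == null) { out.add(e.prefix()); continue; }
            if (e.node().weight != Long.MAX_VALUE)  // a term ends here
                queue.add(new Entry(e.prefix(), e.node().weight, null));
            for (Map.Entry<Character, Node> kid : e.node().kids.entrySet())
                queue.add(new Entry(e.prefix() + kid.getKey(),
                                    kid.getValue().best, kid.getValue()));
            while (queue.size() > n)
                queue.pollLast();                   // evict the worst path
        }
        return out;
    }
}
```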

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

An example showing top-N with score and alpha order, using a pq (unbounded, but going bounded should be simple).

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Damn, again slower than McCandless... I didn't do a bounded queue, but it's perfectly possible to do one. My patch just shows the algorithm; I didn't check if Mike's version is the same but I suspect it must be close (?).

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ooh, your patch is doing the same algo as mine (best-first search)... yours is much simpler/smaller though :)

I think your tie break isn't quite right? Because, those two arc.labels you use aren't necessarily "comparable", since the two paths may not be the same length (hmm or share the same "last parent" node)? You need to roll back and do a full compare of the accumulated input labels down that path? I struggled with this too...

I also took advantage of a neat property that every min/+ wFST will have (Robert pointed this out): the first arc (only) will have the min output, and then there must exist a NO_OUTPUT completion path from that arc to some final node. So, I think my patch will visit paths in the same best-first order as yours, but I avoid the push/pop/new object into the queue for each arc traversed by greedily/recursively pursuing the first NO_OUTPUT arc I can find. I only fall back to the queue when I need to find the next top-N path to pursue...

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

You're correct – I should be comparing full paths so far, not the current label. Otherwise I see we're pretty much the same. I like the lower bound cutoff condition. I vaguely understand the NO_OUTPUT optimization. :)

I like it, this indeed is a nice improvement - if somebody wants exact scores, they can have them.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

patch with a prototype suggester.

There are many nocommits: especially, the encoding of weights is very inefficient and slows/bloats the FST (but it was easy for a prototype).

We should make a better Outputs class with the algebra we need (e.g. max), and maybe think about floats in this suggester API. It seems from API names like TermFreq that these are really frequencies, so maybe we should define them as ints? I don't like floats because of all the hassles like NaN/Inf.

But even with this, the performance seems promising:

    [junit] ------------- Standard Error -----------------
    [junit] -- prefixes: 100-200, num: 7, onlyMorePopular: true
    [junit] JaspellLookup   queries: 50001, time[ms]: 91 [+- 6.59], ~qps: 547
    [junit] TSTLookup       queries: 50001, time[ms]: 33 [+- 1.55], ~qps: 1532
    [junit] FSTCompletionLookup queries: 50001, time[ms]: 163 [+- 13.04], ~qps: 307
    [junit] WFSTCompletionLookup queries: 50001, time[ms]: 55 [+- 1.50], ~qps: 910
    [junit] -- construction time
    [junit] JaspellLookup   input: 50001, time[ms]: 23 [+- 0.93]
    [junit] TSTLookup       input: 50001, time[ms]: 31 [+- 1.25]
    [junit] FSTCompletionLookup input: 50001, time[ms]: 72 [+- 1.73]
    [junit] WFSTCompletionLookup input: 50001, time[ms]: 69 [+- 2.50]
    [junit] -- RAM consumption
    [junit] JaspellLookup   size[B]:    7,869,415
    [junit] TSTLookup       size[B]:    7,914,524
    [junit] FSTCompletionLookup size[B]:      466,051
    [junit] WFSTCompletionLookup size[B]:      506,662
    [junit] -- prefixes: 6-9, num: 7, onlyMorePopular: true
    [junit] JaspellLookup   queries: 50001, time[ms]: 129 [+- 1.37], ~qps: 389
    [junit] TSTLookup       queries: 50001, time[ms]: 139 [+- 2.30], ~qps: 361
    [junit] FSTCompletionLookup queries: 50001, time[ms]: 177 [+- 1.62], ~qps: 282
    [junit] WFSTCompletionLookup queries: 50001, time[ms]: 85 [+- 12.74], ~qps: 592
    [junit] -- prefixes: 2-4, num: 7, onlyMorePopular: true
    [junit] JaspellLookup   queries: 50001, time[ms]: 363 [+- 16.75], ~qps: 138
    [junit] TSTLookup       queries: 50001, time[ms]: 1112 [+- 22.57], ~qps: 45
    [junit] FSTCompletionLookup queries: 50001, time[ms]: 140 [+- 3.16], ~qps: 356
    [junit] WFSTCompletionLookup queries: 50001, time[ms]: 242 [+- 4.01], ~qps: 207
    [junit] ------------- ---------------- ---------------
asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Didn't look at the patch yet but this looks surprising:

FSTCompletionLookup queries: 50001, time[ms]: 163 [+- 13.04], ~qps: 307
WFSTCompletionLookup queries: 50001, time[ms]: 55 [+- 1.50], ~qps: 910

Why would the previous version be so much slower? With onlyMorePopular=true this should be a single pass. Unless it's setup to bring exact match to the top and then there's a need to check multiple buckets with lower scores – would that be it?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

That's the exact-match case of the benchmark, where the entire word is typed in completely?

I didn't look at it because it's not very interesting (TSTLookup looks super-fast here too).

I think this is the interesting part:

    [junit] -- prefixes: 2-4, num: 7, onlyMorePopular: true
    [junit] JaspellLookup   queries: 50001, time[ms]: 363 [+- 16.75], ~qps: 138
    [junit] TSTLookup       queries: 50001, time[ms]: 1112 [+- 22.57], ~qps: 45
    [junit] FSTCompletionLookup queries: 50001, time[ms]: 140 [+- 3.16], ~qps: 356
    [junit] WFSTCompletionLookup queries: 50001, time[ms]: 242 [+- 4.01], ~qps: 207

We could of course toggle 'num' etc. and see how this comes out. Anyway, I just wanted to make sure the performance was 'in the ballpark' and competitive with the other suggesters. I think for a lot of users that's all that matters, and whether it's 400 QPS or 200 QPS probably doesn't matter so much... so I didn't tweak any further.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Awesome! I haven't looked at the patch yet... but the comparison is also apples/oranges, right?

Because WFSTCompletionLookup does no quantizing (bucketing), ie, it returns the "true" topN suggestions, so it has to do more work to differentiate hits that FSTCompletionLookup considers equal.

I guess we could pre-quantize the weights into the same buckets that FSTCompletionLookup will use, when adding to WFSTCompletionLookup... then the results are comparable.

But I guess we don't need to do this since the results are "good enough"...

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

No, no, Mike – I'm all for going with exact, fine scores. That way even if WFSTCompletionLookup is slower, it's still a real-life use case, and FSTCompletionLookup (even if faster) will have an asterisk saying it's not a complete solution.

I like what Robert did (still didn't look at the patch), but I was just wondering why the hell there's such a difference for long prefixes. This is not a frequent case in real life, just curiosity.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Sorry, I'm not proposing we commit pre-quantizing the scores... I'm just saying we'd learn more from the perf test that way. Ie, is WFST slower because 1) it's doing precise scores, or 2) the topN algo is slowish.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think the (existing) benchmark is fair: it just measures how fast each suggester returns the top-N. Of course FSTSuggester buckets/early-terminates, and that's just a tradeoff it makes (having some impact on scores, possibly even positive).

This is just like benchmarking different relevance ranking algorithms, which also compute the top-N differently...

Keep in mind, I didn't optimize the implementation at all or profile anything, so there could be easy wins here. But at this stage I think we just wanted to know it's in the ballpark?

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Looked through the patch. Some comments:

ignore duplicates – I think the iterator is allowed to return duplicates; I'm not sure, but this may even happen when bucketing input: the same entry with a different score (before bucketing) may end up identical after sorting.

Yes, this should definitely be an option because if it's an exact match then you'll probably want it on top of the suggestions list, no matter what.

You could also add a generator of the "bad case" that I attached inside TestMe.java – this creates the case where following greedily doesn't yield correct output (it requires a pq).

I also checked the benchmark, and yes, it uses exactMatchFirst promotion. This explains the speed difference for longer prefixes: not enough results are collected in any single bucket, so no early termination occurs and all buckets must be traversed.

I like WFSTCompletionLookup very much, clean and simple.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Dawid thanks for the comments: we can remove that first nocommit and then add an option for the second one...

I think as a step to move this forward we have to fix the output encoding to not be (Integer.MAX_VALUE-weight).

Seems like the best first step is to generify findMinPairs to T and to allow an arbitrary Comparator so that we can muck around with the algebra? I looked at this and it seems possible...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Also, I'm not sure I'm in love with "findMinPairs". Maybe we should call it "shortestPaths"?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think the (existing) benchmark is fair: it just measures how fast each suggester returns the top-N. Of course FSTSuggester buckets/early-terminates, and that's just a tradeoff it makes (having some impact on scores, possibly even positive).

OK, I agree with that: it is a meaningful black-box comparison of two suggester impls.

also I'm not sure I'm in love with "findMinPairs". Maybe we should call it "shortestPaths" ?

+1

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch nuking a couple of nocommits: added the boolean exactMatchFirst (default=on, like FSTSuggester) and removed the deduping nocommit.

So the main remaining things are:

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I've been wanting to work on this... haven't found the time.

This just syncs the patch up to trunk's FST api changes.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I played with a lot of variations on this patch:

I think we should keep the code wired to Long for now: according to the benchmark, any generification is like a 5-10% overall perf hit, and I don't see a need for anything but Long.

I think as far as representation goes, we need to integrate the offline sort, find the min/max float values, and scale to <precision> space, e.g. if precision is 32 then Integer.MAX_VALUE - scaledWeight.

I tried different representations and they just add more complexity (e.g. negative outputs), without saving much space at all. This patch uses Integer precision and is only 10% larger than the previous impl.

We don't even need precision to be configurable really, we could wire it to integers as a start. But maybe later someone could specify it, e.g. if they specified 8 then they basically get the same result as bucketed algorithm today...
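The encoding could look something like this (helper names are hypothetical, not from the patch): scale raw float weights into 32-bit space, then store Integer.MAX_VALUE - scaledWeight as the output, so the highest-weighted suggestion becomes the shortest path for the min/+ search.

```java
// Hypothetical sketch of the weight encoding described above.
// encode() maps a raw float weight in [min, max] onto [0, Integer.MAX_VALUE]
// and inverts it, so bigger weight -> smaller cost -> shorter path.
class WeightCodec {
    static long encode(float w, float min, float max) {
        double frac = ((double) w - min) / ((double) max - min);
        long scaled = (long) (frac * Integer.MAX_VALUE);
        return Integer.MAX_VALUE - scaled; // invert: best weight = cost 0
    }

    static float decode(long cost, float min, float max) {
        long scaled = Integer.MAX_VALUE - cost;
        return (float) (min + (double) scaled / Integer.MAX_VALUE * (max - min));
    }
}
```

With a smaller precision (e.g. 8 bits) the same scheme degenerates into today's bucketed weights, which is the configurability point above.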

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Rounded out this patch: fixed it to use the offline sorting, added a random test (basically a port of the FST shortest-path test, but uses the suggester instead), and added a solr factory and some docs.

I think it's ready to go.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'll commit this soon if there are no objections.

I'll mark it experimental for now when committing.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Committed to trunk: working on the backport now.

I'm gonna measure any perf difference from the pollFirst/pollLast stuff... if there is a slowdown we can deal with that in a separate issue (maybe there is some way to prevent a slowdown, at least if Java 6 is actually what's being used)... but maybe it doesn't affect overall suggest speed anyway.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I ran the benchmarks, no measurable difference, makes it easy. Will commit the backport soon after ant test/javadocs with java5 are finished.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Backported to 3.x too, I think we can iterate from here with other issues!