blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0
10.02k stars 680 forks source link

improve fuzzy search to use levenshtein automaton #112

Open mschoch opened 9 years ago

mschoch commented 9 years ago

See: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html

Also, according to this post:

https://news.ycombinator.com/item?id=8202764

The author has already written a Go implementation. Unfortunately I have no way to reach this person.

emilgpa commented 9 years ago

I wanted to help in the search for the author of that post and I think it might be this: https://twitter.com/dvirsky https://github.com/dvirsky

I don`t know if the same person, but I think that maybe yes..

mschoch commented 9 years ago

tweet sent

mschoch commented 9 years ago

I had misunderstood the comment, the implementation was of the "pre-deletion algorithm" not the automaton. So, there is still a fun project for someone interested in making this faster.

dvirsky commented 9 years ago

I'm the aforementioned dude. hi.

a. I've been looking for an excuse to write the Levenshtein thing in Go. This might be it :)

b. Have you tried Faroo's approach? I'm not familiar with this project so I'm not sure what's going on inside there.

mschoch commented 9 years ago

I don't see any reason why Faroo's aproach would not work. No idea how it compares performance-wise, but one advantage it would seem to have over the automaton is that it can be updated in place, whereas the automaton has to be rebuilt to accommodate new terms.

The main questions I have are about how the integration would actually work.

  1. Presumably at mapping time (or later) the user has some way to say they want these tables to be generated/maintained for a given field.
  2. If they are in place, fuzzy search can use them, and be faster, if not it can fall back to the slower approach.
  3. Should they be maintained in memory only? or does it make sense to persist them in a separate section of the index. If they are maintained in memory how do we link that to the indexing path so that deleted terms are purged, and new terms are added?
dvirsky commented 9 years ago

Their blog post says they just add pre-deleted records to their index, pointing directly to the right document (and probably with a flag specifying these are "fuzzy" records, and a reduced weight to the record based on the L-distance). It has a huge memory/disk overhead from my experience, but it's dead simple to add this to an inverted index and will cost hardly any CPU time.

dvirsky commented 9 years ago

BTW I'm more interested in using this or the automaton approach for fuzzy autocomplete. i.e if you type "varac" you will get suggestions for "barack obama". I've written an auto-complete engine in the past that uses Norvig's approach, but at ~10 characters it becomes so slow to brute-force candidates and rank them that it was impractical. And this [faroo's] approach seems to be too expensive in terms of RAM.

mschoch commented 9 years ago

So in that case I would think that...

  1. User chooses to enable this on a field at mapping time. 2a. Fuzzed rows are inserted just like normal ones with some flag or 2b. Fuzzed rows are inserted separately
  2. Non-fuzzy searches ignore fuzzed rows (2a) or don't see them at all (2b)
  3. Fuzzy searches use all the rows, or knows how to read the fuzzed rows for non-exact matches

The advantage of this is not a lot of new code paths. Almost everything fits as is. The decision between 2a and 2b to me has to do with just how much term explosion there is. It sounds like its a lot, and I worry that it would slow down the non-fuzzy searches because they'd have to skip over all these uninteresting rows. By going with 2b, we would prefix the rows differently so that non-fuzzy searches never hit that part of the index.

mschoch commented 9 years ago

@dvirsky maybe open a separate issue to discuss auto-complete. its definitely a feature we'd like to add to bleve as well.

dvirsky commented 9 years ago

@mschoch do you have auto-complete implemented in any form?

mschoch commented 9 years ago

Its not exposed right now because of some rough edges, but you can get an iterator for the terms used in a field. So, it would be feasible to offer a term-prefixed based suggestion. My understanding is that while this is somewhat simple to implement its not that useful in practice.

And of course we should probably distinguish between dictionary based suggestions that are based entirely on whats in the index, vs usage based suggestions (like google) which are based on things actually searched for independent of what actually exists.

Its a pretty wide open area of the project, if you want to help shape that part we'd be glad to have contributions.

dvirsky commented 9 years ago

Yeah, naive term-based generates a lot of junk. It might be used as a seed for query-based completion (which needs LOTS of data to not look ridiculous) .

But if you filter terms based on some NLP pre-processing, and perhaps known entity detection and synonym expansion, you will get much much nicer results to boot.

Another problem is composite completion. let's say you have a site with songs and artists. When a user types "madonna like" you'll probably want to complete to "like a virgin". But no user has searched for this before and the term verbatim never appears in the index. It's possible to infer context for completion but not trivial. Google do that (of course...) but they only added that 3-4 years ago IIRC.

I have Go code for a simple prefix/suffix based completion engine with ranking [https://bitbucket.org/dvirsky/boilerdb/src/160b1047f1a73d6f72cd967aeceef055ad7ca77d/src/plugins/prefix_tree/trie.go?at=master], but indeed the problem is how you're going to populate that index.

dvirsky commented 9 years ago

@mschoch I've just stumbled upon this library. I haven't looked to deeply at it but it seems very relevant. https://github.com/smartystreets/mafsa

mschoch commented 9 years ago

Thanks, that does look to be very relevant. If anyone wants to prototype something using this library that would be awesome. It looks pretty high-level and easy to use...

dvirsky commented 9 years ago

The issue here is memory overhead compared to Faroo's approach disk overhead. I'll try to fill it up with a bunch of words to see the memory consumption. It means that working with fuzzy search will require an in-memory side-index. But if you can get spelling corrections out of the box with that it might be worth it.

mschoch commented 9 years ago

The most recent commit https://github.com/blevesearch/bleve/commit/522f9d5cc7cf698cf9e0d288ea4a04b45489d4fd now lets us enumerate the terms used in a field. Further it produces them in ascending order, which is necessary for building many of the data structures like mafsa and other similar ones.

It should now be possible to do some fuzzy matching and suggestions outside of bleve, using the term dictionary found inside of bleve. Tighter integration within bleve remains open for the future.

mschoch commented 9 years ago

Some more links from @dgryski in #bleve

Approximate String Processing http://hadjieleftheriou.com/papers/FnT_HS11.pdf

Damn Cool Algorithms: Levenshtein Automata http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

mschoch commented 9 years ago

I just came across this blog, with straightforward python code (40 lines or so) claiming to be a reasonably efficient levenshtein automata: http://julesjacobs.github.io/2015/06/17/disqus-levenshtein-simple-and-fast.html

dvirsky commented 9 years ago

Looks cool, seems like porting this to go should be a breeze. I'll try to do it and see how it works, and report back.

dvirsky commented 9 years ago

@mschoch is there a trie representation of the dictionary to be searched in bleve?

Porting the python code would indeed be trivial, but it work well unless we traverse a trie of all the words that exist in the index.

dgryski commented 9 years ago

There is some discussion on how accurate the runtime analysis presented is: https://news.ycombinator.com/item?id=9737554

dvirsky commented 9 years ago

@dgryski haven't read it yet, but for one, the method in this blog post would probably cause tons of GC in go, it constantly dynamically allocates vectors of integers

dgryski commented 9 years ago

Perhaps https://godoc.org/golang.org/x/tools/container/intsets or https://godoc.org/github.com/google/codesearch/sparse would help? (I haven't read it yet either ;)

dvirsky commented 9 years ago

@dgryski the longer version of the algo is a "sparse levenshtein automaton" that drops from the vectors distances greater than the max limit, so it will reduce allocation. The guy implements them as 2 python lists, but maybe an intset would be better. Also pooling those sets might also reduce allocations.

The algo in the post doesn't deal with working alongside a trie, but it should be simple to extend - although you'll need a stack of copied state vectors, which will also cause GC. Anyway, I'll start by porting the basic code and work my way from there. The algorithm itself should be pretty fast from the looks of it.

dvirsky commented 9 years ago

@mschoch @dgryski here's a naive, unoptimized go implementation of the sparse automaton from the blog post. It needs intersection with a trie to be worth anything, I'll try to get to that later or tomorrow.

https://github.com/dvirsky/levenshtein

The package includes a test that doesn't actually test much, but runs the example code from the blog post.

dvirsky commented 9 years ago

Ok, I've implemented matching against a trie, and it works. I haven't benchmarked it against a big dictionary - and the trie will probably take up lots of RAM in its current form.

Example usage:


func ExampleTrie() {

    trie := NewTrie()
    trie.Insert("banana")
    trie.Insert("bananas")
    trie.Insert("cabana")
    trie.Insert("cabasa")

    fmt.Println(trie.FuzzyMatches("banana", 2))
    // Output:
    // [banana bananas cabana]
}
dgryski commented 9 years ago

Using a compressed trie (patricia tree, like https://github.com/armon/go-radix ) should help significantly with the space.

dvirsky commented 9 years ago

yeah, I used the simplest naive trie, just to check the whole flow and correctness of the algorithm. So basically as a POC it works, now it needs some work to be made production grade. I'll work on it some more tonight probably, and test it with a bigger data set :)

BTW this can be pretty easily extended to an autocomplete service with "spell as you type" corrections.

mschoch commented 9 years ago

Awesome progress! Sorry for the late reply, re: trie. Currently bleve doesn't have anything. What we do have is a Field Dictionary Iterator interface which you can use to build the trie with data from the index.

I did a little bit of experimenting with the sajari/fuzzy and smartystreets/mafsa. I think there are still a lot of open questions about how an auto-complete/spell-correct interface for bleve should look. I do like the idea that user can either replace or augment the term dictionary data with other data (sometimes you have better data from search logs and/or want to re-weight things based on other information).

Further, the term dictionary info we get now is a one-time thing. So, if you continue to update your index, your auto-completer/suggester is stale. The data-structures tend to matter a lot here, because many of them are either expensive to create, hard to modify after being built, or both. For a lot of use-cases this is OK, you may not need to immediately see a new term show up in the suggestions, and can tolerate rebuilding the suggestions perodically.

The overall take-away here is that I'm very open to suggestions. And rather than get stuck on coming up with something perfect, I'm also very open to us just plugging something in to see how we like it.

dvirsky commented 9 years ago

@mschoch why not use sajari's package then? it seems pretty good.

also, how do you see the integration of such a feature with bleve? as a separate service? part of the library?

mschoch commented 9 years ago

I'm not opposed to using sajari's package. I actually thought it was just using a map underneath the hood though, and wouldn't yield the benefits of these more advanced data-structures. But, if the recommendation is that this is good and we should use it, I'm fine with that.

As for integration, I still have more questions than answers. I see 3 levels of interest:

  1. The lowest level is separate from the top-level bleve package. This would be for the building blocks, things that you could reuse outside of bleve, but which bleve will use to build on.
  2. Useful functionality exposed at the top-level bleve package. At the simplest level we should make it easy for users to get suggestions from their term dictionary. For more advanced users we should make it possible for advanced users to augment this to provide better suggestions.
  3. At an even higher level there should be some composability. We already support searching across multiple indexes with the index alias feature. This is useful for sharding indexes. But, of course if you shard your data, now your term dictionary can vary across the shards. So, getting suggestions can require similar scatter/gather requests.

Finally, as described so far, all these auto-competion/suggester instances are something you programatically create. But, because they can be expensive to create/update, there is a case to be made for them to have some persistance, possibly along with the index itself (not necessarily in the k/v store, but colocated on the filesystem). At that point, I'm wondering if their existance should somehow be part of the index mapping.

As I said, all of this is open for discussion. So if you have thoughts on any of these topics, please share them.

dvirsky commented 9 years ago

Looking at the sajari code, it looks like they're using Faroo's algorithm (see waaay above in this thread), which is indeed lightning fast (it just saves all possible deletes of each word, and looks them up), but has huge memory overhead, especially with long strings.

They claim to do 40us per search. Using their test dictionary and a more optimized version of my trie lookup, I'm able to do 1-2ms per search on the same vocabulary, depending on word length. BTW for very short words (2-4 chars probably) you can do a single distance lookup which can be x10 faster.

We need to compare RAM and time characteristics for both methods and see which is the right way to optimize.

dvirsky commented 9 years ago

Re the use case - I'm not very familiar with Bleve's architecture, but how about using it for query expansion, not only correction. You can simply search for all the fuzzy matched "relatives" of the query, and perhaps rank the results lower somehow. This can help on non errors as well, i.e. crude stemming (although maybe a 1-distance search should be used)

Shugyousha commented 9 years ago

Sorry if this is old news but at this year's FOSDEM there was an interesting presentation about using Lucenes FSTs (I actually think that is a misnomer and it should be FSA but it's probably not important) for spelling correction.

http://video.fosdem.org/2015/devroom-open_source_search/effective_spelling_correction_with_term_relation_graphs_using_lucene_fsts.mp4

What they do is intersecting a Lucene FST with an automaton built from the user query to get their spelling corrections. If one uses the field terms to build the FST it would be usable for fuzzy search as well if I understand correctly. The whole thing seems to break down if there is a misspelling in the first char of the user query though.

jamra commented 9 years ago

I haven't contributed to this project yet, but am interested. Is this still something that's being worked on?

mschoch commented 9 years ago

Welcome @jamra

There is a lot of interest in this, and a few of us have experimented with wiring up various implementations. I still have questions about the best way to wire it up to the rest of Bleve. I think there isn't a lot of consensus here yet, as it seems like we all want to accomplish something slightly different.

In summary, I'd say if you're interested, there is still work to be done :)

jamra commented 9 years ago

@mschoch Thanks! :)

I will need to get more situated with how Bleve does its indexing, but I like the material that was linked by this issue. One of the situations brought up in the video linked by Shugyousha is an example of having 50 edit-distance-of-one search results, but only want to return 25 or so. Is there any consideration of matching by multiple words (I think the video calls it biword matching)?

I implemented a few fuzzy search algorithms, which have very different :

I would like to assist with this issue in particular if someone can perhaps lay down some guidelines in terms of respecting the dynamically changing indices and ranking results.

mschoch commented 9 years ago

@jamra my hope was that eventually we would have a well defined interface, then you could plug in the best implementation for a given use case. But, we've kinda hesitated on this issue for a long time. I'd love to see us make progress without worrying about solving everything at once.

What I'd love to see is a proposal for how someone thinks the API should look. The core use case is someone who has already loaded their data into Bleve, and now wants to get completions/suggestions from it. I realize people want to do a lot more than just that, but I think its a starting point.

jamra commented 9 years ago

@mschoch Without creating a special index for fuzzy matching, we would have to traverse the entire list of words. I tried to search through the codebase to see how indexing works, but it's taking me some time to understand the code. Is there any documentation on indexing?

mschoch commented 9 years ago

Hmmn, maybe there is still some disconnect about what we're talking about then.

My thought was that the suggester/completer was an in-memory structure built for the purpose of efficiently answering these fuzzy/incomplete queries. So, I assumed they start with inefficient, lists of words, and build the fast-to-query indexes. Some of these may also be able to be persisted to disk, for easy reuse upon process restart, but some may not support that and may have to be rebuilt.

I'm not quite sure what kind of documentation you're looking for. Is there some particular aspect of indexing you're trying to understand?

jamra commented 9 years ago

No misunderstanding. That's what I'm looking to do as well. I think I'm using the word indexing incorrectly. I was trying to figure out how to get access to the list of words in order to build the in-memory structure.

I suppose there are two stages to this in-memory structure. Building the structure and then querying the structure.

mschoch commented 9 years ago

So, the methods you're interested in are:

https://github.com/blevesearch/bleve/blob/master/index.go#L98-L100

They return an enumerator-like object called a FieldDict, which returns DictEntry objects, in order.

https://github.com/blevesearch/bleve/blob/master/index/index.go#L86-L94

dvirsky commented 9 years ago

Another take on Levenshtein automata: http://fulmicoton.com/posts/levenshtein/

mschoch commented 8 years ago

Adding another reference about these type of automatons, this one is implemented in Rust, but the blog post includes a lot of useful details: http://blog.burntsushi.net/transducers/

Shugyousha commented 8 years ago

To come back to this topic...

On Mon, Jun 22, 2015 at 01:26:35PM -0700, Marty Schoch wrote:

I'm not opposed to using sajari's package. I actually thought it was just using a map underneath the hood though, and wouldn't yield the benefits of these more advanced data-structures. But, if the recommendation is that this is good and we should use it, I'm fine with that.

As for integration, I still have more questions than answers. I see 3 levels of interest:

  1. The lowest level is separate from the top-level bleve package. This would be for the building blocks, things that you could reuse outside of bleve, but which bleve will use to build on.

I have combined the levenshtein automaton implemented in https://github.com/dvirsky/levenshtein package with https://github.com/smartystreets/mafsa as a Trie alternative. The result has been merged into dvirsky's code (the structure is called MinTree) and it should theoretically provide better memory usage than the naive Trie he started out with (though I have not benchmarked it).

  1. Useful functionality exposed at the top-level bleve package. At the simplest level we should make it easy for users to get suggestions from their term dictionary. For more advanced users we should make it possible for advanced users to augment this to provide better suggestions.

I have added a FuzzySearch implementation that uses the MinTree structure mentioned above to bleve as a proof-of-concept (POC) and added a benchmark. The benchmark shows a speedup of about 20x compared to the current implementation but this does not include the time needed to populate the MinTree structure with the terms of the FieldDict. The code can be found here.

https://github.com/Shugyousha/bleve/tree/mafsafuzzy

The POC uses the upside_down.IndexReader to expose the functionality and the FieldCache to store the MinTrees built for each field. The following questions/issues remain and should be discussed.

  1. We should probably use some sort of "Suggester" interface that has a GetSuggestions(term string, fuzziness int) []string (or similar) method. The MinTree could be easily extended to implement the interface and other implementations could be used as well (depending on the use case).
  2. Use of the infrastructure for suggestions: Exposing the MinTree/Suggester as a way to get suggestions. Maybe a GetSuggester() Suggester method on the IndexReader would be a simple start? People that want to try the functionality could then use the Advanced() method to get access to the IndexReader and its Suggester. We could hook the functionality up in a HTTP handler in http/ as well.
  3. Building/Updating the MinTree/Suggester: In the POC I just build the MinTree on demand (trying to guard it from concurrency issues using a channel, an approach which I have not tested yet and is probably buggy) using the BuildMinTree(fieldname string) method.
  4. Persisting the MinTree/Suggester: The MinTree structure can be serialized into a binary form and read from disk again though I have not implemented this in the POC. Theoretically it should be easy to do though and adding a Save(whereto io.Writer)/Load(wherefrom io.Reader) or similar method to the Suggester interface should be feasible as well.
  5. The IndexReader may not be the best place to put the MinTree/Suggester and the FieldCache may not be the right place to cache them either. It's just the first place that came to mind when I was trying to make it work...
  1. At an even higher level there should be some composability. We already support searching across multiple indexes with the index alias feature. This is useful for sharding indexes. But, of course if you shard your data, now your term dictionary can vary across the shards. So, getting suggestions can require similar scatter/gather requests.

I have not thought about this at all yet...

Cheers,

Silvan

Finally, as described so far, all these auto-competion/suggester instances are something you programatically create. But, because they can be expensive to create/update, there is a case to be made for them to have some persistance, possibly along with the index itself (not necessarily in the k/v store, but colocated on the filesystem). At that point, I'm wondering if their existance should somehow be part of the index mapping.

As I said, all of this is open for discussion. So if you have thoughts on any of these topics, please share them.

jamra commented 8 years ago

I like the Rust implementation linked by @mschoch

I went through the source code, but my Rust knowledge is non existent. He links to two research papers explaining how to build an FSA and then an FST. I started the first paper, but am too busy to really get into it. The requirements are to insert terms in lexicographic order. The index is compressed wonderfully and his implementation can store the index on disk or in memory (which we should too). Another benefit is we can use that same index to match regex queries. The key is intersecting automatons.

I peaked into Bleve's source code and I see that we have only one implementation of an index. I suppose we could use the new method offered by @Shugyousha but some of the literature states that a distance of 2 is optimal both in accuracy and in creation time.

If someone could assist me with the research papers and figure out how to represent the FST efficiently, I would be able to help out. Otherwise, I'll have to wait for my day job to get less busy :P

FSA paper FST paper

Shugyousha commented 8 years ago

On Tue, Mar 29, 2016 at 01:30:45PM -0700, Jacob Amrany wrote:

I like the Rust implementation linked by @mschoch

I went through the source code, but my Rust knowledge is non existent. He links to two research papers explaining how to build an FSA and then an FST. I started the first paper, but am too busy to really get into it. The requirements are to insert terms in lexicographic order. The index is compressed wonderfully and his implementation can store the index on disk or in memory (which we should too). Another benefit is we can use that same index to match regex queries. The key is intersecting automatons.

The mafsa package I used in the POC implements a "minimal acyclic finite state automaton" already and then uses a levenshtein automaton based on sparse vectors to generate the fuzzy strings. I think the sparse vector approach used in the levenshtein automaton is similar in behavior and performance to doing an FSA intersection but I am not 100% sure.

I peaked into Bleve's source code and I see that we have only one implementation of an index. I suppose we could use the new method offered by @Shugyousha but some of the literature states that a distance of 2 is optimal both in accuracy and in creation time.

I use a distance of 2 in my POC but making the distance variable should not have a disadvantage.

If someone could assist me with the research papers and figure out how to represent the FST efficiently,

The mafsa package at https://github.com/smartystreets/mafsa implements an FSA. I don't know how efficient it is though (it includes a functionality to serialize the mafsa into a binary representation as well).

jamra commented 8 years ago

Thanks @Shugyousha

When I get home tonight, I will play with the mafsa implementation. It's better to make something work and then optimize the index. Your idea with a variable distance search may need to happen during creation time since the automaton changes its size based on the distance.

dvirsky commented 8 years ago

Hi. If anyone is interested - I've ported my implementation to C and optimized it considerably (converted the simple automaton into a DFA which works very fast), as part of a search engine project built on top of Redis. https://github.com/RedisLabsModules/RediSearch/tree/master/src/trie

Also the trie implementation itself is much more memory efficient but way more complex. Porting it back to Go however should not be difficult, just tedious.

valsor commented 6 years ago

@dvirsky Do you have any plans to Port it to Go?