mschoch opened 10 years ago
I wanted to help in the search for the author of that post and I think it might be this: https://twitter.com/dvirsky https://github.com/dvirsky
I don't know if it's the same person, but I think it might be.
tweet sent
I had misunderstood the comment, the implementation was of the "pre-deletion algorithm" not the automaton. So, there is still a fun project for someone interested in making this faster.
I'm the aforementioned dude. hi.
a. I've been looking for an excuse to write the Levenshtein thing in Go. This might be it :)
b. Have you tried Faroo's approach? I'm not familiar with this project so I'm not sure what's going on inside there.
I don't see any reason why Faroo's approach would not work. No idea how it compares performance-wise, but one advantage it would seem to have over the automaton is that it can be updated in place, whereas the automaton has to be rebuilt to accommodate new terms.
The main questions I have are about how the integration would actually work.
Their blog post says they just add pre-deleted records to their index, pointing directly to the right document (and probably with a flag specifying these are "fuzzy" records, and a reduced weight to the record based on the L-distance). It has a huge memory/disk overhead from my experience, but it's dead simple to add this to an inverted index and will cost hardly any CPU time.
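For concreteness, here is a minimal sketch of that pre-deletion idea in Go (the names `deletes`, `Build`, and `Lookup` are invented for illustration; a real implementation would also store the flags and reduced weights described above, and verify candidates with a final edit-distance check):

```go
package main

import "fmt"

// deletes returns term plus every variant reachable by deleting up to
// maxDist characters from it (the "pre-deletion" trick described above).
func deletes(term string, maxDist int) map[string]struct{} {
	out := map[string]struct{}{term: {}}
	if maxDist == 0 {
		return out
	}
	for i := 0; i < len(term); i++ {
		for d := range deletes(term[:i]+term[i+1:], maxDist-1) {
			out[d] = struct{}{}
		}
	}
	return out
}

// Index maps each delete-variant back to the dictionary terms that
// produced it.
type Index map[string][]string

// Build adds every pre-deleted variant of every term to the index.
func Build(terms []string, maxDist int) Index {
	idx := Index{}
	for _, t := range terms {
		for d := range deletes(t, maxDist) {
			idx[d] = append(idx[d], t)
		}
	}
	return idx
}

// Lookup generates the query's own deletes and probes the index; a
// production version would re-verify candidates with a real
// edit-distance computation, since delete-overlap can over-match.
func (idx Index) Lookup(query string, maxDist int) []string {
	seen := map[string]struct{}{}
	var out []string
	for d := range deletes(query, maxDist) {
		for _, t := range idx[d] {
			if _, ok := seen[t]; !ok {
				seen[t] = struct{}{}
				out = append(out, t)
			}
		}
	}
	return out
}

func main() {
	idx := Build([]string{"barack", "obama"}, 1)
	fmt.Println(idx.Lookup("barrack", 1)) // [barack]
}
```

The memory/disk overhead mentioned above comes from `Build`: every term spawns O(len^maxDist) variants, but each lookup is just a handful of hash probes, which is why it costs hardly any CPU time.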
BTW I'm more interested in using this or the automaton approach for fuzzy autocomplete, e.g. if you type "varac" you will get suggestions for "barack obama". I've written an auto-complete engine in the past that uses Norvig's approach, but at ~10 characters it becomes so slow to brute-force candidates and rank them that it was impractical. And this [faroo's] approach seems to be too expensive in terms of RAM.
So in that case I would think that...
The advantage of this is not a lot of new code paths. Almost everything fits as is. The decision between 2a and 2b to me has to do with just how much term explosion there is. It sounds like it's a lot, and I worry that it would slow down the non-fuzzy searches because they'd have to skip over all these uninteresting rows. By going with 2b, we would prefix the rows differently so that non-fuzzy searches never hit that part of the index.
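To make the 2b idea concrete, here is a purely hypothetical sketch of differently-prefixed row keys (the `t/` and `f/` layout is invented for illustration and is not bleve's actual row format):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Hypothetical key layout: ordinary term rows under "t/", pre-deleted
// fuzzy rows under "f/". Because the prefixes sort apart in an ordered
// k/v store, a non-fuzzy scan over the "t/" range never has to skip
// over fuzzy rows.
func termRowKey(field, term string) string { return "t/" + field + "/" + term }
func fuzzyRowKey(field, del string) string { return "f/" + field + "/" + del }

func main() {
	keys := []string{
		termRowKey("body", "banana"),
		fuzzyRowKey("body", "anana"), // pre-deleted variant of "banana"
		fuzzyRowKey("body", "bnana"),
		termRowKey("body", "cabana"),
	}
	sort.Strings(keys)
	// Simulate a non-fuzzy scan: only the "t/" range is visited.
	for _, k := range keys {
		if strings.HasPrefix(k, "t/") {
			fmt.Println(k)
		}
	}
}
```

With option 2a (no separate prefix), the same scan would have to iterate and discard every fuzzy row interleaved with the real terms.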
@dvirsky maybe open a separate issue to discuss auto-complete. its definitely a feature we'd like to add to bleve as well.
@mschoch do you have auto-complete implemented in any form?
It's not exposed right now because of some rough edges, but you can get an iterator for the terms used in a field. So, it would be feasible to offer a term-prefix-based suggestion. My understanding is that while this is somewhat simple to implement, it's not that useful in practice.
And of course we should probably distinguish between dictionary-based suggestions that are based entirely on what's in the index, vs. usage-based suggestions (like Google's), which are based on things actually searched for, independent of what actually exists.
It's a pretty wide-open area of the project; if you want to help shape that part, we'd be glad to have contributions.
Yeah, naive term-based completion generates a lot of junk. It might be used as a seed for query-based completion (which needs LOTS of data to not look ridiculous).
But if you filter terms based on some NLP pre-processing, and perhaps known entity detection and synonym expansion, you will get much much nicer results to boot.
Another problem is composite completion. Let's say you have a site with songs and artists. When a user types "madonna like" you'll probably want to complete to "like a virgin". But no user has searched for this before, and the term never appears verbatim in the index. It's possible to infer context for completion, but it's not trivial. Google does that (of course...) but they only added it 3-4 years ago IIRC.
I have Go code for a simple prefix/suffix based completion engine with ranking [https://bitbucket.org/dvirsky/boilerdb/src/160b1047f1a73d6f72cd967aeceef055ad7ca77d/src/plugins/prefix_tree/trie.go?at=master], but indeed the problem is how you're going to populate that index.
@mschoch I've just stumbled upon this library. I haven't looked too deeply at it, but it seems very relevant: https://github.com/smartystreets/mafsa
Thanks, that does look to be very relevant. If anyone wants to prototype something using this library that would be awesome. It looks pretty high-level and easy to use...
The issue here is memory overhead, compared to the disk overhead of Faroo's approach. I'll try to fill it up with a bunch of words to see the memory consumption. It means that working with fuzzy search will require an in-memory side-index. But if you can get spelling corrections out of the box with that, it might be worth it.
The most recent commit https://github.com/blevesearch/bleve/commit/522f9d5cc7cf698cf9e0d288ea4a04b45489d4fd now lets us enumerate the terms used in a field. Further it produces them in ascending order, which is necessary for building many of the data structures like mafsa and other similar ones.
It should now be possible to do some fuzzy matching and suggestions outside of bleve, using the term dictionary found inside of bleve. Tighter integration within bleve remains open for the future.
Some more links from @dgryski in #bleve
Approximate String Processing http://hadjieleftheriou.com/papers/FnT_HS11.pdf
Damn Cool Algorithms: Levenshtein Automata http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
I just came across this blog, with straightforward python code (40 lines or so) claiming to be a reasonably efficient levenshtein automata: http://julesjacobs.github.io/2015/06/17/disqus-levenshtein-simple-and-fast.html
Looks cool, seems like porting this to go should be a breeze. I'll try to do it and see how it works, and report back.
@mschoch is there a trie representation of the dictionary to be searched in bleve?
Porting the python code would indeed be trivial, but it won't work well unless we traverse a trie of all the words that exist in the index.
There is some discussion on how accurate the runtime analysis presented is: https://news.ycombinator.com/item?id=9737554
@dgryski haven't read it yet, but for one, the method in this blog post would probably cause tons of GC in Go; it constantly allocates vectors of integers dynamically.
Perhaps https://godoc.org/golang.org/x/tools/container/intsets or https://godoc.org/github.com/google/codesearch/sparse would help? (I haven't read it yet either ;)
@dgryski the longer version of the algo is a "sparse levenshtein automaton" that drops from the vectors any distances greater than the max limit, so it will reduce allocation. The guy implements them as 2 python lists, but maybe an intset would be better. Pooling those sets might also reduce allocations.
The algo in the post doesn't deal with working alongside a trie, but it should be simple to extend - although you'll need a stack of copied state vectors, which will also cause GC. Anyway, I'll start by porting the basic code and work my way from there. The algorithm itself should be pretty fast from the looks of it.
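For anyone following along, a direct Go transliteration of the blog post's sparse automaton might look roughly like this (a sketch of the algorithm from the post, not any particular repo's code; names like `SparseAutomaton` and `State` are my own):

```go
package main

import "fmt"

// SparseAutomaton is a rough Go port of the sparse Levenshtein
// automaton from the blog post: a state keeps only the positions of
// the classic DP row whose edit distance is still <= max.
type SparseAutomaton struct {
	str string
	max int
}

// State holds the surviving row positions and their distances.
type State struct {
	indices []int
	values  []int
}

// Start returns the initial state (row values 0..max).
func (a *SparseAutomaton) Start() State {
	var idx, vals []int
	for i := 0; i <= a.max && i <= len(a.str); i++ {
		idx = append(idx, i)
		vals = append(vals, i)
	}
	return State{idx, vals}
}

// Step advances the automaton by one input byte c.
func (a *SparseAutomaton) Step(s State, c byte) State {
	var ni, nv []int
	if len(s.indices) > 0 && s.indices[0] == 0 && s.values[0] < a.max {
		// c consumed as an extra edit before position 0.
		ni = append(ni, 0)
		nv = append(nv, s.values[0]+1)
	}
	for j, i := range s.indices {
		if i == len(a.str) {
			break
		}
		cost := 1
		if a.str[i] == c {
			cost = 0
		}
		val := s.values[j] + cost // match / substitution
		if len(ni) > 0 && ni[len(ni)-1] == i {
			if v := nv[len(nv)-1] + 1; v < val { // cheaper via neighbor
				val = v
			}
		}
		if j+1 < len(s.indices) && s.indices[j+1] == i+1 {
			if v := s.values[j+1] + 1; v < val { // cheaper via next survivor
				val = v
			}
		}
		if val <= a.max {
			ni = append(ni, i+1)
			nv = append(nv, val)
		}
	}
	return State{ni, nv}
}

// IsMatch reports whether the end of the string survived with
// distance <= max.
func (a *SparseAutomaton) IsMatch(s State) bool {
	return len(s.indices) > 0 && s.indices[len(s.indices)-1] == len(a.str)
}

func main() {
	a := &SparseAutomaton{str: "banana", max: 1}
	match := func(w string) bool {
		s := a.Start()
		for i := 0; i < len(w); i++ {
			s = a.Step(s, w[i])
		}
		return a.IsMatch(s)
	}
	fmt.Println(match("bananas"), match("cabana")) // true false
}
```

Note that `Step` allocates two fresh slices per input byte, which is exactly the GC pressure mentioned above; pooling or reusing those slices would be the first optimization.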
@mschoch @dgryski here's a naive, unoptimized go implementation of the sparse automaton from the blog post. It needs intersection with a trie to be worth anything, I'll try to get to that later or tomorrow.
https://github.com/dvirsky/levenshtein
The package includes a test that doesn't actually test much, but runs the example code from the blog post.
Ok, I've implemented matching against a trie, and it works. I haven't benchmarked it against a big dictionary - and the trie will probably take up lots of RAM in its current form.
Example usage:
func ExampleTrie() {
	trie := NewTrie()
	trie.Insert("banana")
	trie.Insert("bananas")
	trie.Insert("cabana")
	trie.Insert("cabasa")
	fmt.Println(trie.FuzzyMatches("banana", 2))
	// Output:
	// [banana bananas cabana]
}
Using a compressed trie (patricia tree, like https://github.com/armon/go-radix ) should help significantly with the space.
yeah, I used the simplest naive trie, just to check the whole flow and correctness of the algorithm. So basically as a POC it works; now it needs some work to be made production-grade. I'll probably work on it some more tonight, and test it with a bigger data set :)
BTW this can be pretty easily extended to an autocomplete service with "spell as you type" corrections.
Awesome progress! Sorry for the late reply, re: trie. Currently bleve doesn't have anything. What we do have is a Field Dictionary Iterator interface which you can use to build the trie with data from the index.
I did a little bit of experimenting with sajari/fuzzy and smartystreets/mafsa. I think there are still a lot of open questions about how an auto-complete/spell-correct interface for bleve should look. I do like the idea that users can either replace or augment the term dictionary data with other data (sometimes you have better data from search logs and/or want to re-weight things based on other information).
Further, the term dictionary info we get now is a one-time snapshot. So, if you continue to update your index, your auto-completer/suggester goes stale. The data structures tend to matter a lot here, because many of them are either expensive to create, hard to modify after being built, or both. For a lot of use cases this is OK; you may not need to immediately see a new term show up in the suggestions, and can tolerate rebuilding the suggestions periodically.
The overall take-away here is that I'm very open to suggestions. And rather than get stuck on coming up with something perfect, I'm also very open to us just plugging something in to see how we like it.
@mschoch why not use sajari's package then? it seems pretty good.
also, how do you see the integration of such a feature with bleve? as a separate service? part of the library?
I'm not opposed to using sajari's package. I actually thought it was just using a map underneath the hood though, and wouldn't yield the benefits of these more advanced data-structures. But, if the recommendation is that this is good and we should use it, I'm fine with that.
As for integration, I still have more questions than answers. I see 3 levels of interest:

- The lowest level is separate from the top-level bleve package. This would be for the building blocks, things that you could reuse outside of bleve, but which bleve will use to build on.
- Useful functionality exposed at the top-level bleve package. At the simplest level we should make it easy for users to get suggestions from their term dictionary. For more advanced users we should make it possible to augment this to provide better suggestions.
- At an even higher level there should be some composability. We already support searching across multiple indexes with the index alias feature. This is useful for sharding indexes. But, of course, if you shard your data, your term dictionary can vary across the shards. So getting suggestions can require similar scatter/gather requests.

Finally, as described so far, all these auto-completion/suggester instances are something you programmatically create. But, because they can be expensive to create/update, there is a case to be made for them to have some persistence, possibly along with the index itself (not necessarily in the k/v store, but colocated on the filesystem). At that point, I'm wondering if their existence should somehow be part of the index mapping.
As I said, all of this is open for discussion. So if you have thoughts on any of these topics, please share them.
Looking at the sajari code, it looks like they're using Faroo's algorithm (see waaay above in this thread), which is indeed lightning fast (it just saves all possible deletes of each word, and looks them up), but has huge memory overhead, especially with long strings.
They claim to do 40µs per search. Using their test dictionary and a more optimized version of my trie lookup, I'm able to do 1-2ms per search on the same vocabulary, depending on word length. BTW for very short words (2-4 chars probably) you can do a single-distance lookup, which can be 10x faster.
We need to compare RAM and time characteristics for both methods and see which is the right way to optimize.
Re: the use case - I'm not very familiar with Bleve's architecture, but how about using it for query expansion, not only correction? You can simply search for all the fuzzy-matched "relatives" of the query, and perhaps rank those results lower somehow. This can help on non-errors as well, i.e. crude stemming (although maybe a 1-distance search should be used for that).
Sorry if this is old news, but at this year's FOSDEM there was an interesting presentation about using Lucene's FSTs (I actually think that is a misnomer and it should be FSA, but it's probably not important) for spelling correction.
What they do is intersect a Lucene FST with an automaton built from the user query to get their spelling corrections. If one uses the field terms to build the FST, it would be usable for fuzzy search as well, if I understand correctly. The whole thing seems to break down if there is a misspelling in the first char of the user query, though.
I haven't contributed to this project yet, but am interested. Is this still something that's being worked on?
Welcome @jamra
There is a lot of interest in this, and a few of us have experimented with wiring up various implementations. I still have questions about the best way to wire it up to the rest of Bleve. I think there isn't a lot of consensus here yet, as it seems like we all want to accomplish something slightly different.
In summary, I'd say if you're interested, there is still work to be done :)
@mschoch Thanks! :)
I will need to get more situated with how Bleve does its indexing, but I like the material that was linked by this issue. One of the situations brought up in the video linked by Shugyousha is an example of having 50 edit-distance-of-one search results while only wanting to return 25 or so. Is there any consideration of matching by multiple words (I think the video calls it biword matching)?
I have implemented a few fuzzy search algorithms, which have very different performance characteristics.
I would like to assist with this issue in particular if someone can perhaps lay down some guidelines in terms of respecting the dynamically changing indices and ranking results.
@jamra my hope was that eventually we would have a well defined interface, then you could plug in the best implementation for a given use case. But, we've kinda hesitated on this issue for a long time. I'd love to see us make progress without worrying about solving everything at once.
What I'd love to see is a proposal for how someone thinks the API should look. The core use case is someone who has already loaded their data into Bleve, and now wants to get completions/suggestions from it. I realize people want to do a lot more than just that, but I think it's a starting point.
@mschoch Without creating a special index for fuzzy matching, we would have to traverse the entire list of words. I tried to search through the codebase to see how indexing works, but it's taking me some time to understand the code. Is there any documentation on indexing?
Hmmn, maybe there is still some disconnect about what we're talking about then.
My thought was that the suggester/completer was an in-memory structure built for the purpose of efficiently answering these fuzzy/incomplete queries. So, I assumed they start from inefficient lists of words, and build the fast-to-query indexes. Some of these may also be able to be persisted to disk, for easy reuse upon process restart, but some may not support that and may have to be rebuilt.
I'm not quite sure what kind of documentation you're looking for. Is there some particular aspect of indexing you're trying to understand?
No misunderstanding. That's what I'm looking to do as well. I think I'm using the word indexing incorrectly. I was trying to figure out how to get access to the list of words in order to build the in-memory structure.
I suppose there are two stages to this in-memory structure. Building the structure and then querying the structure.
So, the methods you're interested in are:
https://github.com/blevesearch/bleve/blob/master/index.go#L98-L100
They return an enumerator-like object called a FieldDict, which returns DictEntry objects, in order.
https://github.com/blevesearch/bleve/blob/master/index/index.go#L86-L94
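To illustrate how a suggester might drain that enumerator into the sorted term list that structures like mafsa need, here is a sketch using local stand-in types (they only mirror my understanding of the shape of bleve's `FieldDict`/`DictEntry`; in particular, the exhaustion-signaled-by-nil-entry convention is an assumption, so check the real interface):

```go
package main

import "fmt"

// DictEntry is a local stand-in mirroring bleve's index.DictEntry
// (term plus its document count).
type DictEntry struct {
	Term  string
	Count uint64
}

// FieldDict is a local stand-in for the enumerator-like interface.
type FieldDict interface {
	Next() (*DictEntry, error)
	Close() error
}

// sliceDict fakes a FieldDict over an already-sorted term list.
type sliceDict struct {
	entries []DictEntry
	pos     int
}

func (d *sliceDict) Next() (*DictEntry, error) {
	if d.pos >= len(d.entries) {
		return nil, nil // assumed convention: nil entry means exhausted
	}
	e := &d.entries[d.pos]
	d.pos++
	return e, nil
}

func (d *sliceDict) Close() error { return nil }

// collectTerms drains a FieldDict into an ascending term slice, which
// is exactly what MA-FSA-style builders require as input.
func collectTerms(d FieldDict) ([]string, error) {
	defer d.Close()
	var terms []string
	for {
		e, err := d.Next()
		if err != nil {
			return nil, err
		}
		if e == nil {
			break
		}
		terms = append(terms, e.Term)
	}
	return terms, nil
}

func main() {
	d := &sliceDict{entries: []DictEntry{{"apple", 3}, {"banana", 1}}}
	terms, _ := collectTerms(d)
	fmt.Println(terms) // [apple banana]
}
```

Because the terms already arrive in ascending order, the slice can be fed straight into an ordered-insert builder without an extra sort.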
Another take on Levenshtein automata: http://fulmicoton.com/posts/levenshtein/
Adding another reference about these type of automatons, this one is implemented in Rust, but the blog post includes a lot of useful details: http://blog.burntsushi.net/transducers/
To come back to this topic...

On Mon, Jun 22, 2015 at 01:26:35PM -0700, Marty Schoch wrote:

> I'm not opposed to using sajari's package. I actually thought it was just using a map underneath the hood though, and wouldn't yield the benefits of these more advanced data-structures. But, if the recommendation is that this is good and we should use it, I'm fine with that.
>
> As for integration, I still have more questions than answers. I see 3 levels of interest:
>
> - The lowest level is separate from the top-level bleve package. This would be for the building blocks, things that you could reuse outside of bleve, but which bleve will use to build on.

I have combined the levenshtein automaton implemented in the https://github.com/dvirsky/levenshtein package with https://github.com/smartystreets/mafsa as a Trie alternative. The result has been merged into dvirsky's code (the structure is called MinTree) and it should theoretically provide better memory usage than the naive Trie he started out with (though I have not benchmarked it).

> - Useful functionality exposed at the top-level bleve package. At the simplest level we should make it easy for users to get suggestions from their term dictionary. For more advanced users we should make it possible to augment this to provide better suggestions.

I have added a FuzzySearch implementation that uses the MinTree structure mentioned above to bleve as a proof of concept (POC) and added a benchmark. The benchmark shows a speedup of about 20x compared to the current implementation, but this does not include the time needed to populate the MinTree structure with the terms of the FieldDict. The code can be found here:

https://github.com/Shugyousha/bleve/tree/mafsafuzzy

The POC uses the upside_down.IndexReader to expose the functionality and the FieldCache to store the MinTrees built for each field. The following questions/issues remain and should be discussed.

> - At an even higher level there should be some composability. We already support searching across multiple indexes with the index alias feature. This is useful for sharding indexes. But, of course, if you shard your data, now your term dictionary can vary across the shards. So, getting suggestions can require similar scatter/gather requests.

I have not thought about this at all yet...

Cheers,
Silvan
I like the Rust implementation linked by @mschoch
I went through the source code, but my Rust knowledge is non-existent. He links to two research papers explaining how to build an FSA and then an FST. I started the first paper, but am too busy to really get into it. The requirements are to insert terms in lexicographic order. The index is compressed wonderfully and his implementation can store the index on disk or in memory (which we should too). Another benefit is we can use that same index to match regex queries. The key is intersecting automatons.
I peeked into Bleve's source code and I see that we have only one implementation of an index. I suppose we could use the new method offered by @Shugyousha, but some of the literature states that a distance of 2 is optimal both in accuracy and in creation time.
If someone could assist me with the research papers and figure out how to represent the FST efficiently, I would be able to help out. Otherwise, I'll have to wait for my day job to get less busy :P
On Tue, Mar 29, 2016 at 01:30:45PM -0700, Jacob Amrany wrote:

> I like the Rust implementation linked by @mschoch
>
> I went through the source code, but my Rust knowledge is non-existent. He links to two research papers explaining how to build an FSA and then an FST. I started the first paper, but am too busy to really get into it. The requirements are to insert terms in lexicographic order. The index is compressed wonderfully and his implementation can store the index on disk or in memory (which we should too). Another benefit is we can use that same index to match regex queries. The key is intersecting automatons.

The mafsa package I used in the POC implements a "minimal acyclic finite state automaton" already and then uses a levenshtein automaton based on sparse vectors to generate the fuzzy strings. I think the sparse vector approach used in the levenshtein automaton is similar in behavior and performance to doing an FSA intersection, but I am not 100% sure.

> I peeked into Bleve's source code and I see that we have only one implementation of an index. I suppose we could use the new method offered by @Shugyousha, but some of the literature states that a distance of 2 is optimal both in accuracy and in creation time.

I use a distance of 2 in my POC, but making the distance variable should not have a disadvantage.

> If someone could assist me with the research papers and figure out how to represent the FST efficiently,

The mafsa package at https://github.com/smartystreets/mafsa implements an FSA. I don't know how efficient it is, though (it also includes functionality to serialize the mafsa into a binary representation).
Thanks @Shugyousha
When I get home tonight, I will play with the mafsa implementation. It's better to make something work first and then optimize the index. Your variable-distance idea may need to happen at creation time, since the automaton changes its size based on the distance.
Hi. If anyone is interested - I've ported my implementation to C and optimized it considerably (converted the simple automaton into a DFA which works very fast), as part of a search engine project built on top of Redis. https://github.com/RedisLabsModules/RediSearch/tree/master/src/trie
Also the trie implementation itself is much more memory efficient but way more complex. Porting it back to Go however should not be difficult, just tedious.
@dvirsky Do you have any plans to port it to Go?
See: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html
Also, according to this post:
https://news.ycombinator.com/item?id=8202764
The author has already written a Go implementation. Unfortunately I have no way to reach this person.