eddysystems / eddy

eddy - autocorrect for java
Other
48 stars 6 forks source link

Use automata for incremental Levenshtein distance #14

Open eddysystems opened 9 years ago

eddysystems commented 9 years ago

From @girving on February 8, 2015 21:23

This is a pretty cool trick, and would dramatically optimize typo search:

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

Copied from original issue: eddysystems/secret#79

eddysystems commented 9 years ago

We may have to look at our distance a bit more carefully todo this. Because we have a swap cost which is typically lower than two replacements, ours may not actually be a distance at all (although I'm sure it could be modified to one fairly easily).

In particular, the cost of

abc -> acb is one swap acb -> cab is one swap But abc -> cab is three replacements (I think, or maybe two replacements and a swap), definitely more than two swaps.

That said, just going with a simpler typo cost will not hurt much either.

On Sunday, February 8, 2015, Geoffrey Irving notifications@github.com wrote:

This is a pretty cool trick, and would dramatically optimize typo search:

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

— Reply to this email directly or view it on GitHub https://github.com/eddysystems/eddy/issues/79.

eddysystems commented 9 years ago

From @girving on February 9, 2015 19:46

Yeah, our metric is more complicated, but I think that means we get all the more benefit if we can turn it into an automaton. If we have a mechanism to do that, it may let us abstract the definition of the metric into a much cleaner-to-reason about form, such as a description which takes naive exponential time to execute (code generation is a beautiful thing).

As to your example, that seems like a bug. I'm not sure if one theoretically wants a triangle-equality-satisfying metric in general, but that particular example doesn't seem like it should be the counterexample.

In the long run, and maybe for the next round, we could possibly just brute force train an automaton for a bunch of real data. That way we get the computational form we want, and with no incorrect baked in assumptions. The tricky part of that is that it can't quite be a raw finite state machine because its size has to be roughly independent of the number of characters in the alphabet.

eddysystems commented 9 years ago

I agree it is a bug, conceptually. It's there for efficiency: I don't know how to transform the distance computation into a pure distance (while keeping the different, lower swapping cost) without making it significantly more expensive.

Of course, if we can transform the "proper" version of the distance into something fast, that's not a concern.

Geoffrey Irving wrote:

Yeah, our metric is more complicated, but I think that means we get all the more benefit if we can turn it into an automaton. If we have a mechanism to do that, it may let us abstract the definition of the metric into a much cleaner-to-reason about form, such as a description which takes naive exponential time to execute (code generation is a beautiful thing).

As to your example, that seems like a bug. I'm not sure if one theoretically wants a triangle-equality-satisfying metric in general, but that particular example doesn't seem like it should be the counterexample.

In the long run, and maybe for the next round, we could possibly just brute force train an automaton for a bunch of real data. That way we get the computational form we want, and with no incorrect baked in assumptions. The tricky part of that is that it can't quite be a raw finite state machine because its size has to be roughly independent of the number of characters in the alphabet.

— Reply to this email directly or view it on GitHub https://github.com/eddysystems/eddy/issues/79#issuecomment-73576334.