theronic opened this issue 6 years ago
Hello @theronic. Basically, all of the library's algorithms can be thought of as `O(n)`, n being the size of your strings, except for Levenshtein, which is `O(m * n)` and thus effectively `O(n^2)` in most cases.

To sum up: every distance except Levenshtein is `O(m + n)`, where m is the length of the first string and n the length of the second (this obviously simplifies to `O(n)`). And every phonetic algorithm & stemmer is `O(n)`, where n is the length of the given string.
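For intuition, here is a minimal Levenshtein sketch in TypeScript (names are illustrative): the classic dynamic-programming table has `(m + 1) * (n + 1)` cells, each filled in `O(1)`, which is where the `O(m * n)` comes from.

```typescript
function levenshtein(a: string, b: string): number {
  // prev[j] holds the distance between the current prefix of a and b[0..j].
  // Keeping only one previous row reduces memory from O(m * n) to O(n),
  // but the time stays O(m * n): one cell per pair of characters.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein("kitten", "sitting")); // 3
```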
That said, Big O is usually not very useful for determining the practical performance of these algorithms: they are trivial in complexity, but can hide a fair number of constant-factor operations under the hood. What I mean is that computing a `metaphone` code is more expensive than computing the `soundex` one, because the latter needs far fewer operations per character while still having the same overall time complexity.
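To make the constant-factor point concrete, here is a minimal, slightly simplified sketch of the classic American Soundex (not the library's actual implementation): a single linear pass, one table lookup per character, and a fixed-size output. Metaphone has the same `O(n)` shape, but each character triggers much heavier contextual rule matching.

```typescript
// Letter-to-digit table for classic Soundex. Vowels, h, w and y are absent
// and therefore map to "" (dropped from the code).
const CODES: Record<string, string> = {
  b: "1", f: "1", p: "1", v: "1",
  c: "2", g: "2", j: "2", k: "2", q: "2", s: "2", x: "2", z: "2",
  d: "3", t: "3",
  l: "4",
  m: "5", n: "5",
  r: "6",
};

function soundex(word: string): string {
  const s = word.toLowerCase().replace(/[^a-z]/g, "");
  if (!s) return "";
  let out = s[0].toUpperCase();        // the first letter is kept as-is
  let prev = CODES[s[0]] ?? "";
  for (const ch of s.slice(1)) {
    const code = CODES[ch] ?? "";
    if (code && code !== prev) out += code; // skip vowels and repeated codes
    // h and w do not separate duplicate codes; vowels do reset them.
    if (ch !== "h" && ch !== "w") prev = code;
    if (out.length === 4) break;
  }
  return out.padEnd(4, "0");
}

console.log(soundex("McDonalds")); // "M235"
```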
What's more, it's rather the difficulty of the overall fuzzy-matching task that must be considered. Are you performing clustering? Fuzzy search? Most naive versions of the algorithms solving these problems are quadratic, since you are computing pairwise information, which makes the complexity of the individual stemmers/distances almost irrelevant. But to be able to scale up, you need guarantees on the distances, such as the triangle inequality, and this often means you will accept using Levenshtein even though its complexity is bad. I could help you a bit more if you want to detail your use case.
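As one concrete example of exploiting the triangle inequality, here is a hypothetical BK-tree sketch (reusing the `levenshtein` function from the sketch above): because Levenshtein is a true metric, a search within distance `maxDist` of a query only needs to descend into children whose edge weight lies in `[d - maxDist, d + maxDist]`, pruning most of the index instead of comparing against every entry pairwise.

```typescript
interface BKNode {
  word: string;
  children: Map<number, BKNode>; // keyed by distance to this node's word
}

function bkInsert(root: BKNode, word: string): void {
  let node = root;
  for (;;) {
    const d = levenshtein(node.word, word);
    if (d === 0) return; // word already present
    const child = node.children.get(d);
    if (!child) {
      node.children.set(d, { word, children: new Map() });
      return;
    }
    node = child;
  }
}

function bkSearch(root: BKNode, query: string, maxDist: number): string[] {
  const results: string[] = [];
  const stack: BKNode[] = [root];
  while (stack.length > 0) {
    const node = stack.pop()!;
    const d = levenshtein(node.word, query);
    if (d <= maxDist) results.push(node.word);
    // Triangle inequality: only subtrees whose edge weight is within
    // maxDist of d can possibly contain a match.
    for (const [weight, child] of node.children) {
      if (weight >= d - maxDist && weight <= d + maxDist) stack.push(child);
    }
  }
  return results;
}

const root: BKNode = { word: "business", children: new Map() };
for (const w of ["busines", "expense", "expence"]) bkInsert(root, w);
console.log(bkSearch(root, "expanse", 2)); // ["expense", "expence"] (order may vary)
```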
Also, people usually tend to focus more on an algorithm's precision and its suitability for the use case than on its inherent complexity. This library is now a bit old and I have moved on to developing more algorithms for the Talisman library. It comes with interactive documentation that lets you try the algorithms online, so you can better choose one.
Great to see a library like this. I would love to see the Big-O performance of each fuzzy algorithm displayed so I know what size of data I can use it for, along with maybe some advice about pros and cons.
I'm doing some fuzzy matching for accounting purposes ("McDonalds": "is this a business expense? Probably not") and I wouldn't know which algorithm to pick to save the most time.
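One hypothetical way to handle a use case like this without any pairwise distance computation: precompute a phonetic key for each known merchant and match incoming statement lines on the key. The sketch below reuses the `soundex` function from earlier; `KNOWN` and `classify` are illustrative names only. Classifying a line then costs one `O(n)` key computation plus an `O(1)` lookup. (Phonetic keys can collide, so a real pipeline would typically confirm candidates with a distance check.)

```typescript
// Known merchants and whether each is a business expense (illustrative data).
const KNOWN: [string, boolean][] = [
  ["McDonalds", false],   // personal, not a business expense
  ["Office Depot", true], // business expense
];

// Index: soundex key -> isBusinessExpense.
const MERCHANT_INDEX: Record<string, boolean> = {};
for (const [name, isBusiness] of KNOWN) MERCHANT_INDEX[soundex(name)] = isBusiness;

function classify(raw: string): boolean | undefined {
  // O(length of raw) to compute the key, O(1) to look it up.
  return MERCHANT_INDEX[soundex(raw)]; // undefined = unknown merchant
}

console.log(classify("MacDonald's")); // false (same "M235" key as "McDonalds")
```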