AndreasMadsen / editdistance

A much faster than the naïve levenshtein distance algoritme
MIT License
7 stars 1 forks source link

benchmarks #1

Closed kesla closed 9 years ago

kesla commented 9 years ago

How does this compare to some of the levenshtein-libraries that exists for node?

AndreasMadsen commented 9 years ago

I read all the other levenshtein implementations before implementing this one. As far as I can see they all uses the naïve approach O(n*m). Some are actually really bad and runs with O(max(n,m)^3) time complexity. The Berghel Roach Levenshtein algorithm uses O(n + d^2) (n longest string, d edit distance). You can check out http://berghel.net/publications/asm/asm.pdf for more details. But practice I have seen about a 80% performance increase in article (compared to the next fastest I could find). And this is measured on the entire processing stack, so it is quite a lot.

An important note is that the Berghel Roach algorithm supports more edit types than other algorithm so you may not get the same results.

You should also be aware that the levenshtein distance is an actual distance measure, so if you are searching for the smallest levenshtein distance (eq. spell correction) you can actually be very intelligent about your search, because you can use the triangle inequality rule.

However the subject is quite complex, the author mwyoung from GWT sums it up quite well in a code comment few people will ever find (http://google-web-toolkit.googlecode.com/svn-history/r8890/trunk/dev/core/src/com/google/gwt/dev/util/editdistance/GeneralEditDistances.java), the quote is below.

I may write a benchmark soon, but getting good real life data is quite hard. The approx. 80% is the best I can give you now.

  /*
   * As of 2007-08-23, the best algorithm known (to the author=mwyoung) for
   * short strings is one due to Eugene Myers, except for the special case
   * where the distance limit is 0 or 1.  The Myers algorithm also has good
   * worst-case performance for long strings when the edit distance is not
   * reasonably bounded.
   *
   * When there is a good bound, a variant of the Ukkonen algorithm due to
   * Berghel and Roach (modified by Michael Young to use linear space)
   * is faster for long strings.
   *
   * Note that other algorithms that perform better in some cases for running
   * text searches do not outperform Myers for rigid distance computations.
   * Notably:
   *   Navarro/Baeza-Yates (Algorithmica 23,2) simulates an NFA with an
   *   epsilon-cycle on the initial state (appropriate for running texts)
   *   and reports success without computing exact distance.  When adjusted
   *   to a fixed starting point and computing distance, its state machine
   *   is larger and it underperforms.
   *
   *   BITAP (Baeza-Yates/Gonnet, Manber/Wu) also simulates an NFA, and
   *   Navarro claims that it wins for small patterns and small limits for
   *   running search.  Experiments with a Java implementation showed that
   *   it beat Myers on pure string edit distance only for limits where the
   *   special 0-1 limit applied, where special-case comparison beats all.
   *
   * A survey of algorithms for running text search by Navarro appeared
   * in ACM Computing Surveys 33#1: http://portal.acm.org/citation.cfm?id=375365
   * Another algorithm (Four Russians) that Navarro claims superior for very
   * long patterns and high limits was not evaluated for inclusion here.
   * Filtering algorithms also improve running search, but do not help
   * for pure edit distance.
   */
AndreasMadsen commented 9 years ago

Okay so I just wrote a big benchmark suite using real life data from 258 new articles (see: https://github.com/AndreasMadsen/levenshtein-benchmark). The result is:

editdistance              : 7.61528 ms/file ± 8.72766
fast-levenshtein          : 8.95615 ms/file ± 10.6638
ld                        : 21.8346 ms/file ± 26.7389
levdist                   : 22.5638 ms/file ± 25.6466
leven                     : 3.00009 ms/file ± 4.06383
levenshtein-component     : 17.7153 ms/file ± 21.6396
levenshtein-edit-distance : 3.39528 ms/file ± 4.20907
levenshtein               : 39.6751 ms/file ± 63.4896
natural                   : 53.5635 ms/file ± 61.5447

So leven is apparently the fastest, however from looking at the code (which is really hard to read so I could be wrong) it don't support that many edit types. levenshtein-edit-distance appear to be almost identical to leven in terms of edit types.

Some of these also gives you the edit matrix which should be credited, but will require a slower algorithm.

Fun fact: none of these modules compare itself with this module, and the benchmarks they use are hilariously synthetic :)

kesla commented 9 years ago

@AndreasMadsen You rock, thanks!

kesla commented 9 years ago

Very interesting read, I must say.

AndreasMadsen commented 9 years ago

By the way, if this is really important you should compare using your own dataset. This algorithms speed depends very much on the amount edits required, while the other just depends on the string lengths. So if you have a high number of edits (like in my benchmark suite), it may be better to use a more naïve approach, since this might be easier for v8 to optimize.