Using levenshtein algorithm instead of difflib

zuhanit commented 1 year ago

The difflib library is fast to ignore in small size, it going to slower more than more comparison targets. Below is benchmark result of most close matches from eudplib dictionary.

benchmark1

It's ok as small as a decimal point, it takes a time when get from large dictionary like below. On the other hand, using levenshtein takes little time.

Below is benchmark result using all sounds exist in Starcraft: Remastered. Each sound have long name, and SC:R have 35232 sounds even more.

benchmark2

Depending on the circumstances, the difference can be even greater.

benchmark3

For comparison, I made get_close_matches similar to difflib.get_close_matches function. The Levenshtein ratio is from python-levenshtein library.

def get_close_matches(word, possiblities, n=3, cutoff=0.6):
    """Get close matches by using levenshtein distance.

    Similar to difflib.get_close_matches() but this uses levenshtein distance,
    the faster approach.
    """

    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    for x in possiblities:
        ratio = Levenshtein.ratio(word, x)
        if ratio >= cutoff:
            result.append((ratio, x))

    result = nlargest(n, result)
    return [x for score, x in result]

You can benchmark yourself using below.

import pandas as pd
import timeit

setup = """
import Levenshtein
from eudplib.core.rawtrigger.strdict import DefUnitDict, DefAIScriptDict, DefImageDict, DefSoundDict
from difflib import get_close_matches as dmatches
from heapq import nlargest

def ematches(word, possiblities, n=3, cutoff=0.6):
    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    for x in possiblities:
        ratio = Levenshtein.ratio(word, x)
        if ratio >= cutoff:
            result.append((ratio, x))

    result = nlargest(n, result)
    return [x for score, x in result]
"""

def benchmark(name, func):
  print(f"Starting {name}")
  result = []
  start = timeit.default_timer()
  test = timeit.Timer(func, setup=setup)
  result.append(min(test.timeit(number = 1) for _ in range(7)))
  stop = timeit.default_timer()
  print(f"finished {name}, Runtime: {stop - start}")
  return result

def scorer_benchmark(s, dictname):
  time_levenshtein = benchmark(f"Levenshtein {dictname}", f"ematches(\"{s}\", {dictname})")
  time_difflib = benchmark(f"difflib {dictname}", f"dmatches(\"{s}\", {dictname})")

  df = pd.DataFrame(
    data = {
      "levenshtein": time_levenshtein,
      "difflib": time_difflib
    }
  )

  df.to_excel(f"benchmark_{dictname}_wrong.xlsx")

scorer_benchmark("Terran Aarine", "DefUnitDict")
scorer_benchmark(b"Terran Oustom Level", "DefAIScriptDict")
scorer_benchmark("Scourge", "DefImageDict")
scorer_benchmark("Interstitials/sounds/scHD_Interstitial_Terran_Out", "DefSoundDict")

You'll need DefSoundDict to run above. Becuase the content is so long, I'll upload at gist.

https://gist.github.com/zuhanit/b4bbab033076f70d75626983a93a6256

If any errors, please notify to me.

Thank you!

armoha commented 1 year ago

A bit worrying point is that Ratcliff/Obershelp algorithm (difflib uses) and Levenshtein algorithm (aka edit distance) have different notions of similarity and this change could affect suggestion quality, better or worse. difflib also takes junk characters into account, ignoring insignificant whitespace in calculating ratio.

Entry mismatch is compile error so performance shouldn't be a top priority here. Adding dict for 35232 sound file paths can help end users but we can do better by providing structured way like Pissed("unit name")[2], getting rid of supplying full file path.

armoha commented 1 year ago

By the way, EncodeSound should not raise compile error on following example:

DoActions(PlayWAV("new sound"))
MPQAddFile("new sound", open("file path", "rb").read())

We need to either postpone raising error and suggestion until end of collecting phase, or don't raise error at all (current behavior).

zuhanit commented 1 year ago

I will respect your opinion whether it's accepted or not. Thank you for kind answer!

armoha / euddraft

Using levenshtein algorithm instead of difflib #96