anhaidgroup / py_stringmatching

A comprehensive and scalable set of string tokenizers and similarity measures in Python
https://sites.google.com/site/anhaidgroup/projects/py_stringmatching
BSD 3-Clause "New" or "Revised" License

py_stringmatching's jaro_winkler is slower than pure-Python jellyfish's jaro_winkler #55

Closed: fjsj closed this issue 5 years ago

fjsj commented 5 years ago

Hi, I suspect there's something wrong with cython_jaro_winkler. It looks much slower than a pure-Python implementation such as the one in the jellyfish project.

On my machine, macOS Mojave 10.14.2 (18C54):

In [1]: import timeit

In [2]: timeit.Timer('jw.get_raw_score(\'DIXON\', \'DICKSONX\')', setup='from py_stringmatching.similarity_measure.jaro_winkler import JaroWinkler; jw = JaroWinkler()').timeit(number=10000)
Out[2]: 0.1117161939619109

In [3]: timeit.Timer('jellyfish.jaro_winkler(\'DIXON\', \'DICKSONX\')', setup='import jellyfish').timeit(number=10000)
Out[3]: 0.004220786038786173

I also tested on an Ubuntu 16.04.4 server and got similar results.

Versions: Python 3.6.7, jellyfish==0.6.1, py-stringmatching==0.4.0
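For scale, the two totals reported above can be converted to per-call times. This is only a minimal arithmetic sketch over the numbers quoted in this report; the helper name per_call_microseconds is hypothetical and not part of either library. It works out to roughly 11 microseconds per call for py_stringmatching against well under 1 microsecond for jellyfish, a gap of about 26x.

```python
# Totals copied from the timeit output quoted in this report.
NUMBER = 10000
py_stringmatching_total = 0.1117161939619109  # seconds for NUMBER calls
jellyfish_total = 0.004220786038786173        # seconds for NUMBER calls


def per_call_microseconds(total_seconds, number):
    """Convert a timeit total (seconds) into microseconds per call.

    Hypothetical helper for illustration only.
    """
    return total_seconds / number * 1e6


psm_us = per_call_microseconds(py_stringmatching_total, NUMBER)
jf_us = per_call_microseconds(jellyfish_total, NUMBER)
ratio = py_stringmatching_total / jellyfish_total

print(f"py_stringmatching: {psm_us:.2f} us/call")
print(f"jellyfish:         {jf_us:.3f} us/call")
print(f"ratio:             {ratio:.1f}x")
```

A single timeit run like this can be noisy; taking the minimum over several timeit.repeat runs is a common way to get a steadier figure.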

fjsj commented 5 years ago

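Never mind. I was misled by the jellyfish source code. It does have a C version, which a plain import jellyfish uses when it is available, so my benchmark above was not timing the pure-Python code. To time the pure-Python implementation, the correct invocation is: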

import timeit
timeit.Timer('jellyfish.jaro_winkler(\'DIXON\', \'DICKSONX\')', setup='from jellyfish import _jellyfish as jellyfish').timeit(number=10000) 

This gives results similar to py_stringmatching's.
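A mix-up like this can sometimes be caught before benchmarking by checking what kind of object a function is. The following is a minimal sketch using only the standard library; the helper name implementation_kind is hypothetical and is not part of jellyfish or py_stringmatching. It is an assumption that the function of interest is exposed directly: a Python method that merely wraps compiled code will still be reported as pure Python, and functions built by some extension tools may land in the "other" bucket.

```python
import inspect
import math


def implementation_kind(func):
    """Roughly classify a callable as compiled or interpreted.

    Hypothetical helper for illustration; the check is a heuristic.
    """
    if inspect.isbuiltin(func):
        # Functions implemented in C report as builtins.
        return "compiled (C builtin)"
    if inspect.isfunction(func) or inspect.ismethod(func):
        # Ordinary def/lambda functions and bound Python methods.
        return "pure Python"
    return "other"


def py_add(a, b):
    """A plain Python function used as a comparison point."""
    return a + b


print(implementation_kind(math.sqrt))
print(implementation_kind(py_add))
```

Under the same assumptions, passing the function being timed (for example, whatever jaro_winkler resolves to after the import in the setup string) would indicate which implementation the benchmark exercises.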

anhaid commented 5 years ago

Hi. Thank you very much for using our code. I'm glad that this issue has been resolved, and sorry for the late reply; we have been on winter break :). The Magellan team.