StringCompare is a Python package for efficient string similarity computation and approximate string matching. It is inspired by the excellent comparator and stringdist R packages, and from the equally excellent py_stringmatching, jellyfish, and textdistance Python packages.
The key feature of StringCompare is a focus on speed, extensibility and maintainability through its pybind11 C++ implementation. StringCompare is faster than most other Python libraries (see benchmark below) and much more memory efficient when dealing with long strings.
The complete API documentation is available on the project website olivierbinette.github.io/StringCompare.
Install the released version from github using the following commands:
pip install git+https://github.com/OlivierBinette/StringCompare.git@release
StringCompare currently implements edit distances and similarity functions, such as the Levenshtein, Damerau-Levenshtein, Jaro, and Jaro-Winkler distances. This is stage 1 of the following development roadmap:
Stage | Goals | Status |
---|---|---|
1 | pybind11 framework and edit-based distances (Levenshtein, Damerau-Levenshtein, Jaro, and Jaro-Winkler) | ✔️ |
2 | Token-based and hybrid distances (tf-idf similarity, LSH, Monge-Elkan, ...) | ... |
3 | Vocabulary optimizations and metric trees | ... |
4 | Embeddings and string distance learning | ... |
Comparison algorithms are instanciated as Comparator
object, which provides the compare()
method (equivalent to __call__()
) for string comparison.
from stringcompare import Levenshtein, Jaro, JaroWinkler, DamerauLevenshtein, LCSDistance
lev = Levenshtein(normalize=True, similarity=False)
lev("Olivier", "Oliver") # same as lev.compare("Olivier", "Oliver")
0.14285714285714285
Comparator objects also provide the elementwise()
function for elementwise comparison between lists
lev.elementwise(["Olivier", "Olivier"], ["Oliver", "Olivia"])
array([0.14285714, 0.26666667])
and the pairwise()
function for pairwise comparisons.
lev.pairwise(["Olivier", "Oliver"], ["Olivier", "Olivia"])
array([[0. , 0.26666667],
[0.14285714, 0.28571429]])
Comparison of the Damerau-Levenshtein implementation speed for different Python packages, when comparing the strings "Olivier Binette" and "Oilvier Benet":
from timeit import timeit
from tabulate import tabulate
# Comparison functions
from stringcompare import DamerauLevenshtein
cmp = DamerauLevenshtein()
from jellyfish import damerau_levenshtein_distance
from textdistance import damerau_levenshtein
functions = {
"StringCompare": cmp.compare,
"jellyfish": damerau_levenshtein_distance,
"textdistance": damerau_levenshtein,
}
table = [
[name, timeit(lambda: fun("Olivier Binette", "Oilvier Benet"), number=1000000) * 1000]
for name, fun in functions.items()
]
print(tabulate(table, headers=["Package", "avg runtime (ns)"]))
Package avg runtime (ns)
------------- ------------------
StringCompare 697.834
jellyfish 974.363
textdistance 3982.73
The use of pybind11 comes with a small performance overhead. We could be faster if we directly interfaced with CPython.
However, the use of pybind11 allows the library to be easily extensible and maintainable. The C++ implementation has little to worry about Python, excepted for the use of a pybind11 numpy wrapper in some places. Pybind11 takes care of the details of exposing the C++ API to Python.
pybind11 has compatibility issues with gcc 11 (e.g. on Ubuntu 21.10). If running Linux and gcc --version
is 11, then use the following commands to configure your environment before (re)installing:
sudo apt install g++-9 gcc-9
export CC=gcc-9 CXX=g++-9
If this is unsuccessful, you might want to use StringCompare within a Docker container. I recommend using the python:3.7.9 base image. For example, after installing docker, you can launch an interactive bash session and install StringCompare as follows:
sudo docker run -it python:3.7.9 bash
pip install git+https://github.com/OlivierBinette/StringCompare.git
python
>>> import stringcompare
Please report installation issues here.
StringCompare is currently in early development stage and contributions are welcome! See the contributing page for more information.
This project is made possible by the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and by the support of a G-Research grant.
I would also like to thank the support of my individual Github sponsors.