anhaidgroup / py_stringsimjoin

Scalable String Similarity Joins in Python
BSD 3-Clause "New" or "Revised" License

py_stringsimjoin

This project seeks to build a Python software package that provides scalable implementations of string similarity joins over two tables for commonly used similarity measures such as Jaccard, Dice, cosine, overlap, overlap coefficient, and edit distance. The package is free, open-source, and BSD-licensed.
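To make the idea of a string similarity join concrete, here is a minimal pure-Python sketch of a Jaccard join over two small tables. It is not the package's API and makes no attempt at scalability: it compares every pair of rows. The function names (`tokenize`, `jaccard`, `naive_jaccard_join`), the whitespace tokenizer, and the sample data are all assumptions made for illustration.

```python
def tokenize(s):
    """Split a string on whitespace and return its set of lowercase tokens."""
    return set(s.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def naive_jaccard_join(left, right, threshold):
    """Return (left_id, right_id, score) for every pair scoring >= threshold.

    `left` and `right` are lists of (id, string) rows. This is a nested-loop
    join that examines every pair, so it is only suitable for tiny inputs.
    """
    matches = []
    for l_id, l_str in left:
        l_tokens = tokenize(l_str)
        for r_id, r_str in right:
            score = jaccard(l_tokens, tokenize(r_str))
            if score >= threshold:
                matches.append((l_id, r_id, score))
    return matches


# Hypothetical sample tables.
left = [(1, "University of Wisconsin Madison"), (2, "Stanford University")]
right = [
    (10, "Univ of Wisconsin Madison"),
    (11, "Stanford Univ"),
    (12, "University of Wisconsin Madison"),
]

pairs = naive_jaccard_join(left, right, threshold=0.5)
for l_id, r_id, score in pairs:
    print(l_id, r_id, round(score, 2))
```

With a threshold of 0.5, only the pairs (1, 10) and (1, 12) are returned. The nested loop costs one comparison per pair of rows, which is the cost a scalable join implementation aims to avoid.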

Important links

Dependencies

py_stringsimjoin has been tested on each Python version between 3.7 and 3.12, inclusive.

The required dependencies to build the package are pandas 0.16.0 or higher, py_stringmatching 0.2.1 or higher, joblib, pyprind, six and a C++ compiler. For the development version, you will also need Cython.
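As a quick sanity check before building, the following sketch reports which of the Python dependencies listed above can be imported in the current environment, and whether the running interpreter falls in the tested version range. It assumes each package's import name matches the name given above; the helper names are hypothetical, and the script checks neither minimum package versions nor the presence of a C++ compiler.

```python
import importlib.util
import sys

# Import names are assumed to match the package names listed above.
REQUIRED_MODULES = ["pandas", "py_stringmatching", "joblib", "pyprind", "six"]

# Inclusive range of Python versions the package is stated to be tested on.
TESTED_PYTHON = ((3, 7), (3, 12))


def check_dependencies(modules=REQUIRED_MODULES):
    """Map each module name to True if it can be found, False otherwise."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def python_in_tested_range(version=None):
    """Return True if the (major, minor) version lies in the tested range."""
    if version is None:
        version = sys.version_info
    low, high = TESTED_PYTHON
    return low <= tuple(version[:2]) <= high


status = check_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
print("Python version in tested range:", python_in_tested_range())
```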

Platforms

py_stringsimjoin has been tested on Linux, OS X and Windows.