MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License
725 stars, 68 forks

[Feature Request] Q-Gram distance metric #37

Closed (carschno closed this issue 2 years ago)

carschno commented 2 years ago

Great tool! How about adding q-gram (character n-gram) distance? It is similar to edit distance, but might pose additional challenges regarding computational complexity. I have an index-based q-gram implementation here. However, an index is presumably not applicable in this scenario.
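For readers unfamiliar with the metric: q-gram distance is commonly defined as the sum of absolute differences between the q-gram counts of two strings. The following is a minimal sketch of that definition, assuming q = 3 and no padding at the string boundaries. The function names are hypothetical and are not part of PolyFuzz or any other library.

```python
from collections import Counter


def qgram_profile(text, q=3):
    """Count every overlapping substring of length q in text."""
    # Strings shorter than q yield an empty profile
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a, b, q=3):
    """Sum of absolute differences between the two q-gram count profiles."""
    profile_a = qgram_profile(a, q)
    profile_b = qgram_profile(b, q)
    # A Counter returns 0 for q-grams it has not seen
    return sum(abs(profile_a[gram] - profile_b[gram])
               for gram in profile_a.keys() | profile_b.keys())


print(qgram_distance("house", "mouse"))  # 2
```

With q = 3, "house" and "mouse" share two of their three trigrams ("ous" and "use"), so only "hou" and "mou" contribute, giving a distance of 2. This brute-force version compares one pair at a time, so matching two lists costs one call per pair.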

MaartenGr commented 2 years ago

Although q-gram distance is not currently built into PolyFuzz, you can install strsimpy (pip install strsimpy) and implement it as a custom model in PolyFuzz:

import numpy as np
import pandas as pd

from polyfuzz import PolyFuzz
from polyfuzz.models import BaseMatcher
from strsimpy.qgram import QGram

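class MyModel(BaseMatcher):
    def match(self, from_list, to_list, **kwargs):
        # QGram() uses strsimpy's default q-gram length
        qgram = QGram()

        # Calculate pairwise distances (rows: from_list, columns: to_list)
        distances = [[qgram.distance(from_string, to_string)
                      for to_string in to_list] for from_string in from_list]

        # Get best matches: the to_list entry with the smallest distance per row
        mappings = [to_list[index] for index in np.argmin(distances, axis=1)]
        scores = np.min(distances, axis=1)

        # Prepare dataframe
        # Note: 'Similarity' holds the q-gram distance here, so lower means closer
        matches = pd.DataFrame({'From': from_list,
                                'To': mappings,
                                'Similarity': scores})
        return matches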

Then, simply use the custom model as you would any other:

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

custom_matcher = MyModel()

model = PolyFuzz(custom_matcher).match(from_list, to_list)
model.get_matches()
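
Note that the Similarity column returned by the model above contains raw q-gram distances, where 0 is a perfect match and larger values are worse. If a score between 0 and 1 is preferred, one option is to normalize by the total number of q-grams in both strings. This is only a sketch under stated assumptions: the helper name qgram_similarity is hypothetical, the normalization is one possible choice rather than anything PolyFuzz or strsimpy prescribes, and k must equal the q-gram length used to compute the distance (assumed to be 3 here).

```python
def qgram_similarity(distance, from_string, to_string, k=3):
    """Map a q-gram distance onto [0, 1], where 1 means identical profiles."""
    # Number of overlapping k-grams in each string (0 if shorter than k);
    # their sum is the largest distance two strings of these lengths can have
    total = max(len(from_string) - k + 1, 0) + max(len(to_string) - k + 1, 0)
    if total == 0:
        # Neither string has a k-gram; fall back to exact comparison
        return 1.0 if from_string == to_string else 0.0
    return 1.0 - distance / total
```

Inside the custom model, this could be applied to each best-match distance before building the dataframe, so that higher values in the Similarity column mean closer matches.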