Closed. carschno closed this issue 2 years ago.
Although Q-Grams are not currently built into PolyFuzz, you can install strsimpy with

    pip install strsimpy

and implement Q-Grams as a custom model in PolyFuzz:
import numpy as np
import pandas as pd
from polyfuzz import PolyFuzz
from polyfuzz.models import BaseMatcher
from strsimpy.qgram import QGram


class MyModel(BaseMatcher):
    def match(self, from_list, to_list, **kwargs):
        qgram = QGram()

        # Calculate the q-gram distance between every pair of strings
        matches = [[qgram.distance(from_string, to_string)
                    for to_string in to_list] for from_string in from_list]

        # Get best matches (smallest distance in each row)
        mappings = [to_list[index] for index in np.argmin(matches, axis=1)]
        scores = np.min(matches, axis=1)

        # Prepare dataframe; note that 'Similarity' holds raw distances here,
        # so lower values mean closer matches
        matches = pd.DataFrame({'From': from_list,
                                'To': mappings,
                                'Similarity': scores})
        return matches
Then, simply use the custom model as you would any other:
from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]
custom_matcher = MyModel()
model = PolyFuzz(custom_matcher).match(from_list, to_list)
model.get_matches()
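Note that the custom model above stores raw q-gram distances in the Similarity column, so lower values mean closer matches. If you would rather have a score between 0 and 1 where higher is better, one option is to normalise the distance by the total number of q-grams in both strings. The following is only a rough, dependency-free sketch of that idea: the function names `qgram_counts`, `qgram_similarity` and `best_matches` are hypothetical, and the normalisation is an assumption, not something PolyFuzz or strsimpy prescribes.

```python
from collections import Counter


def qgram_counts(text, q=3):
    """Count every overlapping substring of length q (hypothetical helper)."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_similarity(a, b, q=3):
    """Map the q-gram distance to [0, 1]; 1.0 means identical q-gram profiles.

    Assumption: the distance is divided by the total number of q-grams in
    both strings, which is the largest value the distance can reach.
    """
    counts_a, counts_b = qgram_counts(a, q), qgram_counts(b, q)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        # Both strings are shorter than q, so there is nothing to compare
        return 1.0 if a == b else 0.0
    distance = sum(abs(counts_a[g] - counts_b[g])
                   for g in counts_a.keys() | counts_b.keys())
    return 1.0 - distance / total


def best_matches(from_list, to_list, q=3):
    """Return (from, to, similarity) for the highest-scoring candidate."""
    rows = []
    for from_string in from_list:
        best = max(to_list, key=lambda t: qgram_similarity(from_string, t, q))
        rows.append((from_string, best, qgram_similarity(from_string, best, q)))
    return rows
```

Inside a custom matcher you would then pick the best match with `np.argmax` instead of `np.argmin` and store these scores in the Similarity column.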
Great tool! How about adding q-gram (character n-grams) distance? It is similar to edit distance, but might impose additional challenges regarding computational complexity. I have an index-based q-gram implementation here. However, an index is presumably not applicable in this scenario.
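To make the q-gram distance concrete, here is a minimal sketch assuming the common definition: split each string into overlapping substrings of length q, count them, and sum the absolute differences between the two count profiles. The names `qgram_profile` and `qgram_distance` are hypothetical and the code is not taken from any particular library; it compares one pair at a time without an index, which is the all-pairs setting raised in the question.

```python
from collections import Counter


def qgram_profile(text: str, q: int = 3) -> Counter:
    """Count every overlapping substring of length q in text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a: str, b: str, q: int = 3) -> int:
    """Sum of absolute count differences between the two q-gram profiles."""
    profile_a = qgram_profile(a, q)
    profile_b = qgram_profile(b, q)
    # A Counter returns 0 for q-grams that occur in only one of the strings
    return sum(abs(profile_a[gram] - profile_b[gram])
               for gram in profile_a.keys() | profile_b.keys())


print(qgram_distance("apple", "apples"))  # "les" is the only unshared trigram
print(qgram_distance("house", "mouse"))   # "hou" and "mou" differ
```

Each comparison is linear in the lengths of the two strings, so matching every entry of one list against every entry of another grows with the product of the list sizes.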