MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

v0.4.0 #35

Closed MaartenGr closed 2 years ago

MaartenGr commented 2 years ago

SentenceTransformers, Gensim, USE, and Spacy

SentenceTransformers

from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings

# "all-MiniLM-L6-v2" is a pretrained sentence-transformers model
distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model)

Gensim

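from polyfuzz import PolyFuzz
from polyfuzz.models import GensimEmbeddings

# "glove-twitter-25" is a pretrained GloVe model from the Gensim downloader
distance_model = GensimEmbeddings("glove-twitter-25")
model = PolyFuzz(distance_model)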

USE

from polyfuzz import PolyFuzz
from polyfuzz.models import USEEmbeddings

# Universal Sentence Encoder, loaded from TensorFlow Hub
distance_model = USEEmbeddings("https://tfhub.dev/google/universal-sentence-encoder/4")
model = PolyFuzz(distance_model)

Spacy

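from polyfuzz import PolyFuzz
from polyfuzz.models import SpacyEmbeddings

# "en_core_web_md" is a spaCy pipeline that must be downloaded beforehand
distance_model = SpacyEmbeddings("en_core_web_md")
model = PolyFuzz(distance_model)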

fit, transform, fit_transform

Add fit, transform, and fit_transform in order to use PolyFuzz in production (#34)

from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit: learn the TF-IDF representation of the words to match against
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Transform: match new words to the words seen during fit
results = model.transform(unseen_words)

In the code above, we fit our TF-IDF model on train_words and then call .transform() to match each word in unseen_words to the closest word that the model saw in train_words.
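To make the fit/transform split more concrete, here is a minimal standard-library sketch of the idea. It is not how PolyFuzz implements TF-IDF matching: `ToyMatcher`, `char_ngrams`, and `cosine` are hypothetical names made up for this illustration, and the scoring is plain cosine similarity over character trigram counts rather than true TF-IDF weighting.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n=3):
    """Count the character n-grams of a word, padded with one space per side."""
    padded = f" {word} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyMatcher:
    """Hypothetical stand-in for a fitted matcher; not part of PolyFuzz."""

    def fit(self, to_list):
        # Vectorize the reference words once and keep them for later calls.
        self.vectors_ = {word: char_ngrams(word) for word in to_list}
        return self

    def transform(self, from_list):
        # Match each incoming word to its most similar stored reference word.
        results = []
        for word in from_list:
            vec = char_ngrams(word)
            best = max(self.vectors_, key=lambda ref: cosine(vec, self.vectors_[ref]))
            score = round(cosine(vec, self.vectors_[best]), 3)
            results.append((word, best, score))
        return results

    def fit_transform(self, words):
        # Fit on the words and immediately match them against themselves.
        return self.fit(words).transform(words)


train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

matcher = ToyMatcher().fit(train_words)
matches = matcher.transform(unseen_words)
for source, target, score in matches:
    print(source, "->", target, score)
```

The point of the split is that the reference words are vectorized only once in fit; transform can then be called repeatedly on new input without touching the training list again, which is what makes the pattern usable in production.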

After fitting our model, we can save it as follows:

model.save("my_model")

Then, we can load our model to be used elsewhere:

from polyfuzz import PolyFuzz

model = PolyFuzz.load("my_model")
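The fit once, save, and reload elsewhere pattern can be sketched with the standard library alone. This says nothing about how PolyFuzz serializes its models: `ToyModel` and its JSON file layout are assumptions made up for this illustration, chosen only to show that a reloaded model should behave like the one that was saved.

```python
import json
import os
import tempfile


class ToyModel:
    """Hypothetical model whose fitted state is just the reference word list."""

    def fit(self, to_list):
        self.to_list_ = list(to_list)
        return self

    def transform(self, from_list):
        # Exact-match lookup: enough behavior to check the reloaded model.
        known = set(self.to_list_)
        return {word: word in known for word in from_list}

    def save(self, path):
        # Persist only the fitted state, not the object itself.
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"to_list": self.to_list_}, f)

    @classmethod
    def load(cls, path):
        # Rebuild a fitted model from the stored state.
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return cls().fit(state["to_list"])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_model")
    ToyModel().fit(["apple", "apples", "appl"]).save(path)
    restored = ToyModel.load(path)

result = restored.transform(["apple", "mouse"])
print(result)
```

With the real library, the equivalent check would be to call transform on the loaded model with the same input and compare against the results from the original model.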