MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

group didn't work with embeddings #12

Status: Closed (ronaldexim closed 3 years ago)

ronaldexim commented 3 years ago

With TF-IDF, grouping works fine:

from polyfuzz import PolyFuzz
from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
model = PolyFuzz("TF-IDF").match(from_list, from_list)
model.group(link_min_similarity=0.75, group_all_strings=True)
print(model.get_clusters())

{1: ['apples', 'apple', 'appl']}

But with embeddings, the matches look right and yet no clusters are returned:

from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import TransformerWordEmbeddings
embeddings = TransformerWordEmbeddings('distilbert-base-cased') 
bert = Embeddings([embeddings], min_similarity=0, cosine_method='sklearn', model_id="BERT")
model = PolyFuzz(bert)
from_list = ['respiratory', 'pulmonary', 'lung', 'disease', 'infection', 'illness']
model.match(from_list, from_list)
model.group(link_min_similarity=0.75, group_all_strings=True)
model.get_matches()
From         To           Similarity  Group
respiratory  pulmonary    0.959468    pulmonary
pulmonary    respiratory  0.959468    respiratory
lung         respiratory  0.915398    respiratory
disease      infection    0.935804    infection
infection    disease      0.935804    disease
illness      disease      0.935370    disease
print(model.get_clusters())

{}

The same happens with bert-base-multilingual-cased. PolyFuzz was installed with pip install polyfuzz[flair].

MaartenGr commented 3 years ago

The results are actually a feature and not a bug :sweat_smile:

Under the hood, group uses a TF-IDF model by default, not the embeddings you passed to PolyFuzz. To group the results with your embeddings, pass your BERT model to group:

model.group(bert, link_min_similarity=0.75, group_all_strings=True)

And that's it! TF-IDF is the default because embedding and edit-distance techniques often take quite a while, and I figured most users would want a fast group function out of the box. You can pass any other model to group if you prefer.
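To see why a TF-IDF grouper returns no clusters for the second word list, here is a minimal sketch in plain Python. It is not PolyFuzz's implementation: it assumes character trigrams and uses raw counts with no IDF weighting, and the helper names (char_ngrams, cosine, ngram_similarity) are made up for illustration. Words that are related in meaning but share no trigrams score 0, well below link_min_similarity=0.75, so nothing gets linked.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n=3):
    """Count the character n-grams of `word` (hypothetical helper)."""
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def ngram_similarity(left, right, n=3):
    """Simplified stand-in for a character n-gram TF-IDF score (no IDF)."""
    return cosine(char_ngrams(left, n), char_ngrams(right, n))


# Lexically close strings share most of their trigrams.
print(ngram_similarity("apple", "apples"))          # about 0.87

# Semantically close but lexically different strings share none.
print(ngram_similarity("respiratory", "pulmonary"))  # 0.0
print(ngram_similarity("disease", "infection"))      # 0.0
```

In this simplified scoring, "apple" and "apples" clear the 0.75 threshold, while the medical terms do not, which matches the empty result from get_clusters() when the default grouper is used.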

ronaldexim commented 3 years ago

Thank you, it works great now. I had tried passing the model_id string "BERT", but group needs the polyfuzz.models.Embeddings instance itself (bert). Awesome library :clap: