MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

add fuzz transformer #25

Open shahrukhx01 opened 3 years ago

shahrukhx01 commented 3 years ago

Hi @MaartenGr, I have fine-tuned a transformer for character-level similarity to do fuzzy matching. You can read about how I did it here:
LinkedIn post explanation: https://www.linkedin.com/feed/update/urn:li:activity:6819456033992253440/
Model on the Hugging Face Hub: https://huggingface.co/shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher

Would you like me to create a pull request if it fits PolyFuzz?

Thanks, Shahrukh

MaartenGr commented 3 years ago

You already can! PolyFuzz supports Flair, which in turn supports sentence-transformers, on which your model is based. If you run the following code, you can use the model:

from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import SentenceTransformerDocumentEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

embedding = SentenceTransformerDocumentEmbeddings('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
matcher = Embeddings(embedding, min_similarity=0)
model = PolyFuzz(matcher).match(from_list, to_list)

shahrukhx01 commented 3 years ago

Thanks for your response. I was able to run the model, but it produces substandard results compared to my original implementation. The reason is that, in my implementation, I split the input string into characters before tokenization, which really helps the model optimize for the distance objective. For instance, "hello" would be preprocessed as "h e l l o". Please let me know how to proceed with this. Also, would you like me to document this model in the README? Please see the results below as well.

MaartenGr commented 3 years ago

Hmmm, in that case, would it not be a matter of preprocessing the words before passing them to PolyFuzz? Something like this:

from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import SentenceTransformerDocumentEmbeddings

embedding = SentenceTransformerDocumentEmbeddings('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

# Split each word into space-separated characters, e.g. "apple" -> "a p p l e"
from_list = [" ".join(word) for word in from_list]
to_list = [" ".join(word) for word in to_list]

matcher = Embeddings(embedding, min_similarity=0)
model = PolyFuzz(matcher).match(from_list, to_list)

Then you would only need to transform them back into words. I am a bit hesitant to add support for a specific model that I currently have no benchmark for. Do you have a paper related to this model?
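
The "transform them back into words" step could be sketched roughly as follows. This is only an illustration: the helper names `space_out`, `build_lookup`, and `restore_pairs` are hypothetical and not part of PolyFuzz, and the matched pairs are hard-coded stand-ins rather than real model output. It keeps a lookup from each spaced-out string to its original word instead of stripping spaces, so that inputs which already contain spaces are restored correctly.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def space_out(word: str) -> str:
    """Insert a space between every character: "hello" -> "h e l l o"."""
    return " ".join(word)


def build_lookup(words: Iterable[str]) -> Dict[str, str]:
    """Map each spaced-out string back to the word it came from."""
    return {space_out(word): word for word in words}


def restore_pairs(
    pairs: Iterable[Tuple[str, Optional[str]]],
    from_lookup: Dict[str, str],
    to_lookup: Dict[str, str],
) -> List[Tuple[str, Optional[str]]]:
    """Replace spaced-out strings in (from, to) pairs with the original words.

    A missing match (None) on the "to" side is passed through unchanged.
    """
    return [(from_lookup[src], to_lookup.get(dst)) for src, dst in pairs]


from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

from_lookup = build_lookup(from_list)
to_lookup = build_lookup(to_list)

# These spaced-out lists are what would be passed to the matcher.
spaced_from = list(from_lookup)
spaced_to = list(to_lookup)

# Stand-in for matcher output (assumption: not produced by a real model).
matched = [
    (space_out("appl"), space_out("apple")),
    (space_out("house"), space_out("mouse")),
]
restored = restore_pairs(matched, from_lookup, to_lookup)
```

With PolyFuzz itself, the same lookups could presumably be applied to the "From" and "To" columns of the DataFrame returned by `get_matches()`.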

shahrukhx01 commented 3 years ago

@MaartenGr I plan to write an arXiv paper on this, but it could take some time. In the meantime, would you be okay with me doing a direct comparative analysis of this model against the BERT-based embedding model in PolyFuzz? I already have a dataset for fuzzy-matching benchmarking.

MaartenGr commented 3 years ago

The issue with the dataset that you shared is that the generated values are not ground truth, since they are computed with Levenshtein. A model that focuses on char-level embeddings is therefore likely to outperform a model that does not, regardless of its actual accuracy. It would be nice if you could test on a dataset that is often used for string-matching research.
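
To illustrate why Levenshtein-derived labels reward character-level models, here is a minimal sketch of how such a similarity score might be computed. The exact scoring used for the shared dataset is not shown in this thread, so the normalization by the longer string's length is an assumption, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# "house" and "mouse" differ by a single character, so a label derived this
# way rates them as highly similar even though they are unrelated words.
score = levenshtein_similarity("house", "mouse")
```

A label of this kind depends only on character overlap, which is exactly the signal a char-level model is trained to reproduce.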

shahrukhx01 commented 3 years ago

Could you point me to a dataset that could be used here? Also, is there any chance we could collaborate on writing something formal (an arXiv paper or similar) about different neural approaches to string matching?

MaartenGr commented 3 years ago

Apologies for the late response. I believe it would take several datasets and evaluation measures to thoroughly validate the model that you created. Although I would be interested in collaborating, I am afraid I currently do not have the time to write an extensive paper on the subject.

shahrukhx01 commented 3 years ago

That won't be a problem. I'm willing to do the write-ups and experimentation, since I will be on summer break from school. It would be really great if you could help with ideas and review what I do. Please let me know if that's possible for you :)

MaartenGr commented 3 years ago

I cannot make any promises, but perhaps I can make some time to review ideas and experiments. It would be interesting to have a nice overview of string-similarity algorithms.