MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

Text normalization and save/load usage #34

Closed fourat-bs closed 2 years ago

fourat-bs commented 2 years ago

Is there any way that PolyFuzz can be used to normalize sequences? For example: fit the model on a list of normalized sequences > save the model > load the model > normalize new data.

from polyfuzz import PolyFuzz

model = PolyFuzz("TF-IDF").fit(normalized_sequences_list)
model.save('/path/to/model')
model = PolyFuzz.load('/path/to/model')
normalized_data = model.transform(unnormalized_data)
MaartenGr commented 2 years ago

Normalization is currently not implemented in PolyFuzz as it might be a bit outside of the string matching nature of the package. Having said that, I am not entirely sure I understand how you would want to use TF-IDF to normalize lists of strings. In this example, what would a normalized list of strings look like to you?

fourat-bs commented 2 years ago

I used TF-IDF as an example, but you can use any model instead to obtain vector representations of strings. Here is an example of a normalized list of strings: ['senior software developer', 'automation engineer', 'software architect']. Later, if I want to normalize the list ['experienced software developer', 'enterprise software architect'], the model would hopefully output ['senior software developer', 'software architect'], which are the closest ones.

MaartenGr commented 2 years ago

Ah, in that case, I would use PolyFuzz as is, making sure you always supply the same to_list, as in the following example:

from polyfuzz import PolyFuzz

from_list = ['experienced software developer', 'enterprise software architect']
to_list = ['senior software developer', 'automation engineer', 'software architect']

model = PolyFuzz("TF-IDF")
model.match(from_list, to_list)

Using the above example, you can decide what to put in the to_list in order to standardize the input list (from_list).
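If you then call get_matches, you get back a dataframe mapping each string in from_list to its closest match in to_list (the similarity values below are only illustrative):

matches = model.get_matches()
# From                            To                          Similarity
# experienced software developer  senior software developer   0.88
# enterprise software architect   software architect          0.84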

fourat-bs commented 2 years ago

This is exactly what I am doing, but the to_list I am using is large (around 20k strings), and recomputing the vectors for the same to_list on every call slows the program down. After rereading the docs, I think this is outside the scope of this project.

MaartenGr commented 2 years ago

Ah yes, that is indeed quite troublesome! Yes, that is not possible with how PolyFuzz was initially designed, which was for exploring matches. Having said that, it might be interesting to add this as a feature at some point. The main difficulty lies in tracking the TF-IDF vectorizer and the embeddings created by the language models so that they can be reused in .transform. I'll keep it in mind!
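In the meantime, a rough workaround outside of PolyFuzz would be to cache the to_list vectors yourself with scikit-learn (a sketch; the character-trigram settings only approximate PolyFuzz's TF-IDF matcher):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The ~20k reference strings; vectorized once and then reused
to_list = ['senior software developer', 'automation engineer', 'software architect']
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
to_vectors = vectorizer.fit_transform(to_list)

def normalize(from_list):
    # Only the new strings get vectorized; to_vectors is reused as-is
    from_vectors = vectorizer.transform(from_list)
    similarity = cosine_similarity(from_vectors, to_vectors)
    return [to_list[i] for i in similarity.argmax(axis=1)]

print(normalize(['experienced software developer', 'enterprise software architect']))
# expected: ['senior software developer', 'software architect'] (depends on the data)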

fourat-bs commented 2 years ago

Thank you for your time and consideration. PolyFuzz is awesome!

cgpeltier commented 1 year ago

Hi👋 just wanted to note that I'd also be interested in this! We also have a large corpus (>100k labels), so we need to pre-compute embeddings for our to_list to be performant in production.

At the moment we have a custom solution (implemented as a spaCy custom component) where we use sentence transformers to create the embeddings, but we'd switch to PolyFuzz and take advantage of its other features if we could create the embeddings ahead of time. Apologies if this is already implemented and I just missed it in the documentation.
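For reference, the gist of what we do now looks roughly like this (a sketch; the model name and file path are just examples):

from sentence_transformers import SentenceTransformer
import numpy as np

# Encode the large label set once up front and persist the embeddings
labels = ['senior software developer', 'automation engineer', 'software architect']
encoder = SentenceTransformer('all-MiniLM-L6-v2')
label_embeddings = encoder.encode(labels)
np.save('label_embeddings.npy', label_embeddings)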

MaartenGr commented 1 year ago

@cgpeltier When you use fit in PolyFuzz, it will keep track of the embeddings generated for the to_list, so that the model is easier to use in production. For example, when running the following:

from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model).fit(from_list, to_list)

You can find the embeddings of to_list in model.method.embeddings_to. Those would represent the embeddings of your >100k labels. When you then use transform, it will not re-calculate the embeddings for those >100k labels in to_list, since it has already computed them.
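In production, the flow then roughly becomes (the new strings below are just an example):

new_strings = ["pear", "appel"]
results = model.transform(new_strings)  # matched against the fitted to_list; its embeddings are not recomputed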

If you already have sentence-transformers embeddings stored somewhere, you would only need to do the following:

from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model).fit(from_list, to_list)
model.method.embeddings_to = my_stored_sentence_transformers_embeddings  # replace the embeddings computed during fit
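Do note that those stored embeddings need to line up one-to-one with the strings in to_list. Assuming the save/load usage from the opening post, the fitted model should then also survive a roundtrip (a rough sketch):

model.save("/path/to/model")
model = PolyFuzz.load("/path/to/model")
results = model.transform(["appels"])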