MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License
736 stars, 67 forks

Separate/joined spellings and synonyms aren't recognized #31

Open e-orlov opened 2 years ago

e-orlov commented 2 years ago

Keywords "trinkwasser test", "trinkwassertest" and "analyse trinkwasser" aren't clustered at all.

MaartenGr commented 2 years ago

Which version of PolyFuzz are you using? Also, could you create a reproducible example? Since PolyFuzz can use many models, without any code it is difficult to see what is happening in your use case.

e-orlov commented 2 years ago

I'm using TF-IDF, as implemented in https://share.streamlit.io/charlywargnier/keyword-clustering-app/main/app.py / https://github.com/searchsolved/search-solved-public-seo/blob/main/Keyword_Clustering_Tool/Keyword_Clustering_Tool_V2.ipynb (code block 12)

Keywords are here: https://docs.google.com/spreadsheets/d/1nkiFNO8JadbaFcL7BvYKCLNPYPB5ILJwk2K__2DOzdc/edit?usp=sharing

Maybe PolyFuzz is not the right tool for this. To put "trinkwasser test" and "trinkwassertest" into the same cluster, the words of each keyword would have to be permuted and the minimal Levenshtein distance between the permutations taken. But for "trinkwasser test" and "analyse trinkwasser" there would probably need to be a "real" synonym search, maybe even one based on a synonym vocabulary...
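The permutation idea described above can be sketched in plain Python. This is only an illustration, not part of PolyFuzz or of the clustering tool: the helper names `levenshtein` and `min_permutation_distance` are hypothetical, and the sketch assumes keywords are short, since the number of word orders grows factorially with the number of words.

```python
from itertools import permutations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def min_permutation_distance(a: str, b: str) -> int:
    """Smallest edit distance over all word orders of both keywords.

    Words are joined without spaces, so "trinkwasser test" and
    "trinkwassertest" can end up as the same string.
    """
    forms_a = {"".join(p) for p in permutations(a.split())}
    forms_b = {"".join(p) for p in permutations(b.split())}
    return min(levenshtein(x, y) for x in forms_a for y in forms_b)


joined = min_permutation_distance("trinkwasser test", "trinkwassertest")
synonym = min_permutation_distance("trinkwasser test", "analyse trinkwasser")
print(joined, synonym)
```

With this measure the separate and joined spellings have distance 0, while "analyse trinkwasser" stays at a non-zero distance, which matches the point that the synonym case needs something other than string distance.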

MaartenGr commented 2 years ago

Let me start by saying that I cannot give much support for that tool specifically, as I did not create it. Having said that, I tried it with PolyFuzz directly, and it seems that "trinkwasser test" gets grouped with "trinkwassertest" but not with "analyse trinkwasser". Most likely, with TF-IDF they are simply not similar enough to each other. TF-IDF here tries to mirror Levenshtein distance, so you can try to circumvent this issue by using a different technique.
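A rough way to see why the third keyword falls out is to compute TF-IDF cosine similarity by hand. The sketch below is not PolyFuzz's implementation; it assumes character trigrams and a smoothed IDF, and the helper names `char_ngrams`, `tfidf_vectors`, and `cosine` are made up for illustration.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def tfidf_vectors(docs, n: int = 3):
    """Build sparse TF-IDF vectors (dicts) over character n-grams."""
    counts = [char_ngrams(doc, n) for doc in docs]
    # Document frequency: in how many docs each n-gram appears.
    doc_freq = Counter(gram for count in counts for gram in count)
    total = len(docs)
    vectors = []
    for count in counts:
        vectors.append({
            # Smoothed IDF, an assumption chosen for this sketch.
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in count.items()
        })
    return vectors


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


keywords = ["trinkwasser test", "trinkwassertest", "analyse trinkwasser"]
vecs = tfidf_vectors(keywords)
sim_joined = cosine(vecs[0], vecs[1])
sim_synonym = cosine(vecs[0], vecs[2])
print(f"test vs. joined spelling: {sim_joined:.2f}")
print(f"test vs. analyse:         {sim_synonym:.2f}")
```

Under these assumptions the first pair scores clearly higher than the second: "test" and "analyse" share no trigrams, so only the common "trinkwasser" part contributes to the similarity.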

You can implement or use any distance measure you like in PolyFuzz. However, if you are looking for semantic similarity rather than string similarity, then I would advise going for embedding-based methods such as BERT models, sentence-transformers, Hugging Face, or Flair.

You can find more information about that here and here.