Open dariswan opened 2 years ago
Hi @dariswan
The latest version is supposed to be much faster than older versions as your dataset size increases. I would be interested to see how group_similar_strings failed. If possible, could you send me a code/data sample that reproduces the failure?
Thanks.
Hi @ParticularMiner
There are no failures in group_similar_strings as such, but to the human eye it sometimes gives inaccurate results for an individual term. In my case, I tried to group similar emails with the default similarity (80%), for example
To the human eye, those 3 emails should be in one group.
Hi @dariswan
For such a small set of strings the default similarity threshold (80%) is too high. Try a lower value, for example 0.64:
import pandas as pd
from string_grouper import group_similar_strings
emails = pd.Series(['messi1@gmail.com', 'messi12@gmail.com', 'messi21@gmail.com'])
email_df = emails.to_frame()
email_df[['group_id', 'group_rep']] = group_similar_strings(emails, min_similarity=0.64)
email_df
| | 0 | group_id | group_rep |
|---|---|---|---|
| 0 | messi1@gmail.com | 0 | messi1@gmail.com |
| 1 | messi12@gmail.com | 0 | messi1@gmail.com |
| 2 | messi21@gmail.com | 0 | messi1@gmail.com |
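To see why these three addresses fall below the 80% default, here is a rough, standard-library-only sketch of character-trigram TF-IDF cosine similarity, the general technique string_grouper is built on. It is not the library's code: the helper names (`trigrams`, `tfidf_vectors`, `cosine`) are made up for illustration, the smoothed IDF formula is an assumption borrowed from scikit-learn's default, and no characters are stripped before the trigrams are built, so the exact scores will differ from what string_grouper reports.

```python
import math
from collections import Counter


def trigrams(s, n=3):
    """Return the overlapping character n-grams of s (hypothetical helper)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build L2-normalised TF-IDF vectors over character n-grams.

    Assumption: smoothed IDF, idf = ln((1 + N) / (1 + df)) + 1,
    which mirrors scikit-learn's default and may not match string_grouper exactly.
    """
    counts = [Counter(trigrams(s, n)) for s in strings]
    # Document frequency: in how many strings each n-gram occurs.
    df = Counter(gram for c in counts for gram in c)
    total = len(strings)
    vectors = []
    for c in counts:
        vec = {
            gram: tf * (math.log((1 + total) / (1 + df[gram])) + 1)
            for gram, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * v.get(gram, 0.0) for gram, w in u.items())


emails = ['messi1@gmail.com', 'messi12@gmail.com', 'messi21@gmail.com']
vecs = tfidf_vectors(emails)
for i in range(len(emails)):
    for j in range(i + 1, len(emails)):
        print(f"{emails[i]} vs {emails[j]}: {cosine(vecs[i], vecs[j]):.3f}")
```

Under these assumptions the two pairs involving messi1@gmail.com score about 0.66, and messi12 vs messi21 scores about 0.52. Trigrams shared by every string (such as those from `@gmail.com`) carry the lowest weight, while the few trigrams that differ carry the highest, so short, near-identical strings in a tiny set end up well under 0.8.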
Hi @ParticularMiner
Yes, I agree with you. My threshold is now 70% and the results are much better. So, back to the main question: does this module still cluster strings the same way in 0.6.1 as in 0.3.2? (I upgraded a little bit.)
Thank you for the answer
Dear developer,
Could you give me an explanation of the differences between the versions of string_grouper? I only use one function, "group_similar_strings". I am currently using version 0.1.1, but the latest version is now 0.6.1.
This library is great and very helpful, but when I use group_similar_strings with a custom similarity, the result sometimes misses groupings that look obvious to the human eye. Is it worth upgrading to the latest version? What are the improvements?