Question about version string_grouper group_similar_strings

Bergvca / string_grouper

Super Fast String Matching in Python

MIT License


Question about version string_grouper group_similar_strings #80

Open dariswan opened 2 years ago

dariswan commented 2 years ago

Dear developer,

Could you give me an explanation of the differences between the versions of string_grouper? I only use one function, group_similar_strings. I am currently using version 0.1.1, but the latest version is now 0.6.1.

This library is very helpful, but when I use group_similar_strings with a custom similarity threshold, the result sometimes misses groupings that look obvious to the human eye. Is it worth upgrading to the latest version? What has improved?

ParticularMiner commented 2 years ago

Hi @dariswan

The latest version is supposed to be much faster than older versions as your dataset-size increases. I would be interested to see how group_similar_strings failed. If possible, could you send me a code/data sample that reproduces the failure?

Thanks.

dariswan commented 2 years ago

Hi @ParticularMiner

There are no failures in group_similar_strings, but judged by eye it sometimes gives inaccurate results for individual strings. In my case, I tried to group similar emails with the default similarity (80%), for example:

messi1@gmail.com --> group_1
messi12@gmail.com --> group_2
messi21@gmail.com --> group_3

Those 3 emails are supposed to be in one group, judged by eye.

ParticularMiner commented 2 years ago

Hi @dariswan

For such a small set of strings the default similarity threshold (80%) is too high. Try a lower value, such as 64%:

import pandas as pd
from string_grouper import group_similar_strings

emails = pd.Series(['messi1@gmail.com', 'messi12@gmail.com', 'messi21@gmail.com'])
email_df = emails.to_frame()
# Lower the threshold from the default 0.8 so the near-duplicates fall into one group
email_df[['group_id', 'group_rep']] = group_similar_strings(emails, min_similarity=0.64)
email_df

	0	group_rep
0	messi1@gmail.com	messi1@gmail.com
1	messi12@gmail.com	messi1@gmail.com
2	messi21@gmail.com	messi1@gmail.com
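
To see why short strings are so sensitive to the threshold, here is a rough, standard-library-only sketch. It is not string_grouper's implementation: the README describes the matching as cosine similarity over TF-IDF-weighted character n-grams, whereas this sketch uses raw trigram counts with no IDF weighting and no preprocessing, so the numbers will differ from what the library reports. The helper names char_ngrams and cosine_similarity are made up for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def cosine_similarity(a, b, n=3):
    """Cosine similarity of raw n-gram count vectors.

    Simplification: no TF-IDF weighting and no regex clean-up,
    so this only approximates the idea behind the library's score.
    """
    va, vb = Counter(char_ngrams(a, n)), Counter(char_ngrams(b, n))
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


emails = ['messi1@gmail.com', 'messi12@gmail.com', 'messi21@gmail.com']

# Print the similarity of every pair of emails
for left, right in combinations(emails, 2):
    print(f'{left} vs {right}: {cosine_similarity(left, right):.3f}')
```

A one-character change alters up to three trigrams, which is a large share of a 16-character string, so the pairwise scores drop quickly and a threshold near 0.8 can leave each address in its own group.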

dariswan commented 2 years ago

Hi @ParticularMiner

Yes, I agree with you. My threshold is now 70% and it gives much better results. So, back to the main question: does this module cluster strings the same way in 0.6.1 as in 0.3.2 (I upgraded a little bit)?

Thank you for the answer