Bergvca / string_grouper

Super Fast String Matching in Python
MIT License

match_string on small data series #65

Closed by berndnoll 3 years ago

berndnoll commented 3 years ago

Hi, I was just curious about what happens when I run this piece of code. I came across this when I split my data into smaller chunks.

Code:

import pandas as pd
from string_grouper import match_strings

accounts = pd.DataFrame()
accounts['name'] = ['Jim Beam', 'Jim Boom', 'Jack Daniels', 'John Dummel', 'Bob Bubble', 'Seth Suckerman']

matches = match_strings(accounts['name'])
print(matches)

Output:

   left_index       left_name  similarity      right_name  right_index
0           0        Jim Beam         1.0        Jim Beam            0
1           1        Jim Boom         1.0        Jim Boom            1
2           2    Jack Daniels         1.0    Jack Daniels            2
3           3     John Dummel         1.0     John Dummel            3
4           4      Bob Bubble         1.0      Bob Bubble            4
5           5  Seth Suckerman         1.0  Seth Suckerman            5

Am I doing something wrong here? I hope this is not too dumb of a question, I am new to py and pandas.

Thank you for looking into this.

ParticularMiner commented 3 years ago

Hi @berndnoll

It doesn't look like you are doing anything wrong. Were you expecting a different result?

berndnoll commented 3 years ago

Ha! I was expecting a different result, but I got "tricked" by min_similarity. I was expecting a higher match score for the first two names (Jim Beam and Jim Boom). Once you lower the threshold, it turns out their score is very low.

matches = match_strings(accounts['name'],min_similarity=0.3)

   left_index       left_name  similarity      right_name  right_index
0           0        Jim Beam    1.000000        Jim Beam            0
1           0        Jim Beam    0.309527        Jim Boom            1
2           1        Jim Boom    0.309527        Jim Beam            0
3           1        Jim Boom    1.000000        Jim Boom            1
4           2    Jack Daniels    1.000000    Jack Daniels            2
5           3     John Dummel    1.000000     John Dummel            3
6           4      Bob Bubble    1.000000      Bob Bubble            4
7           5  Seth Suckerman    1.000000  Seth Suckerman            5

Sorry for bothering you with this and thanks again for your awesome support.
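
For anyone landing on this thread later, here is a rough sketch of why two names that look alike can still score low. As far as I understand it, string_grouper compares strings by the cosine similarity of TF-IDF-weighted character n-grams (trigrams by default) and drops pairs below min_similarity, which I believe defaults to 0.7. The code below is not the library's implementation: it uses plain trigram counts with no TF-IDF weighting, and the helper names (`char_ngrams`, `cosine_similarity`) and the `threshold` variable are made up for illustration, so the numbers will not match the 0.309527 above.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase, drop whitespace, and count overlapping character n-grams."""
    cleaned = re.sub(r"\s", "", text.lower())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a, b, n=3):
    """Cosine similarity between the raw n-gram count vectors of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    # Only n-grams present in both strings contribute to the dot product.
    dot = sum(count * grams_b[gram] for gram, count in grams_a.items())
    norm_a = math.sqrt(sum(c * c for c in grams_a.values()))
    norm_b = math.sqrt(sum(c * c for c in grams_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


names = ['Jim Beam', 'Jim Boom', 'Jack Daniels', 'John Dummel', 'Bob Bubble', 'Seth Suckerman']
threshold = 0.3  # hypothetical cutoff, playing the role of min_similarity

# Compare every distinct pair and report those at or above the cutoff.
for i, left in enumerate(names):
    for right in names[i + 1:]:
        score = cosine_similarity(left, right)
        if score >= threshold:
            print(f"{left!r} vs {right!r}: {score:.2f}")
```

In this simplified version "Jim Beam" and "Jim Boom" share only two of their five trigrams each ("jim" and "imb"), so they score 0.40, well under 0.7, and no other pair in the list shares a trigram at all. That is consistent with the default run returning only the self-matches.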