Bergvca / string_grouper

Super Fast String Matching in Python
MIT License

update prior master-dupe pairings #12

Closed taimursajid closed 4 years ago

taimursajid commented 4 years ago

With this change, when a new match is added, prior matches made to the master and dupe strings are also added to _matches_list, so that the new match flows through to them.

The following toy example demonstrates the issue:

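import pandas as pd
from string_grouper import StringGrouper

sample = [
    'microsoftoffice 365 home',
    'microsoftoffice 365 pers',
    'microsoft office'
    ]

df = pd.DataFrame(sample, columns=['name'])

sg = StringGrouper(df['name'])
sg = sg.fit()

# First manual match: master 'microsoft office', dupe 'microsoftoffice 365 home'
sg = sg.add_match('microsoft office','microsoftoffice 365 home')
# Second manual match: master 'microsoftoffice 365 pers', dupe 'microsoft office'
sg = sg.add_match('microsoftoffice 365 pers','microsoft office')
df['deduped'] = sg.get_groups()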

The existing code will not add a master-dupe entry for 'microsoftoffice 365 pers' and 'microsoftoffice 365 home' to _matches_list, so 'microsoftoffice 365 pers' will map to itself.

A small change has also been made to _matches_list so that duplicate entries are dropped.
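
To illustrate the idea outside the library, here is a minimal pure-Python sketch. It is not the code in this PR: the helper names partners and add_match below are hypothetical, pairs are stored as unordered frozensets (so the master/dupe direction is ignored), and a plain set stands in for _matches_list. It only shows how carrying prior matches along with a new match links 'microsoftoffice 365 pers' to 'microsoftoffice 365 home', and how a set naturally drops duplicates.

```python
def partners(matches, name):
    """Return every string already paired with `name` (hypothetical helper)."""
    return {
        other
        for pair in matches
        if name in pair
        for other in pair
        if other != name
    }


def add_match(matches, master, dupe):
    """Return a new set of pairs containing (master, dupe) plus the pairs
    implied by earlier matches on either side. Illustration only."""
    linked_to_master = {master} | partners(matches, master)
    linked_to_dupe = {dupe} | partners(matches, dupe)
    # Cross join both sides; frozenset makes each pair unordered,
    # and the enclosing set drops duplicates.
    new_pairs = {
        frozenset((m, d))
        for m in linked_to_master
        for d in linked_to_dupe
        if m != d
    }
    return matches | new_pairs


matches = set()
matches = add_match(matches, 'microsoft office', 'microsoftoffice 365 home')
matches = add_match(matches, 'microsoftoffice 365 pers', 'microsoft office')

# The implied pairing is now present alongside the two explicit ones.
implied = frozenset(('microsoftoffice 365 pers', 'microsoftoffice 365 home'))
print(implied in matches)
```

Without the prior-match step (that is, adding only the explicit pair each time), the implied pair would be missing, which mirrors the behaviour described above.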

Bergvca commented 4 years ago

This looks great! Thanks for finding and fixing this bug. Can I ask you to add a unit test? You could use the example above, for instance. If not, I can create one later.

taimursajid commented 4 years ago

Done.

Bergvca commented 4 years ago

Merged and updated on PyPI (version 0.1.2). Thank you very much!