Bergvca / string_grouper

Super Fast String Matching in Python
MIT License
362 stars 76 forks source link

Add functionality to include an ID column to the string matching. #19

Closed justasojourner closed 3 years ago

justasojourner commented 3 years ago

Hi @Bergvca - to make it easy for you I forked the code and made a branch (add_id) with changes.

The code extends the existing functions and adds a **kwarg option to add an id column.

It is called this way:

matches = match_strings(accounts['name'], id_col = accounts['account_id'])

It is working fine for me (doing specific task as above).

Trust it meets your approval.

justasojourner commented 3 years ago

Hi @Bergvca - all the coding changes to support IDs have been coded by @ParticularMiner and myself and committed. My colleague wrote all unit tests and added them to test_string_grouper.py. He also wrote a live testing script (just local, not committed to GitHub project) to test all functionality. We believe it should be complete.

Could you review.

justasojourner commented 3 years ago

Hello @Bergvca

Have been looking at the addition/updates to the code, specifically with respect to how the ID information is passed to the function. It seems that it may be better to pass them as optional parameters, the same as 'duplicates', rather than as a keyword argument.

So change the function call from this:

def match_strings(master: pd.Series, duplicates: Optional[pd.Series] = None, **kwargs) -> pd.DataFrame:

to this...

def match_strings(master: pd.Series, duplicates: Optional[pd.Series] = None, id_col: Optional[pd.Series] = None, dup_id_col: Optional[pd.Series] = None, **kwargs) -> pd.DataFrame:

or maybe more logical...

def match_strings(master: pd.Series, id_col: Optional[pd.Series] = None, duplicates: Optional[pd.Series] = None, dup_id_col: Optional[pd.Series] = None, **kwargs) -> pd.DataFrame:

What do you think? We wouldn't mind making the necessary changes and updating the unit tests as required.

Also would you like me to update the README to explain the new functionality, as part of the pull request commit you can review it.

justasojourner commented 3 years ago

OK, we made the design changes to pass the IDs as parameters to the StringGrouper class. At the same time we renamed them to something we felt was more clear, so master_id for master.

class StringGrouper(object):
    def __init__(self, master: pd.Series,
                 duplicates: Optional[pd.Series] = None,
                 master_id: Optional[pd.Series] = None,
                 duplicates_id: Optional[pd.Series] = None,
                 **kwargs):

We kept the existing parameter passing order for master then duplicates. The code for get_matches was made more succinct.

All tested, and required unit tests (amended for updated code) added to test_string_grouper.py.

justasojourner commented 3 years ago

Thanks! It can be merged, we've gone over it pretty throughly and are happy. I'm using it in a production use case I have — investigating thousands of dupes in a Salesforce account database. I went from multiple hours using repetitious looping over a 65K^2 iterations to String grouper processed 65K records in 3.6790 seconds.

If you would like us to looking at using the same logic for IDs into the other two functions I know @ParticularMiner would be happy to.

For the docs I don't mind making a draft that you can review.