DeNederlandscheBank / name_matching


Indices not matching original data. #15

Closed: waleed-aly1 closed this issue 10 months ago

waleed-aly1 commented 1 year ago

@mnijhuis-dnb Thank you so much for this library. I'm very new to Python and this is one of my first projects. Your library was extremely clear, useful and easy to understand and follow. It's great work so I just wanted to mention that first.

I've spent the better part of a month building a database/matching process. I'm attempting to match a list of names from a database table called company_directory against names in an Excel file (which are imported via a custom method). Everything seems to be working correctly, and the matched names appear to be the right matches. However, the index is always off from the original data, and I can't find any consistent pattern in how it is off (i.e. it's not off by a certain number in every instance).

I'm not sure if this is a known error or something I'm doing wrong on my end, but this is the absolute last piece of the puzzle for me, so if I can figure this out, it'll essentially complete my project. Any assistance would go such a long way. I'd be happy to pay hourly to set up a screenshare to walk through it as well if that's preferred, as I don't want to take advantage of anyone's time.

Thank you so much!!

def match_names_to_db(fileloc, user, pw, host, db):
    # pull the names to be matched from database
    db_pull = DatabaseUpdater(user=user, pw=pw, host=host, db=db)
    database_names = db_pull.fetch_columns_from_table(table_name='company_directory', column_names=['id', 'company_name'])
    database_names.set_index('id', inplace=True)
    # get names to be matched from Excel file
    tracker_names = data_frame_from_xlsx_range(fileloc, 'tracker_names_to_match')
    tracker_names_unchanged = tracker_names.copy(deep=True)

    # initialize and run name matcher
    matcher = NameMatcher(top_n=50, lowercase=True, punctuations=True, remove_ascii=True, legal_suffixes=True,
                          common_words=True, number_of_matches=5)

    matcher.set_distance_metrics(['overlap',
                                  'weighted_jaccard',
                                  'ratcliff_obershelp',
                                  'fuzzy_wuzzy_token_sort',
                                  'editex',
                                  'discounted_levenshtein'])

    matcher.load_and_process_master_data('company_name', database_names, transform=True)
    matches = matcher.match_names(to_be_matched=tracker_names, column_matching='Tracker_Name')

    # write the matches returned by NameMatcher to an Excel file
    matches.to_excel('test_with_db_pull1.xlsx')

mnijhuis-dnb commented 1 year ago

Thank you for pointing out the error! In the current version, the match index is a running index from 0 to the number of rows, i.e. a row position in the master data rather than your original index. To correct the index, if the matches you got from the code are called matches and your original data is called data, you could use the following line:

matches.match_index = data.index[matches['match_index'].astype(int)]

This should be fixed in a future version.
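
The workaround above can be tried on toy data without running the matcher. This is only a sketch: the two data frames are hypothetical stand-ins (the ids, company names and the `original_name`/`match_name` columns are made up), and it assumes the result has a single column called `match_index`, as in the reply. With `number_of_matches` greater than 1 the match columns may be named differently, so check `matches.columns` first. In the code from the question, `data` would correspond to `database_names` after `set_index('id')`.

```python
import pandas as pd

# Hypothetical master data, indexed by database id (like database_names after set_index('id')).
data = pd.DataFrame(
    {"company_name": ["Acme Corp", "Globex BV", "Initech Ltd"]},
    index=pd.Index([101, 205, 342], name="id"),
)

# Hypothetical matcher output: match_index holds row positions (0, 1, 2, ...),
# not database ids. It is stored as float here, hence the astype(int) below.
matches = pd.DataFrame(
    {
        "original_name": ["Initech", "Acme"],
        "match_name": ["Initech Ltd", "Acme Corp"],
        "match_index": [2.0, 0.0],
    }
)

# Translate each row position into the corresponding label of the original index.
positions = matches["match_index"].astype(int).to_numpy()
matches["match_index"] = data.index[positions].to_numpy()

# The remapped values can now be used to look rows up in the original data.
looked_up = data.loc[matches["match_index"], "company_name"].tolist()
print(matches)
```

After the remapping, `match_index` holds values from the `id` index, so `data.loc[...]` returns the matched rows directly. If the master data is filtered or re-sorted after matching, the positions no longer line up, so the remapping should be done against the same frame that was passed to `load_and_process_master_data`.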