DeNederlandscheBank / name_matching

Matching data indices are not respected #14

Closed mentoc3000 closed 10 months ago

mentoc3000 commented 1 year ago

First, thanks for a great package! I've found it very useful.

I've been using it with some data that has non-sequential indices, which causes the name matching to fail. See the example below. It looks like there's an implicit assumption that the indices of df_companies_a are sequential integers starting from 0.

import pandas as pd
from name_matching.name_matcher import NameMatcher

# define a dataset with bank names
df_companies_a = pd.DataFrame({'Company name': [
        'Industrial and Commercial Bank of China Limited',
        'China Construction Bank',
        'Agricultural Bank of China',
        'Bank of China',
        'JPMorgan Chase',
        'Mitsubishi UFJ Financial Group',
        'Bank of America',
        'HSBC',
        'BNP Paribas',
        'Crédit Agricole']})
df_companies_a["idx"] = range(100, 100 + len(df_companies_a))
df_companies_a = df_companies_a.set_index("idx")

# alter each of the bank names a bit to test the matching
df_companies_b = pd.DataFrame({'name': [
        'Bank of China Limited',
        'Mitsubishi Financial Group',
        'Construction Bank China',
        'Agricultural Bank',
        'Bank of Amerika',
        'BNP Parisbas',
        'JP Morgan Chase',
        'HSCB',
        'Industrial and Commercial Bank of China',
        'Credite Agricole']})
df_companies_b["idx"] = range(200, 200 + len(df_companies_b))
df_companies_b = df_companies_b.set_index("idx")

# initialise the name matcher
matcher = NameMatcher(number_of_matches=1, 
                      legal_suffixes=True, 
                      common_words=False, 
                      top_n=50, 
                      verbose=True)

# adjust the distance metrics to use
matcher.set_distance_metrics(['bag', 'typo', 'refined_soundex'])

# load the data to which the names should be matched
matcher.load_and_process_master_data(column='Company name',
                                     df_matching_data=df_companies_a, 
                                     transform=True)

# perform the name matching on the data you want matched
matches = matcher.match_names(to_be_matched=df_companies_b, 
                              column_matching='name')
print(matches)

# combine the datasets based on the matches
combined = pd.merge(df_companies_a, matches, how='left', left_index=True, right_on='match_index')
combined = pd.merge(combined, df_companies_b, how='left', left_index=True, right_index=True)

print(combined)

I haven't looked into this issue in detail, but might it be caused by flattening the data into _vec to speed up the ngram matching? If that's the case and there's no workaround for the indices, a heads-up in the documentation would be helpful.
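One possible workaround, sketched here under the assumption that the matcher numbers the master rows from 0: move the original labels into a column with `reset_index()` before loading the data, so that a positional `match_index` coincides with the DataFrame index. The shortened data and the name `df_master` are illustrative only.

```python
import pandas as pd

# Shortened master data with a non-sequential index, as in the example above
df_companies_a = pd.DataFrame(
    {"Company name": ["Bank of China", "HSBC", "BNP Paribas"]},
    index=[100, 101, 102],
)
df_companies_a.index.name = "idx"

# Move the original labels into an "idx" column; the new index is 0, 1, 2, ...
df_master = df_companies_a.reset_index()

# df_master would then be passed as df_matching_data instead of df_companies_a.
# If match_index is positional (an assumption), it now lines up with
# df_master's index, and the original label is still available in "idx".
print(df_master)
```

After merging on `match_index` as before, the original label can be read from the `idx` column.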

mnijhuis-dnb commented 1 year ago

Yes, that was indeed overlooked; thanks a lot for finding this! The error occurs when performing the sparse cosine similarity calculation. The data is stored in a sparse matrix without any index, so the original index is lost at this step and a new one starting at 0 is assigned. I think it would be good to alter the code so that it returns the actual index values rather than a new row number. I will have a look at where the best place would be to substitute the index back in.
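Until the code is changed, the row numbers can be translated back to the original labels by hand. This is a minimal sketch, assuming `match_index` holds 0-based row positions into the master data as described above. The `matches` frame is hand-made stand-in data, not actual output of `match_names`, and the `match_label` column is a hypothetical name.

```python
import pandas as pd

# Shortened master data with a non-sequential index
df_companies_a = pd.DataFrame(
    {"Company name": ["Bank of China", "HSBC", "BNP Paribas"]},
    index=[100, 101, 102],
)

# Hypothetical stand-in for the matcher output; match_index is assumed
# to be a row position in the master data, not an index label
matches = pd.DataFrame(
    {"original_name": ["HSCB", "BNP Parisbas", "Bank of China Limited"],
     "match_index": [1, 2, 0]},
    index=[200, 201, 202],
)

# Translate row positions into the labels of the master data
positions = matches["match_index"].to_numpy()
matches["match_label"] = df_companies_a.index.to_numpy()[positions]

# Merge on the recovered labels instead of the positional match_index
combined = pd.merge(df_companies_a, matches, how="left",
                    left_index=True, right_on="match_label")
print(combined)
```

The same lookup could be done inside the package right after the similarity step, which is where the labels are dropped.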