DeNederlandscheBank / name_matching

What's the best way to match very similar sounding yet different entities? #13

Closed Nirvana2211 closed 1 year ago

Nirvana2211 commented 1 year ago

For example, if I have "A Energy Production Corporation" and "B Energy Production Corporation", the code will match both with a very high score. What's the best way to handle situations like this?

mnijhuis-dnb commented 1 year ago

If you have many of these, you can drop common words after calculating the scores. The matching itself still uses the full name, such as "A Energy Production Corporation", but afterwards the scores are adjusted by recalculating them without the most common words. In your example, the score would be based only on how well "A" and "B" match, because "energy", "production" and "corporation" are all removed as frequently occurring words. There is an option for this in the code: when initializing the NameMatcher, set the common_words bool to true and use the cut_off_no_scoring_words float to set the limit for removing the most frequently occurring words.
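
To make the idea concrete, here is a minimal standard-library sketch of rescoring after dropping frequent words. It is an illustration only, not the name_matching implementation: the helper names `find_common_words`, `strip_common` and `similarity` are hypothetical, `difflib.SequenceMatcher` stands in for the library's distance metrics, and the cutoff is assumed here to mean "fraction of names containing the word", which may differ from how cut_off_no_scoring_words is defined in the library.

```python
from collections import Counter
from difflib import SequenceMatcher


def find_common_words(names, cut_off):
    """Return words found in more than `cut_off` (a fraction) of the names."""
    # Count each word once per name so repeats inside one name do not inflate it.
    counts = Counter(word for name in names for word in set(name.lower().split()))
    return {word for word, n in counts.items() if n / len(names) > cut_off}


def strip_common(name, frequent):
    """Lowercase the name and drop every word listed in `frequent`."""
    return " ".join(w for w in name.lower().split() if w not in frequent)


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; not one of the library's metrics."""
    return SequenceMatcher(None, a, b).ratio()


names = [
    "A Energy Production Corporation",
    "B Energy Production Corporation",
    "Northwind Trading Company",
    "Contoso Shipping Limited",
]

# Hypothetical cutoff: a word is "common" if it appears in over 40% of names.
frequent = find_common_words(names, cut_off=0.4)

# Score on the full names, then again with the common words removed.
full_score = similarity(names[0].lower(), names[1].lower())
adjusted_score = similarity(
    strip_common(names[0], frequent),
    strip_common(names[1], frequent),
)

print(sorted(frequent))
print(round(full_score, 3), adjusted_score)
```

With this toy data the full-name score is close to 1, while the adjusted score compares only "a" with "b" and drops to 0, which is the effect described above.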

Nirvana2211 commented 1 year ago

Thank you! I will try out your suggestions.