Closed Nirvana2211 closed 1 year ago
If you have many of these, you can drop common words after calculating the scores. During the matching process the full name, like "A Energy Production Corporation", is still used; afterwards the scores are adjusted by recalculating them without the most common words. In your example, the score would be based only on how well "A" and "B" match, because "Energy", "Production" and "Corporation" are all removed as frequently occurring words. There is an option for this in the code: when initializing the NameMatcher, set the common_words bool to true and use the cut_off_no_scoring_words float to set the limit for removing the most frequently occurring words.
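To illustrate the idea (not the library's actual implementation), here is a minimal standalone sketch using only the Python standard library. The helper names `common_words`, `similarity` and `adjusted_similarity` are hypothetical, `difflib.SequenceMatcher` stands in for whatever distance metrics the matcher really uses, and treating `cut_off` as "fraction of names a word appears in" is an assumption made for this example; the library's cut_off_no_scoring_words may be defined differently.

```python
from collections import Counter
from difflib import SequenceMatcher


def common_words(names, cut_off):
    """Return the words that appear in more than `cut_off` (a fraction) of the names."""
    # Count each word once per name so repeats inside one name do not inflate it.
    counts = Counter(word for name in names for word in set(name.lower().split()))
    return {word for word, n in counts.items() if n / len(names) > cut_off}


def similarity(a, b):
    """Plain string similarity in [0, 1]; a stand-in for the real scoring."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def adjusted_similarity(a, b, drop):
    """Recalculate the similarity after removing the words in `drop`."""
    def strip(name):
        return " ".join(w for w in name.lower().split() if w not in drop)
    return similarity(strip(a), strip(b))


names = [
    "A Energy Production Corporation",
    "B Energy Production Corporation",
    "C Energy Production Corporation",
]

drop = common_words(names, cut_off=0.5)
full = similarity(names[0], names[1])                 # dominated by the shared words
adjusted = adjusted_similarity(names[0], names[1], drop)  # compares only "a" and "b"

print(sorted(drop))
print(round(full, 3), adjusted)
```

With the shared words removed, the two names no longer look alike, which is the effect the option above is meant to achieve.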
Thank you! I will try out your suggestions.
E.g., if I have "A Energy Production Corporation" and "B Energy Production Corporation", the code will match both with a very high score. What's the best way to handle a situation like this?