DeNederlandscheBank / name_matching

Allow ability to pass in list of company name suffixes to be stripped after your current preprocessing step. #23

Open anatesan-stream opened 5 months ago

anatesan-stream commented 5 months ago

Many companies in the same domain have common suffixes...
For example, among high-tech companies, many names contain words like

For example, I currently have "Cisco Systems" in the matching data and the string to be matched is "Cisco", but the match score is only 37%. If I could preprocess "Cisco Systems" to "Cisco", I think the match score would be higher.

I think we just need another parameter in the name_matcher constructor to pass in a custom set of words that will be stripped after punctuation, whitespace, etc. have been removed.
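As a rough illustration of the preprocessing being requested, here is a minimal sketch in plain Python that strips trailing suffix words before comparing names. The helper names `strip_suffix_words` and `similarity` are hypothetical and not part of name_matching, the suffix list just reuses the words mentioned elsewhere in this thread, and the score comes from the standard library's difflib as a stand-in, so the numbers will not match the 37% reported above.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix list for illustration only.
SUFFIX_WORDS = {"technology", "technologies", "systems"}


def strip_suffix_words(name, suffix_words=SUFFIX_WORDS):
    """Lower-case a name, drop punctuation, and remove trailing suffix words."""
    # Replace anything that is not a word character or whitespace with a space.
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    # Pop suffix words from the end only, and never empty the name completely.
    while len(tokens) > 1 and tokens[-1] in suffix_words:
        tokens.pop()
    return " ".join(tokens)


def similarity(a, b):
    """Stand-in score from difflib; not the metric name_matching uses."""
    return SequenceMatcher(None, a, b).ratio()


raw = similarity("cisco", "cisco systems")
cleaned = similarity(strip_suffix_words("Cisco"), strip_suffix_words("Cisco Systems"))
print(f"raw score: {raw:.2f}, score after stripping: {cleaned:.2f}")
```

Only trailing words are removed here, so a name such as "Systems Research Ltd" keeps its leading "systems". Whether suffix words should also be dropped from the middle of a name is a design choice this sketch leaves open.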

mnijhuis-dnb commented 5 months ago

This can be done by setting common_words to True; the most common words are then discounted when calculating the score. In the latest version (0.8.10), common_words can also be a list, so you can supply a custom set of words to be discounted.

When constructing the NameMatcher, the common_words argument can now be passed as a list; the words in this list won't count when calculating the score. For example: nm = NameMatcher(common_words=['technology','systems','technologies'])