Open anatesan-stream opened 5 months ago
This can be done by setting common_words to True; the most common words will then be discounted when calculating the score. In the latest version (0.8.10), common_words can also be a list, so you can supply a custom set of words that should be discounted.
When constructing the name_matcher, the common_words argument can now be passed as a list. The words in this list won't count when calculating the score. This can be done as follows:
nm = NameMatcher(common_words=['technology','systems','technologies'])
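To give a feel for what discounting a word list does to a score, here is a minimal standalone sketch. It does not use name_matching or reproduce how the library scores names internally; it is just a plain word-overlap (Jaccard) score that ignores any word in the list. The function name discounted_jaccard is hypothetical and only for illustration.

```python
def discounted_jaccard(name_a, name_b, common_words=()):
    """Word-set Jaccard overlap, ignoring any word listed in common_words.

    Illustrative only: this is NOT the scoring used by name_matching.
    """
    ignored = {w.lower() for w in common_words}
    tokens_a = {t for t in name_a.lower().split() if t not in ignored}
    tokens_b = {t for t in name_b.lower().split() if t not in ignored}
    if not tokens_a and not tokens_b:
        # Nothing left to compare once the ignored words are removed.
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Same word list as in the NameMatcher call above.
words = ["technology", "systems", "technologies"]

plain = discounted_jaccard("Cisco", "Cisco Systems")
discounted = discounted_jaccard("Cisco", "Cisco Systems", words)
print(plain, discounted)
```

With the list applied, "Systems" no longer drags the overlap down, so the two names compare as identical under this toy metric.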
Many companies in the same domain have common suffixes...
For example, among high-tech companies, many have words like
For example, I currently have "Cisco Systems" in the matching data and my string to be matched is "Cisco", but the match score is only 37%. If I could preprocess "Cisco Systems" to "Cisco", I think the match score would be higher.
I think we just need another parameter in the name_matcher constructor to pass in a custom set of words, which would be stripped from the names after punctuation, whitespace, etc. have been removed.
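Roughly what I have in mind, as a standalone sketch. The helper names strip_custom_words and similarity are hypothetical, and difflib's ratio is only a stand-in for the library's own metrics, so the numbers will not match the 37% above.

```python
import re
from difflib import SequenceMatcher


def strip_custom_words(name, custom_words):
    """Lowercase, drop punctuation, collapse whitespace, then remove custom words."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = cleaned.split()
    ignored = {w.lower() for w in custom_words}
    kept = [t for t in tokens if t not in ignored]
    # If every word was stripped, fall back to the cleaned name so it is not empty.
    return " ".join(kept) if kept else " ".join(tokens)


def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT the metric name_matching uses."""
    return SequenceMatcher(None, a, b).ratio()


custom_words = ["systems", "technology", "technologies"]

raw_score = similarity("cisco", "cisco systems")
stripped = strip_custom_words("Cisco Systems", custom_words)
stripped_score = similarity("cisco", stripped)
print(stripped, raw_score, stripped_score)
```

The fallback for names made up entirely of stripped words is a design choice on my side; without it a name like "Systems Technology" would become an empty string and could not be matched at all.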