Bergvca / string_grouper

Super Fast String Matching in Python
MIT License

set max_n_matches=1 in match_most_similar() for a significant performance boost #60

ParticularMiner closed 2 years ago

ParticularMiner commented 3 years ago

Hi @Bergvca,

I forgot to mention this earlier: we can get a significant performance gain in match_most_similar() by letting sparse_dot_topn do most of the work for string_grouper. For instance, one test I ran with this method on the SEC EDGAR dataset took only 11 seconds, compared with 6 minutes without it (a roughly 33× speedup)!

How? Add the following line to the definition of match_most_similar():

kwargs['max_n_matches'] = 1

This lets sparse_dot_topn itself (rather than string_grouper) directly find the single most similar match in Series master for each string in Series duplicates. Afterwards, string_grouper only needs to handle, as usual, those duplicates for which no match was found.
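
To make the idea concrete, here is a minimal pure-Python sketch of what "keep only the single best match per duplicate" amounts to. It is not string_grouper's or sparse_dot_topn's actual code: the function name most_similar, the dense toy vectors, and the 0.75 threshold are all made up for illustration, and I am assuming that unmatched duplicates are simply kept as themselves.

```python
def dot(u, v):
    """Dot product of two equal-length dense vectors."""
    return sum(a * b for a, b in zip(u, v))


def most_similar(master_vecs, duplicate_vecs, min_similarity):
    """For each duplicate, return the index of its best master, or None.

    Only the single highest score per duplicate is kept, which is the
    effect of max_n_matches=1. Vectors are assumed to be unit-normalized,
    so the dot product equals the cosine similarity.
    """
    result = []
    for dup in duplicate_vecs:
        best_idx, best_score = None, -1.0
        for idx, mas in enumerate(master_vecs):
            score = dot(dup, mas)
            if score >= min_similarity and score > best_score:
                best_idx, best_score = idx, score
        result.append(best_idx)
    return result


# Toy data (hypothetical): two master strings, three duplicates.
master_strings = ["alpha", "beta"]
duplicate_strings = ["betta", "alpha", "gamma"]
master_vecs = [[1.0, 0.0], [0.0, 1.0]]
duplicate_vecs = [[0.6, 0.8], [1.0, 0.0], [0.7071, 0.7071]]

best = most_similar(master_vecs, duplicate_vecs, min_similarity=0.75)

# Duplicates without a match fall back to themselves (assumed behaviour).
output = [
    master_strings[i] if i is not None else dup
    for i, dup in zip(best, duplicate_strings)
]
```

In this toy run the third duplicate scores below the threshold against both masters, so it is the only one left for the fallback step.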

Caution: To make this work I also had to swap the positions of the two input matrices passed to awesome_cossim_topn. In the matrix product A*B, awesome_cossim_topn keeps the top matches for each row of A, ranking across the columns contributed by B; it does not rank per row of B. In other words, A should be duplicates_matrix and B should be master_matrix (transposed, of course).
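
A small sketch of why the argument order matters, again in plain Python rather than with sparse_dot_topn itself. The helper top1_per_row and the toy vectors are hypothetical; the helper only mimics keeping one entry per row of the product A*B^T.

```python
def top1_per_row(a_rows, b_rows):
    """For each row of A, return the index of the row of B with the
    largest dot product, i.e. the top-1 entry per row of A * B^T."""
    best = []
    for a in a_rows:
        scores = [sum(x * y for x, y in zip(a, b)) for b in b_rows]
        best.append(max(range(len(scores)), key=scores.__getitem__))
    return best


# Toy unit vectors (hypothetical): two masters, three duplicates.
master = [[1.0, 0.0], [0.0, 1.0]]
duplicates = [[0.6, 0.8], [0.8, 0.6], [0.0, 1.0]]

# A = duplicates, B = master: one best master for every duplicate.
per_duplicate = top1_per_row(duplicates, master)

# A = master, B = duplicates: one best duplicate for every master,
# so some duplicates may not be selected by any master at all.
per_master = top1_per_row(master, duplicates)
unselected = [i for i in range(len(duplicates)) if i not in per_master]
```

With duplicates as A, every duplicate gets exactly one candidate master. With the order reversed, the top-1 cut is applied per master instead, and here the first duplicate is dropped even though it has a reasonable match.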

The definition and default value of max_n_matches were updated accordingly.