Hi @Bergvca,
I forgot to mention before: we can achieve a significant performance gain in the function match_most_similar() by letting sparse_dot_topn do most of the work for string_grouper. For instance, one test I ran with this method on the SEC EDGAR dataset took only 11 seconds, as opposed to 6 minutes without it (roughly a 33× speedup)!
How? Add the following line to the definition of match_most_similar():
kwargs['max_n_matches'] = 1
This lets sparse_dot_topn itself (instead of string_grouper) directly find the single most similar match in Series master for each string in Series duplicates. Afterwards, string_grouper only needs to handle, as usual, those duplicates for which no match was found.
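To make the idea concrete, here is a rough pure-Python sketch of "keep only the single best master match per duplicate, and leave unmatched duplicates alone". It is not string_grouper's actual code: the helper names (ngram_vector, cosine, most_similar) are hypothetical, raw trigram counts stand in for TF-IDF, and the 0.5 threshold is chosen only for this toy data.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=3):
    """Count character n-grams of a lowercased string (a stand-in for TF-IDF)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def most_similar(master, duplicates, min_similarity=0.8):
    """For each duplicate, keep only its single best master match (top-1).

    Duplicates whose best score is below min_similarity are returned unchanged.
    """
    master_vectors = [ngram_vector(m) for m in master]
    result = []
    for dup in duplicates:
        dup_vector = ngram_vector(dup)
        scores = [cosine(dup_vector, mv) for mv in master_vectors]
        # Top-1 selection: only the best-scoring master index is kept.
        best = max(range(len(master)), key=scores.__getitem__)
        result.append(master[best] if scores[best] >= min_similarity else dup)
    return result


# Hypothetical toy data, not taken from the SEC EDGAR test.
master = ["Apple Inc", "Microsoft Corporation"]
duplicates = ["Apple Inc.", "Microsoft Corp", "Tesla Motors"]
matches = most_similar(master, duplicates, min_similarity=0.5)
print(matches)
```

In this sketch "Tesla Motors" shares no trigram with either master string, so it falls through unchanged, which corresponds to the duplicates that string_grouper still has to deal with afterwards.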
Caution: to make this work I also had to swap the positions of the two input matrices passed to awesome_cossim_topn, because in the matrix product A*B it ranks only over the columns of matrix B, not over those of matrix A. In other words, A should be duplicates_matrix and B should be master_matrix (transposed, of course), so that the top match is selected per duplicate rather than per master string.
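The effect of the operand order can be seen in a small dense toy example. This is only an illustration under assumptions: matmul_transposed and top_n_per_row are hypothetical helpers written for this sketch, they mimic a top-n-per-row product in spirit, and they do not call sparse_dot_topn.

```python
def matmul_transposed(a, b):
    """Return a @ b.T for two row-major lists of equal-length row vectors."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def top_n_per_row(matrix, n):
    """Keep the n largest entries of each row as (column, value) pairs."""
    return [
        sorted(enumerate(row), key=lambda cv: cv[1], reverse=True)[:n]
        for row in matrix
    ]


# Hypothetical toy vectors: 2 master strings, 3 duplicates, 2 features each.
master_matrix = [[1.0, 0.0], [0.0, 1.0]]
duplicates_matrix = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]

# A = duplicates, B = master (transposed): one best master per duplicate.
per_duplicate = top_n_per_row(matmul_transposed(duplicates_matrix, master_matrix), 1)

# A = master, B = duplicates (transposed): one best duplicate per master,
# so some duplicates never receive a match at all.
per_master = top_n_per_row(matmul_transposed(master_matrix, duplicates_matrix), 1)

print([pairs[0][0] for pairs in per_duplicate])  # best master index per duplicate
print([pairs[0][0] for pairs in per_master])     # best duplicate index per master
```

With duplicates as the left operand every duplicate row gets exactly one candidate; with the operands the other way round there are only as many results as master rows, and the third duplicate is dropped entirely.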
The definition and default value of max_n_matches were updated accordingly.