Closed ParticularMiner closed 3 years ago
Very impressive! I will need some time to test this. Is this the same PR as requested to sparse_dot_topn? Is there an advantage of having this as part of string_grouper as opposed to be part of sparse_dot_topn?
I'm not very fluent in C++ so might not understand everything, but i guess the performance numbers speak for itself.
Btw - earlier i noticed that the sparse_dot_topn algorithm is very similar as the one in scipy (https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h - see csr_matmat ). Are your changes also potentially useful for scipy itself?
Thanks @Bergvca !
Take as much time as you need.
Is this the same PR as requested to sparse_dot_topn? Is there an advantage of having this as part of string_grouper as opposed to be part of sparse_dot_topn?
Indeed, it is the same PR. But it's been more than three weeks now and I haven't received a substantive reply from ingbank about the PR (I have no idea why), which is why I suggested it be a sub-module of string_grouper
.
I'm not very fluent in C++ so might not understand everything, but i guess the performance numbers speak for itself.
Btw - earlier i noticed that the sparse_dot_topn algorithm is very similar as the one in scipy (https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h - see csr_matmat ). Are your changes also potentially useful for scipy itself?
Interesting, I'll take a closer look at scipy's code and let you know.
Notify: @Bergvca
In short, this update will make the function
awesome_cossim_topn
ntop
if requested).The table below shows the timing results from this update:
Table of average time-durations of
awesome_cossim_topn
for the old (sparse_dot_topn ) and new (string_grouper_topn) versions over 16 random samples of matrix-pairs A and B.Time is measured in seconds on my 16GB RAM, Intel Core i7 CPU @2.20 GHz laptop.
(old version)
no extra info computed
(new version)
extra info computed
of new version
matrix_pairs = 16
ntop = 4 000 threshold = 0.01 nr_vocab = 263 = 17 576 (the number of columns of A = number of rows of B) density = 30/n_vocab = 0.0017 n_samples = 1 000 000 (the number of rows of A) n_duplicates = 4 000 (the number of columns of B) nnz(A) = 30 000 000 (the number of nonzero components of A) nnz(B) = 120 000 (the number of nonzero components of B)
nnz(A*B) = 188 446 798 ± 44 679 (mean and standard deviation of the total number of nonzero components of A*B above the threshold over 16 random matrix-pairs A and B)
ntop(A*B) = 391 ± 15 (mean and standard deviation of the true maximum number of nonzero components per row of A*B above the threshold over 16 random matrix-pairs A and B)