ParticularMiner commented 3 years ago

Notify: @Bergvca

In short, this update will make the function awesome_cossim_topn

handle larger sparse matrices (potentially more than 100x larger) than the current memory-constrained limit;
run marginally (sometimes significantly) faster than the current version while also
computing extra information about the matrix-product (in particular, it will return the best ntop if requested).
Note: with this update the external package sparse_dot_topn is no longer a dependency of string_grouper.

The table below shows the timing results from this update:

Table of average time-durations of `awesome_cossim_topn` for the old (sparse_dot_topn ) and new (string_grouper_topn) versions over 16 random samples of matrix-pairs A and B.

Time is measured in seconds on my 16GB RAM, Intel Core i7 CPU @2.20 GHz laptop.

#threads	sparse_dot_topn (old version) no extra info computed	string_grouper_topn (new version) extra info computed	± percentage speed-increase of new version
0	48.357178	45.756128	+6%
1	84.805013	49.046593	+73%
2	58.365927	26.116717	+123%
3	48.586614	21.365131	+127%
4	44.523800	18.311739	+143%
5	42.053307	15.756574	+166%
6	40.212938	14.098908	+185%
7	39.031401	13.224633	+195%

matrix_pairs = 16

ntop = 4 000 threshold = 0.01 nr_vocab = 26³ = 17 576 (the number of columns of A = number of rows of B) density = 30/n_vocab = 0.0017 n_samples = 1 000 000 (the number of rows of A) n_duplicates = 4 000 (the number of columns of B) nnz(A) = 30 000 000 (the number of nonzero components of A) nnz(B) = 120 000 (the number of nonzero components of B)

nnz(A*B) = 188 446 798 ± 44 679 (mean and standard deviation of the total number of nonzero components of A*B above the threshold over 16 random matrix-pairs A and B)

ntop(A*B) = 391 ± 15 (mean and standard deviation of the true maximum number of nonzero components per row of A*B above the threshold over 16 random matrix-pairs A and B)

Bergvca commented 3 years ago

Very impressive! I will need some time to test this. Is this the same PR as requested to sparse_dot_topn? Is there an advantage of having this as part of string_grouper as opposed to be part of sparse_dot_topn?

I'm not very fluent in C++ so might not understand everything, but i guess the performance numbers speak for itself.

Btw - earlier i noticed that the sparse_dot_topn algorithm is very similar as the one in scipy (https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h - see csr_matmat ). Are your changes also potentially useful for scipy itself?

ParticularMiner commented 3 years ago

Thanks @Bergvca !

Take as much time as you need.

Is this the same PR as requested to sparse_dot_topn? Is there an advantage of having this as part of string_grouper as opposed to be part of sparse_dot_topn?

Indeed, it is the same PR. But it's been more than three weeks now and I haven't received a substantive reply from ingbank about the PR (I have no idea why), which is why I suggested it be a sub-module of string_grouper.

I'm not very fluent in C++ so might not understand everything, but i guess the performance numbers speak for itself.

Btw - earlier i noticed that the sparse_dot_topn algorithm is very similar as the one in scipy (https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h - see csr_matmat ). Are your changes also potentially useful for scipy itself?

Interesting, I'll take a closer look at scipy's code and let you know.

Bergvca / string_grouper

integrated external package sparse_dot_topn into string_grouper as dedicated sub-package string_grouper_topn #55

Table of average time-durations of `awesome_cossim_topn` for the old (sparse_dot_topn ) and new (string_grouper_topn) versions over 16 random samples of matrix-pairs A and B.

matrix_pairs = 16

Bergvca / string_grouper

integrated external package sparse_dot_topn into string_grouper as dedicated sub-package string_grouper_topn #55

Table of average time-durations of awesome_cossim_topn for the old (sparse_dot_topn ) and new (string_grouper_topn) versions over 16 random samples of matrix-pairs A and B.

matrix_pairs = 16

Table of average time-durations of `awesome_cossim_topn` for the old (sparse_dot_topn ) and new (string_grouper_topn) versions over 16 random samples of matrix-pairs A and B.