Bergvca / string_grouper

Super Fast String Matching in Python
MIT License
364 stars 76 forks source link

integrated external package sparse_dot_topn into string_grouper as dedicated sub-package string_grouper_topn #55

Closed ParticularMiner closed 3 years ago

ParticularMiner commented 3 years ago

Notify: @Bergvca

In short, this update will make the function awesome_cossim_topn

  1. handle larger sparse matrices (potentially more than 100x larger) than the current memory-constrained limit;
  2. run marginally (sometimes significantly) faster than the current version while also
  3. computing extra information about the matrix-product (in particular, it will return the best ntop if requested).
  4. Note: with this update the external package sparse_dot_topn is no longer a dependency of string_grouper.

The table below shows the timing results from this update:

Table of average time-durations of awesome_cossim_topn for the old (sparse_dot_topn ) and new (string_grouper_topn) versions over 16 random samples of matrix-pairs A and B.

Time is measured in seconds on my 16GB RAM, Intel Core i7 CPU @2.20 GHz laptop.

#threads sparse_dot_topn
(old version)

no extra info computed
string_grouper_topn
(new version)

extra info computed
± percentage speed-increase
of new version
0 48.357178 45.756128 +6%
1 84.805013 49.046593 +73%
2 58.365927 26.116717 +123%
3 48.586614 21.365131 +127%
4 44.523800 18.311739 +143%
5 42.053307 15.756574 +166%
6 40.212938 14.098908 +185%
7 39.031401 13.224633 +195%

matrix_pairs = 16

ntop = 4 000 threshold = 0.01 nr_vocab = 263 = 17 576 (the number of columns of A = number of rows of B) density = 30/n_vocab = 0.0017 n_samples = 1 000 000 (the number of rows of A) n_duplicates = 4 000 (the number of columns of B) nnz(A) = 30 000 000 (the number of nonzero components of A) nnz(B) = 120 000 (the number of nonzero components of B)

nnz(A*B) = 188 446 798 ± 44 679 (mean and standard deviation of the total number of nonzero components of A*B above the threshold over 16 random matrix-pairs A and B)

ntop(A*B) = 391 ± 15 (mean and standard deviation of the true maximum number of nonzero components per row of A*B above the threshold over 16 random matrix-pairs A and B)

Bergvca commented 3 years ago

Very impressive! I will need some time to test this. Is this the same PR as requested to sparse_dot_topn? Is there an advantage of having this as part of string_grouper as opposed to be part of sparse_dot_topn?

I'm not very fluent in C++ so might not understand everything, but i guess the performance numbers speak for itself.

Btw - earlier i noticed that the sparse_dot_topn algorithm is very similar as the one in scipy (https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h - see csr_matmat ). Are your changes also potentially useful for scipy itself?

ParticularMiner commented 3 years ago

Thanks @Bergvca !

Take as much time as you need.

Is this the same PR as requested to sparse_dot_topn? Is there an advantage of having this as part of string_grouper as opposed to be part of sparse_dot_topn?

Indeed, it is the same PR. But it's been more than three weeks now and I haven't received a substantive reply from ingbank about the PR (I have no idea why), which is why I suggested it be a sub-module of string_grouper.

I'm not very fluent in C++ so might not understand everything, but i guess the performance numbers speak for itself.

Btw - earlier i noticed that the sparse_dot_topn algorithm is very similar as the one in scipy (https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h - see csr_matmat ). Are your changes also potentially useful for scipy itself?

Interesting, I'll take a closer look at scipy's code and let you know.