@Bergvca
By the way, I finally got around to contributing my first SciPy code enhancement, as you suggested some time ago in #55. 😉
Hi @ParticularMiner
Many thanks again! I will test it as well obviously but it looks good to me.
I still don't get why this works though. Maybe there is some inefficiency in the algorithm itself? Another thing that could be related is that the awesome_topn library at some point added multi-threading, which was also added to string_grouper. I never really noticed a performance boost when using multiple threads, though. I haven't really looked into it, but I feel like the multithreading doesn't work so well. Maybe your change somehow improves it?
Hi @Bergvca
You're right: the awesome_topn library certainly runs sub-optimally, at least beyond a certain data-size range.
I'm trying to pin down the reason why, and after some reading here and there a rough picture is beginning to form in my mind: the chief suspect could be the so-called CPU cache size, a hardware specification often included in the list of laptop specifications when you buy one. (Mine is 6 MB.)
Accessing data from the CPU cache is the fastest memory-retrieval mechanism available at runtime. However, when this cache becomes too full, accesses to data not in the cache (termed "cache misses") happen more often, leading to an overall slowdown, since the data must then be fetched from RAM, which is much slower.
Further study/experimentation is required to confirm this. Moreover, I don't yet know how to use this knowledge to our advantage.
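To make the cache-miss hypothesis a bit more concrete, here is a minimal, self-contained sketch (pure NumPy timing code, not string_grouper code) that gathers array elements in random order for working sets of increasing size. If the hypothesis holds, the time per element should jump roughly where the working set outgrows the CPU cache; the sizes and repeat count below are arbitrary illustrative choices.

```python
import time

import numpy as np


def time_per_element(n_elements: int, repeats: int = 5) -> float:
    """Best-of-`repeats` time (seconds) per element to gather and sum
    `n_elements` float64 values in random order."""
    data = np.random.rand(n_elements)          # 8 bytes per element
    order = np.random.permutation(n_elements)  # random order defeats the prefetcher
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        _ = data[order].sum()                  # touch every element once
        best = min(best, time.perf_counter() - start)
    return best / n_elements


# Working sets from well below to well above a ~6 MB cache.
for n in (100_000, 500_000, 2_000_000, 10_000_000):
    mb = n * 8 / 1e6
    print(f"{mb:6.1f} MB working set: {time_per_element(n) * 1e9:6.2f} ns per element")
```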
Yes, that makes sense - the CPU sees large arrays so it puts them in RAM, and if it sees a small array it just puts it in cache directly, or something like that?
Anyway, I've tested the changes and they look good. I will merge this branch and upload the latest version to PyPI.
Hi @Bergvca
Following up on your suggestion, I performed a few more tests and discovered a curious result: the optimum value for the size of the right operand (found earlier to be 8 × 10⁴) does not change appreciably with the data, nor with the machine. It behaves almost like a melting- or boiling-point temperature, and I don't know why. In particular, I'm not sure whether it is an emergent statistical phenomenon or due to a still-obscure property of memory management in computers. See the plot below:
[Figure: log-log plots of runtime per string-pair comparison vs. right-operand size, for a fixed left-operand size of 10⁴ and option parameter `n_blocks=(1, 1)`, for strings in the data file sec__edgar_companies.csv, obtained on different virtual machines with the displayed physical-memory limits.]

I obtained these results on different instances (virtual machines) of Windows Subsystem for Linux version 2 (WSL2) set up on my Windows 10 laptop with different physical-memory limits. I also obtained similar results (not shown here) using a larger (> 7 million records) sample data set, companies_sorted.csv.

I have therefore updated the code accordingly, as you will see from the diffs. And this is how the performance looks after the change:
[Figure: log-log plots of runtime per string-pair comparison vs. right-operand size, for a fixed left-operand size of 10⁴ and option parameter `n_blocks='guess'` (the default), for strings in the data file sec__edgar_companies.csv, obtained on different virtual machines with the displayed physical-memory limits.]
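For what it's worth, here is a rough, self-contained sketch of the kind of measurement behind these plots: a fixed left operand multiplied against right-operand blocks of increasing width, timing the runtime per string-pair comparison. The random SciPy sparse matrices stand in for the TF-IDF matrices string_grouper actually builds, and the sizes, densities, and the plain `left @ right` product (without the top-n selection the library performs) are illustrative assumptions, not library code.

```python
import time

import scipy.sparse as sp

# Fixed left operand: 10_000 "strings" x 5_000 n-gram features
# (sizes and density are illustrative assumptions).
left = sp.random(10_000, 5_000, density=0.001, format="csr", random_state=0)

# Multiply against right-operand blocks of increasing width and record
# the runtime per string-pair comparison, as in the plots above.
for block_size in (10_000, 40_000, 80_000, 160_000, 320_000):
    right = sp.random(5_000, block_size, density=0.001, format="csr", random_state=1)
    start = time.perf_counter()
    _ = left @ right                 # one block of the full similarity product
    elapsed = time.perf_counter() - start
    per_pair = elapsed / (left.shape[0] * block_size)
    print(f"right-operand size {block_size:>7}: {per_pair * 1e9:8.3f} ns per comparison")
```

In string_grouper itself this corresponds to the `n_blocks` option shown in the captions: `n_blocks=(1, 1)` leaves both operands unsplit, while the new default `'guess'` chooses the block sizes automatically.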