Implement multiprocessing speedup for `suggest` in absence of python-Levenshtein

bskinn commented 3 years ago

NOTE: With the deprecation of the python-Levenshtein speedup for the suggest functionality (see #211 & #218), identifying other methods to increase performance is a priority. This multi-processing based approach is the best one I've thought of so far. If anyone has another suggestion, please open a new issue to discuss it.

Very early POC on suggest-multiproc branch.

Suggests some speedup possible, but not a slam-dunk whether it's worth it given the sub-2s processing times for most inventories. Properly exploiting difflib.SequenceMatcher's caching behavior may change this, however.

For comparison, python-Levenshtein is a better speedup for less internal complexity, and doesn't appear to benefit at all from multiproc.

Latitude laptop (4 cores), yt (17420 objects)

Quick, rough timing:

>>> import sphobjinv as soi
>>> import timeit
>>> timeit.timeit("soi.Inventory('tests/resource/objects_yt.inv').suggest('ndarray')", globals=globals(), number=1)

NO mp,      WITH lev:  2.177 s
WITH mp(4), WITH lev:  2.396 s
WITH mp(6), WITH lev:  2.355 s

NO mp,      NO lev:   11.795 s
WITH mp(2), NO lev:   10.361 s
WITH mp(3), NO lev:    8.471 s
WITH mp(4), NO lev:    7.583 s (~35% speedup)
WITH mp(5), NO lev:    7.399 s (oversubscribed)
WITH mp(6), NO lev:    8.372 s (oversubscribed)

Notes

Probably can switch to pool.map(), without the context manager
Looks like using multiprocessing.cpu_count() would be a reasonable default pool size
Want user to be able to set max of process count (probably refuse to oversubscribe??)
- API -- arg to Inventory.suggest(), likely
- CLI -- new parameter for suggest subparser
If nproc == 1 then skip multiprocessing entirely
Probably need to check at sphobjinv import-time whether multiprocessing is available
- Doing that in a way that doesn't further slow down the import time may be tricky
Per the difflib docs, implementing a bespoke scoring function directly with difflib.SequenceMatcher may allow significant speed gains, due to the ability to cache the suggest search term
For v2.3, perhaps keep the current single-process fuzzywuzzy method as the default, but implement with a feature flag to switch to the mp-accelerated version?
A naive, direct use of difflib.SequenceMatcher does not give good suggestions for some searches (e.g., "dataobj" in the sphobjinv inventory), compared to the default, WRatio-based partial-matcher in fuzzywuzzy
- Going to want to port this fuzzywuzzy method directly into the codebase, adapted to allow for multiprocess, I think.