Closed flixha closed 1 year ago
Fails are coming from the need for the joblib dependency. I'm not averse to adding joblib as a dependency, but if we can avoid it then that would be great. Do you know why joblib has such an advantage over direct multiprocessing? Because the mad function is just calling numpy funcs, this could also be a use-case for multithreading, given that numpy should release the GIL.
This would also be a fun case to write a quick and simple C function for, to compare speed.
Timings from me:

- Serial: `%timeit np.median(np.abs(cccsums), axis=1) * threshold` (note that `cccsums` should be a numpy array already I think?)
- `ThreadPoolExecutor` from `concurrent.futures`: `%timeit executor = ThreadPoolExecutor(); mads = executor.map(_mad, cccsums); executor.shutdown(); mads = [threshold * mad for mad in mads]`
- `ProcessPoolExecutor` (equivalent of multiprocessing): `%timeit executor = ProcessPoolExecutor(); mads = executor.map(_mad, cccsums); executor.shutdown(); mads = [threshold * mad for mad in mads]`

The executor versions include the multiplication by `threshold`, which shouldn't add much. Based on this and the additional dependency, I'm going to suggest a change to using a `ThreadPoolExecutor` for this. The joy of numpy releasing the GIL!
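For reference, the `ThreadPoolExecutor` route can be sketched end-to-end as below. This is a minimal, self-contained version: the `_mad` helper, the threshold value, and the array shapes are illustrative stand-ins, not EQcorrscan's actual internals.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def _mad(cccsum):
    """Median absolute deviation of one cross-correlation sum.

    numpy releases the GIL inside abs/median, so several of these
    calls can genuinely run in parallel on a thread pool.
    """
    return np.median(np.abs(cccsum))

threshold = 8.0  # illustrative value
rng = np.random.default_rng(42)
cccsums = rng.standard_normal((16, 50_000))  # templates x samples

with ThreadPoolExecutor() as executor:
    mads = [threshold * mad for mad in executor.map(_mad, cccsums)]

# The threaded result matches the vectorised serial computation
serial = np.median(np.abs(cccsums), axis=1) * threshold
assert np.allclose(mads, serial)
```

Because the workers share the parent's memory, no pickling of `cccsums` is needed, which is where the process-based approaches lose time.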
I had been searching for alternative solutions and ways to best parallelize numpy operations for quite a while - but obviously missed how much thread pools actually help because numpy releases the GIL. Your suggestion is of course much cleaner in not needing joblib, and being another 2.5 times faster it's a lot better. So a total speedup of 5x sounds very good, and closer to what I was expecting should be possible here :-). Thanks a lot!
And while the joblib-parallelization quickly saturates with the number of cores because it needs to transfer a lot of memory, the threaded execution continues to scale with more cores - so for a cccsums of 2000000 x 500 elements I got a speedup of 8x over joblib (0.75 s vs 6.1 s), or 15x over serial with 40 cores - awesome!
Great, thanks for merging this! I would love to take advantage of numpy releasing the GIL in more places (and this is one of my internal arguments for redesigning EQcorrscan objects so that they immediately work on numpy array rather than obspy streams, but that is something for when I have more time...).
What does this PR do?
Implements a 2.5x speedup for the MAD threshold calculation in `core.match_filter.matched_filter`. With this PR, MAD thresholds can be calculated in parallel; with joblib this is already worth it for a relatively small dataset (> 15 cccsum or 2e7 values in `cccsums`). This has a noticeable effect for larger datasets, e.g., for 2000 templates this reduced the MAD calculation from 50 s to 20 s with 30 cores.
Why was it initiated? Any relevant Issues?
This PR contributes to the summary issue in https://github.com/eqcorrscan/EQcorrscan/issues/522
Examples for comparing serial and parallel MAD:
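The PR's actual benchmark script is not reproduced in this thread; a rough stand-in for comparing serial against threaded MAD might look like the following (the `_mad` helper and the array sizes are made up for illustration, and the measured speedup will depend on array size and core count):

```python
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def _mad(cccsum):
    # Median absolute deviation of one cross-correlation sum
    return np.median(np.abs(cccsum))

rng = np.random.default_rng(0)
cccsums = rng.standard_normal((64, 200_000))  # templates x samples

# Serial baseline
t0 = time.perf_counter()
serial = [_mad(c) for c in cccsums]
t_serial = time.perf_counter() - t0

# Threaded version: numpy releases the GIL, so threads overlap
t0 = time.perf_counter()
with ThreadPoolExecutor() as executor:
    threaded = list(executor.map(_mad, cccsums))
t_threaded = time.perf_counter() - t0

assert np.allclose(serial, threaded)
print(f"serial: {t_serial:.3f} s, threaded: {t_threaded:.3f} s")
```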
PR Checklist

- [ ] `develop` base branch selected?
- [ ] Changes have been added to `CHANGES.md`.
- ~~First time contributors have added your name to `CONTRIBUTORS.md`.~~