Hi 👋! Thanks for the issue. I'm offline for a few days but will take a look ASAP. Thanks for your patience!
@enajx I ended up with some time tonight, so I did some tests 😅.
First off, I can reproduce what you see on macOS. I can think of several reasons why this might be happening, but I don't know exactly which is responsible. That said, I did some experiments and have some recommendations.
My first thought is that your worker function is very fast to execute, so it's possible that you are dominated by MPI overheads, either in passing messages for the large number of tasks or in the pool waiting for workers to finish before sending new tasks. So, the first experiment I tried was to slow down your worker function by adding a time.sleep(1e-4) just under the function definition (sketched below, after the timings). That made the execution times much closer:
MPI:
$ mpiexec -n 4 python demo.py --mpi
Elapsed time n=100: 0.0066378116607666016
Elapsed time n=1000: 0.054388999938964844
Elapsed time n=10000: 0.5381379127502441
Multiprocessing (MPI with -n 4 only uses 3 workers, so it's fairer to compare to 3 cores):
$ python demo.py --ncores=3
Elapsed time n=100: 0.48464012145996094
Elapsed time n=1000: 0.04478788375854492
Elapsed time n=10000: 0.4312679767608642
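For concreteness, the sleep change is just something like this (the worker body below is a stand-in, not the demo's actual function):

import time

def worker(task):
    # Simulate a worker that does a non-trivial amount of work per task;
    # without the sleep, per-task MPI messaging overhead dominates the runtime.
    time.sleep(1e-4)
    return task * 2  # placeholder computation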
Batched map'ing
If your worker function is inherently very fast to execute and you just have a ton of tasks to execute on, I have gotten much better performance by first batching up the tasks and sending batches of tasks to the worker function.
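Roughly what I mean by batching is the following (illustrative only; do_one_task, tasks, and pool are placeholders for whatever your script uses):

def worker_batch(task_batch):
    # One map call now processes a whole batch, so each message sent to a
    # worker carries many tasks instead of one.
    return [do_one_task(task) for task in task_batch]

n_batches = 128  # worth tuning
# Round-robin split; use contiguous chunks instead if result order matters.
batches = [tasks[i::n_batches] for i in range(n_batches)]
nested = pool.map(worker_batch, batches)
results = [item for batch in nested for item in batch]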
MPI:
$ mpiexec -n 4 python schw-test.py --mpi
Elapsed time n=100: 0.001007080078125
Elapsed time n=1000: 0.003452777862548828
Elapsed time n=10000: 0.03164792060852051
Elapsed time n=100000: 0.29506802558898926
Elapsed time n=1000000: 2.743013858795166
Multiprocessing:
$ python schw-test.py --ncores=3
Elapsed time n=100: 0.4871842861175537
Elapsed time n=1000: 0.001354217529296875
Elapsed time n=10000: 0.006591081619262695
Elapsed time n=100000: 0.07938289642333984
Elapsed time n=1000000: 0.699350118637085
So, MPI is still slower here, but closer than in your initial example (though you could probably get slightly better performance by tuning the number of batches).
That makes sense!
I ran the benchmark again with a more realistic worker function and now I get more consistent results, in this case with MPI taking the lead:
Computing pi, running on OSX with 8 cores:
n = 1000000 MultiPool : 0.54s MPIPool : 0.12s
n = 10000000 MultiPool : 1.45s MPIPool : 1.22s
import math
import random
import time

import schwimmbad


def sample(num_samples):
    # Count how many uniformly drawn points in [-1, 1]^2 land inside the unit circle.
    num_inside = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1
    return num_inside


def approximate_pi_parallel(num_samples, cores, mpi):
    # Each task handed to the pool is a batch of samples, so every
    # map call does a meaningful amount of work.
    sample_batch_size = 1000
    with schwimmbad.choose_pool(mpi=mpi, processes=cores) as pool:
        print(pool)
        start = time.time()
        num_inside = 0
        batches = [sample_batch_size for _ in range(num_samples // sample_batch_size)]
        for result in pool.map(sample, batches):
            num_inside += result
        print(f"pi ~= {(4 * num_inside) / num_samples}")
        print(f"Finished in: {time.time() - start}s")


if __name__ == "__main__":
    n = 1000000
    # n = 10000000
    cores = 8
    mpi = False  # True when run with mpiexec
    approximate_pi_parallel(n, cores, mpi)
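In case it's useful, the hardcoded cores/mpi values could be swapped for the --ncores/--mpi command-line options used elsewhere in this thread by replacing the __main__ block with something like this (a sketch; the option names are just a convention, not part of the script above):

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ncores", type=int, default=1,
                        help="number of processes for the multiprocessing pool")
    parser.add_argument("--mpi", action="store_true",
                        help="use the MPI pool (run under mpiexec)")
    args = parser.parse_args()
    approximate_pi_parallel(1000000, args.ncores, args.mpi)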
Issue solved, thank you!
Hi,
I'm testing the performance of both pools with a simple demo script.
These are the results running on a single Linux machine with 32 cores, with
python script-demo.py --ncores 32
and
mpiexec -n 32 python script-demo.py --mpi
respectively:
n = 100000 MultiPool : 0.03s MPIPool : 0.58s
n = 1000000 MultiPool : 0.22s MPIPool : 6.65s
n = 10000000 MultiPool : 2.37s MPIPool : 68.76s
I've also run it on OSX, with similarly large gaps.
I understand that MPI may introduce some extra overhead, but are these large differences to be expected?