ddemidov / amgcl_benchmarks

Code accompanying AMGCL benchmarks
http://amgcl.readthedocs.io/en/latest/benchmarks.html

Parallel Efficiency Benchmarks #3

Closed garyyan123 closed 4 years ago

garyyan123 commented 4 years ago

Hi Denis,

I've been trying out the benchmarks for the 3D Poisson case (3,375,000 unknowns, 23,490,000 nonzeros) using OpenMP, basically the C++ code in shared_mem/poisson/amgcl.cpp with AMGCL 1.2.0.

The parallel efficiency of the solve time doesn't seem to be very good: I see at most a ~2x speedup (down to ~6 seconds at best) going from 1 to 6 OpenMP threads, whereas the results on the AMGCL website suggest the speedup should be >4x with 6 MPI processes.

Would you have any suggestions, or could you share more details on the architecture/compiler used for the 3D Poisson benchmarks? Perhaps some of these differences could explain the discrepancy.

Other setup details:

- i7-8700 CPU, 6 cores, 64 GB memory
- Windows, Visual Studio C++ compiler 2013/2019 (similar performance), with built-in OpenMP
- Boost 1.62/1.72 (similar performance)
- solver as in the code (smoothed aggregation, spai0, bicgstabl(2))
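For reference, the solver combination listed above (smoothed aggregation coarsening, SPAI(0) smoother, BiCGStab(L) with L=2) corresponds roughly to the following AMGCL type declaration. This is a sketch against the builtin shared-memory (OpenMP) backend; it assumes the AMGCL headers are on the include path and follows the component names from the AMGCL documentation:

```cpp
#include <amgcl/backend/builtin.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/smoothed_aggregation.hpp>
#include <amgcl/relaxation/spai0.hpp>
#include <amgcl/solver/bicgstabl.hpp>

// Shared-memory (OpenMP) backend over double precision.
typedef amgcl::backend::builtin<double> Backend;

// AMG preconditioner (smoothed aggregation + SPAI(0) relaxation)
// combined with a BiCGStab(L) iterative solver; L defaults to 2.
typedef amgcl::make_solver<
    amgcl::amg<
        Backend,
        amgcl::coarsening::smoothed_aggregation,
        amgcl::relaxation::spai0
    >,
    amgcl::solver::bicgstabl<Backend>
> Solver;
```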

We're using the Python bindings in the final application, but we wanted to improve the parallel performance on the C++ side first. Appreciate the help. Regards,

Gary

ddemidov commented 4 years ago

AMGCL, like most iterative methods, is memory-bound. That is, its performance is limited by the available memory bandwidth rather than by the number of CPU cores. I believe the main difference between our results is hardware. The results in the amgcl documentation are reported for a dual-socket system with two Intel Xeon E5-2640 v3 CPUs, each of which has 4 memory channels:

(screenshot of the E5-2640 v3 specification showing its 4 memory channels)

So the theoretical speedup limit on that system is around 8x (2 sockets x 4 channels). Your CPU has 2 memory channels, which agrees with the ~2x speedup you observed.

You could also confirm this by running the STREAM benchmark on your system, which specifically measures the available memory bandwidth.

garyyan123 commented 4 years ago

Thanks for pointing this out; we will pursue this direction further.