SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

Strange CPU time usage with multiple MPI processes on tutorial example thermal_plasma_1d.py #36

Closed: dsbertini closed this issue 6 years ago

dsbertini commented 6 years ago

Hi, I tried to run a simulation example on a single machine. I noticed that when using more than one MPI process, the total time needed to compute a full simulation does not scale as expected. I used the tutorial example "thermal_plasma_1d.py" and only changed the duration of the simulation from 1024 to 10 (673 steps). I got the following results:

| MPI processes | CPU time [s] |
|---------------|--------------|
| 1             | 25           |
| 2             | 12           |
| 3             | 113 (!)      |
| 4             | 118 (!)      |

So up to 2 MPI processes the program scales perfectly, but as soon as more than 2 MPI processes are used, the CPU time increases and the scalability seems to be gone. What could be the problem here?

My machine configuration is:

```
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Stepping:              1
CPU MHz:               1760.250
CPU max MHz:           3300.0000
CPU min MHz:           1200.0000
BogoMIPS:              4801.81
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0-13,28-41
NUMA node1 CPU(s):     14-27,42-55
```

beck-llr commented 6 years ago

Hi! How many OpenMP threads do you use? This behaviour is typical of oversubscribing the cores of your CPU, either because you run too many threads per core and use hyperthreading, or simply because you pin your OpenMP threads to the same core. Make sure your OpenMP threads are correctly pinned. When the number is not specified, some systems use the maximum number of OpenMP threads, which leads to bad performance. If you want to run the code without OpenMP at all and just do MPI checks, you can compile the code with the option config=nopenmp. Let us know how that goes.
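For concreteness, a minimal sketch of both checks, assuming a bash shell, a generic `mpirun` launcher, and the `smilei` binary in the current directory (the process count, paths, and launcher flags are placeholders for your setup):

```bash
# Check 1: keep the hybrid build but force one OpenMP thread per MPI process,
# so that only the MPI scaling is measured.
export OMP_NUM_THREADS=1
mpirun -np 4 ./smilei thermal_plasma_1d.py

# Check 2: rebuild without OpenMP support and run a pure-MPI test.
make clean
make config=nopenmp
mpirun -np 4 ./smilei thermal_plasma_1d.py
```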

jderouillat commented 6 years ago

I confirm Arnaud's comment. You'll find below the behaviour observed on a double E5-2670 architecture (fixing OMP_NUM_THREADS to 1).
It was run with IntelMPI, which handles MPI process affinity sensibly without it being specified.

| Number of MPI processes | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| Time loop [s] | 25.06 | 12.53 | 6.36 | 3.38 | 1.86 |
| Particles [s] | 24.51 | 12.22 | 6.13 | 3.20 | 1.70 |
| Maxwell [s] | 0.22 | 0.09 | 0.05 | 0.03 | 0.02 |
| Sync Particles [s] | 0.11 | 0.05 | 0.04 | 0.03 | 0.03 |
| Sync Fields [s] | 0.00 | 0.01 | 0.01 | 0.03 | 0.03 |
| Sync Densities [s] | 0.01 | 0.04 | 0.06 | 0.05 | 0.05 |
| Efficiency | 100.00% | 100.00% | 98.57% | 92.68% | 84.12% |
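For reference, the pinning can also be made explicit instead of relying on the launcher's defaults. A sketch for Intel MPI, using its standard `I_MPI_PIN*` controls (recent Intel MPI versions pin by default, so this only makes the behaviour deterministic; the binary path and rank count are placeholders):

```bash
export OMP_NUM_THREADS=1       # one thread per rank, as in the runs above
export I_MPI_PIN=1             # enable process pinning (usually on by default)
export I_MPI_PIN_DOMAIN=core   # give each MPI rank its own physical core
mpirun -np 8 ./smilei thermal_plasma_1d.py
```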

dsbertini commented 6 years ago

OK, you are right. For some reason, when OMP_NUM_THREADS is not set, the scalability problem I mentioned in this issue shows up as soon as more than 2 MPI processes are used. So one has to set this environment variable explicitly. With it set, the speed-up is clear and linear for n = 1 or 2 threads, but for higher thread counts I see no real improvement. Maybe some other hybrid MPI-OpenMP effects?

beck-llr commented 6 years ago

As I said in my first message, if you do not set OMP_NUM_THREADS, the system sometimes uses a nonsensical value. The SMILEI simulation log tells you how many OpenMP threads are being used, so you can check this.
You should be able to reproduce an almost linear improvement with OpenMP as well, but it is more difficult to achieve because you have to pin your threads to the cores properly.
The first thing we advise is not to use OpenMP across different sockets, so make sure you allocate at least one MPI process per socket. Then allocate at least as many cores to each MPI process as it has OpenMP threads.
It is often good to bind threads to cores.
And finally, make sure your threads are spread across your physical cores and not stacked on top of each other on the same core because of hyperthreading.

All of this is completely independent of SMILEI and is just standard good practice for hybrid codes on many-core systems. It is difficult to tell you exactly how to do it because it depends on your environment, compiler, MPI version, etc.
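As one illustration only, here is what these recommendations could look like with Open MPI and the standard OpenMP affinity variables, on the dual 14-core machine described above. Other launchers use different flags, so treat this as a sketch rather than a recipe:

```bash
# One MPI process per socket, each owning the 14 physical cores of its socket.
export OMP_NUM_THREADS=14
export OMP_PLACES=cores      # one place per physical core (avoids hyperthread stacking)
export OMP_PROC_BIND=close   # keep threads on consecutive cores inside the rank's domain

# Open MPI syntax: 1 process per socket, 14 processing elements (cores) per process.
mpirun --map-by ppr:1:socket:pe=14 ./smilei thermal_plasma_1d.py
```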

dsbertini commented 6 years ago

Thanks a lot for the detailed explanation.

beck-llr commented 6 years ago

Anytime. Hope this will help you get the most out of SMILEI. Let us know if we can be of further help.