The warning is common, but it might indicate a wrong setup.
We will need more details about the configuration. Could you provide a job submission file, or copy-paste the commands you use?
There can be many reasons for poor performance on Marconi:
KNL nodes are designed for parallelism: a single-core, single-thread simulation cannot be efficient on this architecture. You can have a look at /proc/cpuinfo in a batch job; here is an extract of this file from Frioul, a French KNL system. A single KNL core is indeed slow compared to your laptop:
model name : Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
To be efficient on KNL, we advise using 2 threads per core. Can you check that this is enabled on Marconi? I heard that it was not the case at the beginning of Marconi's life.
If the number of particles per cell is low, the expected performance is between Haswell and Broadwell levels.
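For instance, a quick way to check this is to count the logical processors on an actual compute node. This is only a sketch: the account and partition names are copied from the job scripts below, and the expected counts assume the standard 68-core Xeon Phi 7250.
# Run on a KNL compute node, not on the login node (illustrative, adjust account/partition/time):
srun --account=IscrC_ENV-LWFA --partition=knl_usr_prod -N 1 -n 1 -t 00:01:00 grep -c '^processor' /proc/cpuinfo
# 272 logical CPUs (68 cores x 4 hardware threads) means hyper-threading is enabled;
# 68 means only one thread per core is exposed.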
Sorry for the late reply; here I try to give you some information. I hope it's what you need, otherwise just let me know.
The /proc/cpuinfo output on Marconi is:
model name : Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
About the simulation, I worked with the input file beam_2d.py as provided in the tutorial section.
In the first case, the batch script I use to launch it is:
#SBATCH --account=IscrC_ENV-LWFA
#SBATCH --time=00:10:00 # 10 minutes
#SBATCH -N1 -n1 # 1 node, 1 task
#SBATCH --error job.err
#SBATCH --output job.out
#SBATCH --partition=knl_usr_prod
mpirun -np 1 /marconi/home/userexternal/dterzani/Pic/Smilei/smilei_test beam_2d.py >> opic1.log 2>> epic1.log
From this script I get a successful run, with output:
[Smilei ASCII-art banner] Version : v4.1-250-g7f68ece-master
Reading the simulation parameters
--------------------------------------------------------------------------------
HDF5 version 1.8.18
Python version 2.7.15
Parsing pyinit.py
Parsing v4.1-250-g7f68ece-master
Parsing pyprofiles.py
Parsing beam_2d.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist
Calling python _smilei_check
Calling python _prepare_checkpoint_dir
[WARNING] Change patches distribution to hilbertian
[WARNING] simulation_time has been redefined from 10.000000 to 9.992303 to match timestep.
Geometry: 2Dcartesian
--------------------------------------------------------------------------------
Interpolation order : 2
Maxwell solver : Yee
(Time resolution, Total simulation time) : (95.273335, 9.992303)
(Total number of iterations, timestep) : (952, 0.010496)
timestep = 0.950000 * CFL
dimension 0 - (Spatial resolution, Grid length) : (64.000000, 4.000000)
- (Number of cells, Cell length) : (256, 0.015625)
- Electromagnetic boundary conditions: (periodic, periodic)
dimension 1 - (Spatial resolution, Grid length) : (64.000000, 4.000000)
- (Number of cells, Cell length) : (256, 0.015625)
- Electromagnetic boundary conditions: (periodic, periodic)
Vectorization:
--------------------------------------------------------------------------------
Mode: off
Patch arrangement :
--------------------------------------------------------------------------------
Initializing MPI
--------------------------------------------------------------------------------
applied topology for periodic BCs in x-direction
applied topology for periodic BCs in y-direction
MPI_THREAD_MULTIPLE enabled
Number of MPI process : 1
Number of patches :
dimension 0 - number_of_patches : 32
dimension 1 - number_of_patches : 32
Patch size :
dimension 0 - n_space : 8 cells.
dimension 1 - n_space : 8 cells.
Dynamic load balancing: never
OpenMP
--------------------------------------------------------------------------------
Number of thread per MPI process : 1
Initializing the restart environment
--------------------------------------------------------------------------------
Initializing moving window
--------------------------------------------------------------------------------
Initializing particles & fields
--------------------------------------------------------------------------------
Creating Species : eon
Creating Species : pon
Laser parameters :
Adding particle walls:
Nothing to do
Initializing Patches
--------------------------------------------------------------------------------
First patch created
Approximately 10% of patches created
Approximately 20% of patches created
Approximately 30% of patches created
Approximately 40% of patches created
Approximately 50% of patches created
Approximately 60% of patches created
Approximately 70% of patches created
Approximately 80% of patches created
Approximately 90% of patches created
All patches created
Creating Diagnostics, antennas, and external fields
--------------------------------------------------------------------------------
Created ParticleBinning diagnostic #0: species eon
Axis x from 0 to 4 in 200 steps
Axis y from 0 to 4 in 200 steps
Created performances diagnostic
Done initializing diagnostics, antennas, and external fields
Applying external fields at time t = 0
--------------------------------------------------------------------------------
Initializing diagnostics
--------------------------------------------------------------------------------
Running diags at time t = 0
--------------------------------------------------------------------------------
Species creation summary
--------------------------------------------------------------------------------
Species 0 (eon) created with 70700 particles
Species 1 (pon) created with 70700 particles
Patch arrangement :
--------------------------------------------------------------------------------
Memory consumption
--------------------------------------------------------------------------------
(Master) Species part = 6 MB
Global Species part = 0.007 GB
Max Species part = 6 MB
(Master) Fields part = 17 MB
Global Fields part = 0.017 GB
Max Fields part = 17 MB
(Master) ParticleBinning0.h5 = 0 MB
Global ParticleBinning0.h5 = 0.000 GB
Max ParticleBinning0.h5 = 0 MB
Expected disk usage (approximate)
--------------------------------------------------------------------------------
WARNING: disk usage by non-uniform particles maybe strongly underestimated,
especially when particles are created at runtime (ionization, pair generation, etc.)
Expected disk usage for diagnostics:
File Performances.h5: 200.48 K
File scalars.txt: 0 bytes
File ParticleBinning0.h5: 29.36 M
Total disk usage for diagnostics: 29.55 M
Cleaning up python runtime environement
--------------------------------------------------------------------------------
Checking for cleanup() function:
python cleanup function does not exist
Calling python _keep_python_running() :
Closing Python
Time-Loop started: number of time-steps n_time = 952
--------------------------------------------------------------------------------
timestep sim time cpu time [s] ( diff [s] )
95/952 1.0024e+00 1.5198e+01 ( 1.5198e+01 )
190/952 1.9995e+00 3.0516e+01 ( 1.5318e+01 )
285/952 2.9966e+00 4.5716e+01 ( 1.5200e+01 )
380/952 3.9938e+00 6.0966e+01 ( 1.5250e+01 )
475/952 4.9909e+00 7.6214e+01 ( 1.5248e+01 )
570/952 5.9880e+00 9.1646e+01 ( 1.5432e+01 )
665/952 6.9852e+00 1.0689e+02 ( 1.5241e+01 )
760/952 7.9823e+00 1.2225e+02 ( 1.5359e+01 )
855/952 8.9794e+00 1.3768e+02 ( 1.5434e+01 )
950/952 9.9766e+00 1.5312e+02 ( 1.5437e+01 )
End time loop, time dual = 9.998
--------------------------------------------------------------------------------
Time profiling : (print time > 0.001%)
--------------------------------------------------------------------------------
Time in time loop : 153.434 96.824% coverage
Particles 66.866 43.580%
Maxwell 13.605 8.867%
Diagnostics 3.990 2.600%
Collisions 0.008 <1%
Sync Particles 46.295 30.173%
Sync Fields 6.250 4.073%
Sync Densities 11.240 7.326%
Printed times are averaged per MPI process
See advanced metrics in profil.txt
Diagnostics profile :
scalars.txt 0.604
ParticleBinning0.h5 1.372
Performances.h5 2.041
END
--------------------------------------------------------------------------------
and the error log contains
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
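As a side note, this warning only means that the Intel-specific KMP_AFFINITY variable is already defined in the environment, so the standard OMP_PROC_BIND setting is ignored. A minimal sketch of how the affinity settings could be made explicit in the job script (illustrative values, not part of the original submission):
# Option 1: keep the Intel variable and set it explicitly
export KMP_AFFINITY=granularity=fine,compact
# Option 2: unset it so that the standard OpenMP variables take effect
# unset KMP_AFFINITY
# export OMP_PROC_BIND=true
# export OMP_PLACES=cores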
In the second case, I use the script:
#SBATCH --account=IscrC_ENV-LWFA # put the name of your project
#SBATCH --time=00:05:00 # 5 minutes
#SBATCH -N1 -n16 # 1 node, 16 tasks
#SBATCH --error job.err
#SBATCH --output job.out
#SBATCH --partition=knl_usr_prod
mpirun -np 1 /marconi/home/userexternal/dterzani/Pic/Smilei/smilei beam_2d.py >> opic2.log 2>> epic2.log
which runs much slower than the previous case (in fact, it does not finish before the wall time). The output is:
[Smilei ASCII-art banner] Version : v4.1-250-g7f68ece-master
Reading the simulation parameters
--------------------------------------------------------------------------------
HDF5 version 1.8.18
Python version 2.7.15
Parsing pyinit.py
Parsing v4.1-250-g7f68ece-master
Parsing pyprofiles.py
Parsing beam_2d.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist
Calling python _smilei_check
Calling python _prepare_checkpoint_dir
[WARNING] Change patches distribution to hilbertian
[WARNING] simulation_time has been redefined from 10.000000 to 9.992303 to match timestep.
Geometry: 2Dcartesian
--------------------------------------------------------------------------------
Interpolation order : 2
Maxwell solver : Yee
(Time resolution, Total simulation time) : (95.273335, 9.992303)
(Total number of iterations, timestep) : (952, 0.010496)
timestep = 0.950000 * CFL
dimension 0 - (Spatial resolution, Grid length) : (64.000000, 4.000000)
- (Number of cells, Cell length) : (256, 0.015625)
- Electromagnetic boundary conditions: (periodic, periodic)
dimension 1 - (Spatial resolution, Grid length) : (64.000000, 4.000000)
- (Number of cells, Cell length) : (256, 0.015625)
- Electromagnetic boundary conditions: (periodic, periodic)
Vectorization:
--------------------------------------------------------------------------------
Mode: off
Patch arrangement :
--------------------------------------------------------------------------------
Initializing MPI
--------------------------------------------------------------------------------
applied topology for periodic BCs in x-direction
applied topology for periodic BCs in y-direction
MPI_THREAD_MULTIPLE enabled
Number of MPI process : 1
Number of patches :
dimension 0 - number_of_patches : 32
dimension 1 - number_of_patches : 32
Patch size :
dimension 0 - n_space : 8 cells.
dimension 1 - n_space : 8 cells.
Dynamic load balancing: never
OpenMP
--------------------------------------------------------------------------------
Number of thread per MPI process : 16
Initializing the restart environment
--------------------------------------------------------------------------------
Initializing moving window
--------------------------------------------------------------------------------
Initializing particles & fields
--------------------------------------------------------------------------------
Creating Species : eon
Creating Species : pon
Laser parameters :
Adding particle walls:
Nothing to do
Initializing Patches
--------------------------------------------------------------------------------
First patch created
Approximately 10% of patches created
Approximately 20% of patches created
Approximately 30% of patches created
Approximately 40% of patches created
Approximately 50% of patches created
Approximately 60% of patches created
Approximately 70% of patches created
Approximately 80% of patches created
Approximately 90% of patches created
All patches created
Creating Diagnostics, antennas, and external fields
--------------------------------------------------------------------------------
Created ParticleBinning diagnostic #0: species eon
Axis x from 0 to 4 in 200 steps
Axis y from 0 to 4 in 200 steps
Created performances diagnostic
Done initializing diagnostics, antennas, and external fields
Applying external fields at time t = 0
--------------------------------------------------------------------------------
Initializing diagnostics
--------------------------------------------------------------------------------
Running diags at time t = 0
--------------------------------------------------------------------------------
Species creation summary
--------------------------------------------------------------------------------
Species 0 (eon) created with 70700 particles
Species 1 (pon) created with 70700 particles
Patch arrangement :
--------------------------------------------------------------------------------
Memory consumption
--------------------------------------------------------------------------------
(Master) Species part = 6 MB
Global Species part = 0.007 GB
Max Species part = 6 MB
(Master) Fields part = 17 MB
Global Fields part = 0.017 GB
Max Fields part = 17 MB
(Master) ParticleBinning0.h5 = 0 MB
Global ParticleBinning0.h5 = 0.000 GB
Max ParticleBinning0.h5 = 0 MB
Expected disk usage (approximate)
--------------------------------------------------------------------------------
WARNING: disk usage by non-uniform particles maybe strongly underestimated,
especially when particles are created at runtime (ionization, pair generation, etc.)
Expected disk usage for diagnostics:
File Performances.h5: 200.48 K
File scalars.txt: 0 bytes
File ParticleBinning0.h5: 29.36 M
Total disk usage for diagnostics: 29.55 M
Cleaning up python runtime environement
--------------------------------------------------------------------------------
Checking for cleanup() function:
python cleanup function does not exist
Calling python _keep_python_running() :
Closing Python
Time-Loop started: number of time-steps n_time = 952
--------------------------------------------------------------------------------
timestep sim time cpu time [s] ( diff [s] )
95/952 1.0024e+00 1.0907e+02 ( 1.0907e+02 )
190/952 1.9995e+00 2.1902e+02 ( 1.0995e+02 )
and the error is the same as before. I wanted to understand whether I was setting up the simulation wrongly (not optimized for the machine), or whether some environment variable already set on Marconi prevents good parallel scaling for Smilei.
Hi,
For the first question, I think that you are looking at the /proc/cpuinfo of an interactive node, which may not be the same as the compute nodes.
For the behavior with 16 threads, I think there is some confusion in the resource usage. Specifying the resources directly for Slurm with the appropriate parameters, as below, should make srun work correctly:
#SBATCH --account=IscrC_ENV-LWFA # put the name of your project
#SBATCH --time=00:05:00 # 10 minutes
#SBATCH -N 1 # 1 node
#SBATCH -n 1 # 1 MPI
#SBATCH -c 16 # 16 threads per MPI
#SBATCH --error job.err
#SBATCH --output job.out
#SBATCH --partition=knl_usr_prod
srun /marconi/home/userexternal/dterzani/Pic/Smilei/smilei beam_2d.py >> opic2.log 2>> epic2.log
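One additional point, added here as a hedged note: -c 16 only reserves the CPUs with Slurm, while the OpenMP runtime still reads OMP_NUM_THREADS, so it is usually safest to export it explicitly before the srun line, for example:
export OMP_NUM_THREADS=16     # match the -c value above
export OMP_SCHEDULE=dynamic   # dynamic scheduling is often suggested for Smilei's particle loops; check the Smilei documentation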
In any case, beam_2d.py is a very light test for parallelism (it is very small and very local).
Thank you, apparently the problem was passing the -c flag correctly. I have now run some tests, and it seems the problem is solved: both OpenMP and MPI boost the simulation speed!
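For reference, a hybrid MPI + OpenMP submission along the same lines could look like the sketch below. It is purely illustrative: the 4 tasks x 17 threads split assumes the 68-core Xeon Phi 7250 nodes of Marconi and should be checked against the machine documentation, and the log-file names are invented.
#SBATCH --account=IscrC_ENV-LWFA
#SBATCH --time=00:10:00
#SBATCH -N 1                  # 1 node
#SBATCH -n 4                  # 4 MPI tasks
#SBATCH -c 17                 # 17 cores per task (4 x 17 = 68 physical cores, illustrative)
#SBATCH --partition=knl_usr_prod
export OMP_NUM_THREADS=17     # possibly 34 with 2 hardware threads per core, if -c is adjusted accordingly
srun /marconi/home/userexternal/dterzani/Pic/Smilei/smilei beam_2d.py >> opic_hybrid.log 2>> epic_hybrid.log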
When running on CINECA's Marconi (KNL architecture), I noticed that the code runs but loses performance whenever it is parallelized. In particular, I tried the beam_2d.py tutorial both on Marconi and on my laptop (as a benchmark). While increasing the OMP threads and/or the MPI processes on my laptop works as explained in the tutorial, on Marconi I get
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
and the performance worsens when increasing either OMP or MPI. With 1 thread and 1 MPI process the simulation time is comparable with my laptop's; increasing even one of them (or both) makes the code really slow.
Could it be somehow related to the warning?