The warning is common, but it might indicate a wrong setup.
We will need more details about the configuration. Could you provide a job submission file, or copy-paste the commands you use?
There can be many reasons for poor performance on Marconi:
KNL nodes are designed for parallelism: a single-core, single-thread simulation cannot be efficient on this architecture. You can have a look at /proc/cpuinfo in a batch job; here is an extract of this file from Frioul, a French KNL system. A single KNL core is indeed slow compared to your laptop:
model name : Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
To be efficient on KNL, we advise using 2 threads per core. Can you check that this is enabled on Marconi? I heard that it was not the case at the beginning of Marconi's life.
If the number of particles per cell is low, the expected performance is between Haswell and Broadwell levels.
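For instance, a quick way to check this is to count the logical processors on an actual compute node. This is only a sketch: the account and partition names are copied from the job scripts below, and the expected counts assume the standard 68-core Xeon Phi 7250.
# Run on a KNL compute node, not on the login node (illustrative, adjust account/partition/time):
srun --account=IscrC_ENV-LWFA --partition=knl_usr_prod -N 1 -n 1 -t 00:01:00 grep -c '^processor' /proc/cpuinfo
# 272 logical CPUs (68 cores x 4 hardware threads) means hyper-threading is enabled;
# 68 means only one thread per core is exposed.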
Sorry for the late reply; here I try to give you some information. I hope it's what you need, otherwise just let me know.
The /proc/cpuinfo output on Marconi is:
model name : Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
About the simulation, I worked with the input file beam_2d.py as provided in the tutorial section.
In the first case, the batch script I use to launch it is:
#SBATCH --account=IscrC_ENV-LWFA
#SBATCH --time=00:10:00 # 10 minutes
#SBATCH -N1 -n1 # 1 node, 1 task
#SBATCH --error job.err
#SBATCH --output job.out
#SBATCH --partition=knl_usr_prod
mpirun -np 1 /marconi/home/userexternal/dterzani/Pic/Smilei/smilei_test beam_2d.py >> opic1.log 2>> epic1.log
From this script I get a successful run, with output:
[Smilei ASCII-art banner] Version : v4.1-250-g7f68ece-master
Reading the simulation parameters
--------------------------------------------------------------------------------
HDF5 version 1.8.18
Python version 2.7.15
Parsing pyinit.py
Parsing v4.1-250-g7f68ece-master
Parsing pyprofiles.py
Parsing beam_2d.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist
Calling python _smilei_check
Calling python _prepare_checkpoint_dir
[WARNING] Change patches distribution to hilbertian
[WARNING] simulation_time has been redefined from 10.000000 to 9.992303 to match timestep.
Geometry: 2Dcartesian
--------------------------------------------------------------------------------
Interpolation order : 2
Maxwell solver : Yee
(Time resolution, Total simulation time) : (95.273335, 9.992303)
(Total number of iterations, timestep) : (952, 0.010496)
timestep = 0.950000 * CFL
dimension 0 - (Spatial resolution, Grid length) : (64.000000, 4.000000)
- (Number of cells, Cell length) : (256, 0.015625)
- Electromagnetic boundary conditions: (periodic, periodic)
dimension 1 - (Spatial resolution, Grid length) : (64.000000, 4.000000)
- (Number of cells, Cell length) : (256, 0.015625)
- Electromagnetic boundary conditions: (periodic, periodic)
Vectorization:
--------------------------------------------------------------------------------
Mode: off
Patch arrangement :
--------------------------------------------------------------------------------
Initializing MPI
--------------------------------------------------------------------------------
applied topology for periodic BCs in x-direction
applied topology for periodic BCs in y-direction
MPI_THREAD_MULTIPLE enabled
Number of MPI process : 1
Number of patches :
dimension 0 - number_of_patches : 32
dimension 1 - number_of_patches : 32
Patch size :
dimension 0 - n_space : 8 cells.
dimension 1 - n_space : 8 cells.
Dynamic load balancing: never
OpenMP
--------------------------------------------------------------------------------
Number of thread per MPI process : 1
Initializing the restart environment
--------------------------------------------------------------------------------
Initializing moving window
--------------------------------------------------------------------------------
Initializing particles & fields
--------------------------------------------------------------------------------
Creating Species : eon
Creating Species : pon
Laser parameters :
Adding particle walls:
Nothing to do
Initializing Patches
--------------------------------------------------------------------------------
First patch created
Approximately 10% of patches created
Approximately 20% of patches created
Approximately 30% of patches created
Approximately 40% of patches created
Approximately 50% of patches created
Approximately 60% of patches created
Approximately 70% of patches created
Approximately 80% of patches created
Approximately 90% of patches created
All patches created
Creating Diagnostics, antennas, and external fields
--------------------------------------------------------------------------------
Created ParticleBinning diagnostic #0: species eon
Axis x from 0 to 4 in 200 steps
Axis y from 0 to 4 in 200 steps
Created performances diagnostic
Done initializing diagnostics, antennas, and external fields
Applying external fields at time t = 0
--------------------------------------------------------------------------------
Initializing diagnostics
--------------------------------------------------------------------------------
Running diags at time t = 0
--------------------------------------------------------------------------------
Species creation summary
--------------------------------------------------------------------------------
Species 0 (eon) created with 70700 particles
Species 1 (pon) created with 70700 particles
Patch arrangement :
--------------------------------------------------------------------------------
Memory consumption
--------------------------------------------------------------------------------
(Master) Species part = 6 MB
Global Species part = 0.007 GB
Max Species part = 6 MB
(Master) Fields part = 17 MB
Global Fields part = 0.017 GB
Max Fields part = 17 MB
(Master) ParticleBinning0.h5 = 0 MB
Global ParticleBinning0.h5 = 0.000 GB
Max ParticleBinning0.h5 = 0 MB
Expected disk usage (approximate)
--------------------------------------------------------------------------------
WARNING: disk usage by non-uniform particles maybe strongly underestimated,
especially when particles are created at runtime (ionization, pair generation, etc.)
Expected disk usage for diagnostics:
File Performances.h5: 200.48 K
File scalars.txt: 0 bytes
File ParticleBinning0.h5: 29.36 M
Total disk usage for diagnostics: 29.55 M
Cleaning up python runtime environement
--------------------------------------------------------------------------------
Checking for cleanup() function:
python cleanup function does not exist
Calling python _keep_python_running() :
Closing Python
Time-Loop started: number of time-steps n_time = 952
--------------------------------------------------------------------------------
timestep sim time cpu time [s] ( diff [s] )
95/952 1.0024e+00 1.5198e+01 ( 1.5198e+01 )
190/952 1.9995e+00 3.0516e+01 ( 1.5318e+01 )
285/952 2.9966e+00 4.5716e+01 ( 1.5200e+01 )
380/952 3.9938e+00 6.0966e+01 ( 1.5250e+01 )
475/952 4.9909e+00 7.6214e+01 ( 1.5248e+01 )
570/952 5.9880e+00 9.1646e+01 ( 1.5432e+01 )
665/952 6.9852e+00 1.0689e+02 ( 1.5241e+01 )
760/952 7.9823e+00 1.2225e+02 ( 1.5359e+01 )
855/952 8.9794e+00 1.3768e+02 ( 1.5434e+01 )
950/952 9.9766e+00 1.5312e+02 ( 1.5437e+01 )
End time loop, time dual = 9.998
--------------------------------------------------------------------------------
Time profiling : (print time > 0.001%)
--------------------------------------------------------------------------------
Time in time loop : 153.434 96.824% coverage
Particles 66.866 43.580%
Maxwell 13.605 8.867%
Diagnostics 3.990 2.600%
Collisions 0.008 <1%
Sync Particles 46.295 30.173%
Sync Fields 6.250 4.073%
Sync Densities 11.240 7.326%
Printed times are averaged per MPI process
See advanced metrics in profil.txt
Diagnostics profile :
scalars.txt 0.604
ParticleBinning0.h5 1.372
Performances.h5 2.041
END
--------------------------------------------------------------------------------
and the error log contains
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
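As a side note, this warning only means that the Intel-specific KMP_AFFINITY variable is already defined in the environment, so the standard OMP_PROC_BIND setting is ignored. A minimal sketch of how the affinity settings could be made explicit in the job script (illustrative values, not part of the original submission):
# Option 1: keep the Intel variable and set it explicitly
export KMP_AFFINITY=granularity=fine,compact
# Option 2: unset it so that the standard OpenMP variables take effect
# unset KMP_AFFINITY
# export OMP_PROC_BIND=true
# export OMP_PLACES=cores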
In the second case, I use the script:
#SBATCH --account=IscrC_ENV-LWFA # put the name of your project
#SBATCH --time=00:05:00 # 5 minutes
#SBATCH -N1 -n16 # 1 node, 16 tasks
#SBATCH --error job.err
#SBATCH --output job.out
#SBATCH --partition=knl_usr_prod
mpirun -np 1 /marconi/home/userexternal/dterzani/Pic/Smilei/smilei beam_2d.py >> opic2.log 2>> epic2.log
which runs much slower than the previous case (in fact, it does not finish before the wall time). The output is:
[Smilei ASCII-art banner] Version : v4.1-250-g7f68ece-master
Reading the simulation parameters
--------------------------------------------------------------------------------
HDF5 version 1.8.18
Python version 2.7.15
Parsing pyinit.py
Parsing v4.1-250-g7f68ece-master
Parsing pyprofiles.py
Parsing beam_2d.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist
Calling python _smilei_check
Calling python _prepare_checkpoint_dir
[WARNING] Change patches distribution to hilbertian
[WARNING] simulation_time has been redefined from 10.000000 to 9.992303 to match timestep.
Geometry: 2Dcartesian
--------------------------------------------------------------------------------
Interpolation order : 2
Maxwell solver : Yee
(Time resolution, Total simulation time) : (95.273335, 9.992303)
(Total number of iterations, timestep) : (952, 0.010496)
timestep = 0.950000 * CFL
dimension 0 - (Spatial resolution, Grid length) : (64.000000, 4.000000)
- (Number of cells, Cell length) : (256, 0.015625)
- Electromagnetic boundary conditions: (periodic, periodic)
dimension 1 - (Spatial resolution, Grid length) : (64.000000, 4.000000)
- (Number of cells, Cell length) : (256, 0.015625)
- Electromagnetic boundary conditions: (periodic, periodic)
Vectorization:
--------------------------------------------------------------------------------
Mode: off
Patch arrangement :
--------------------------------------------------------------------------------
Initializing MPI
--------------------------------------------------------------------------------
applied topology for periodic BCs in x-direction
applied topology for periodic BCs in y-direction
MPI_THREAD_MULTIPLE enabled
Number of MPI process : 1
Number of patches :
dimension 0 - number_of_patches : 32
dimension 1 - number_of_patches : 32
Patch size :
dimension 0 - n_space : 8 cells.
dimension 1 - n_space : 8 cells.
Dynamic load balancing: never
OpenMP
--------------------------------------------------------------------------------
Number of thread per MPI process : 16
Initializing the restart environment
--------------------------------------------------------------------------------
Initializing moving window
--------------------------------------------------------------------------------
Initializing particles & fields
--------------------------------------------------------------------------------
Creating Species : eon
Creating Species : pon
Laser parameters :
Adding particle walls:
Nothing to do
Initializing Patches
--------------------------------------------------------------------------------
First patch created
Approximately 10% of patches created
Approximately 20% of patches created
Approximately 30% of patches created
Approximately 40% of patches created
Approximately 50% of patches created
Approximately 60% of patches created
Approximately 70% of patches created
Approximately 80% of patches created
Approximately 90% of patches created
All patches created
Creating Diagnostics, antennas, and external fields
--------------------------------------------------------------------------------
Created ParticleBinning diagnostic #0: species eon
Axis x from 0 to 4 in 200 steps
Axis y from 0 to 4 in 200 steps
Created performances diagnostic
Done initializing diagnostics, antennas, and external fields
Applying external fields at time t = 0
--------------------------------------------------------------------------------
Initializing diagnostics
--------------------------------------------------------------------------------
Running diags at time t = 0
--------------------------------------------------------------------------------
Species creation summary
--------------------------------------------------------------------------------
Species 0 (eon) created with 70700 particles
Species 1 (pon) created with 70700 particles
Patch arrangement :
--------------------------------------------------------------------------------
Memory consumption
--------------------------------------------------------------------------------
(Master) Species part = 6 MB
Global Species part = 0.007 GB
Max Species part = 6 MB
(Master) Fields part = 17 MB
Global Fields part = 0.017 GB
Max Fields part = 17 MB
(Master) ParticleBinning0.h5 = 0 MB
Global ParticleBinning0.h5 = 0.000 GB
Max ParticleBinning0.h5 = 0 MB
Expected disk usage (approximate)
--------------------------------------------------------------------------------
WARNING: disk usage by non-uniform particles maybe strongly underestimated,
especially when particles are created at runtime (ionization, pair generation, etc.)
Expected disk usage for diagnostics:
File Performances.h5: 200.48 K
File scalars.txt: 0 bytes
File ParticleBinning0.h5: 29.36 M
Total disk usage for diagnostics: 29.55 M
Cleaning up python runtime environement
--------------------------------------------------------------------------------
Checking for cleanup() function:
python cleanup function does not exist
Calling python _keep_python_running() :
Closing Python
Time-Loop started: number of time-steps n_time = 952
--------------------------------------------------------------------------------
timestep sim time cpu time [s] ( diff [s] )
95/952 1.0024e+00 1.0907e+02 ( 1.0907e+02 )
190/952 1.9995e+00 2.1902e+02 ( 1.0995e+02 )
and the error is the same as before. I wanted to understand whether I was setting up the simulation wrongly (not optimized for the machine), or whether some environment variable already set on Marconi prevents good parallel scaling for Smilei.
Hi,
For the first question, I think that you are looking at the /proc/cpuinfo of an interactive node, which may not be the same as the compute nodes.
For the behavior with 16 threads, I think there is some confusion in the resource usage. Specifying the resources directly for Slurm with the appropriate parameters, as below, should make srun work correctly:
#SBATCH --account=IscrC_ENV-LWFA # put the name of your project
#SBATCH --time=00:05:00 # 10 minutes
#SBATCH -N 1 # 1 node
#SBATCH -n 1 # 1 MPI
#SBATCH -c 16 # 16 threads per MPI
#SBATCH --error job.err
#SBATCH --output job.out
#SBATCH --partition=knl_usr_prod
srun /marconi/home/userexternal/dterzani/Pic/Smilei/smilei beam_2d.py >> opic2.log 2>> epic2.log
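One additional point, added here as a hedged note: -c 16 only reserves the CPUs with Slurm, while the OpenMP runtime still reads OMP_NUM_THREADS, so it is usually safest to export it explicitly before the srun line, for example:
export OMP_NUM_THREADS=16     # match the -c value above
export OMP_SCHEDULE=dynamic   # dynamic scheduling is often suggested for Smilei's particle loops; check the Smilei documentation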
In any case, beam_2d.py is a very light test for parallelism (it is very small and very local).
Thank you, apparently the problem was passing the -c flag correctly. I have now run some tests, and it seems the problem is solved: both OpenMP and MPI boost the simulation speed!
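For reference, a hybrid MPI + OpenMP submission along the same lines could look like the sketch below. It is purely illustrative: the 4 tasks x 17 threads split assumes the 68-core Xeon Phi 7250 nodes of Marconi and should be checked against the machine documentation, and the log-file names are invented.
#SBATCH --account=IscrC_ENV-LWFA
#SBATCH --time=00:10:00
#SBATCH -N 1                  # 1 node
#SBATCH -n 4                  # 4 MPI tasks
#SBATCH -c 17                 # 17 cores per task (4 x 17 = 68 physical cores, illustrative)
#SBATCH --partition=knl_usr_prod
export OMP_NUM_THREADS=17     # possibly 34 with 2 hardware threads per core, if -c is adjusted accordingly
srun /marconi/home/userexternal/dterzani/Pic/Smilei/smilei beam_2d.py >> opic_hybrid.log 2>> epic_hybrid.log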
When running on CINECA's Marconi (KNL architecture), I noticed that the code runs but loses performance whenever it is parallelized. In particular, I tried the beam_2d.py tutorial both on Marconi and on my laptop (as a benchmark). While increasing the OMP threads and/or the MPI processes on my laptop works as explained in the tutorial, on Marconi I get
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
and the performance worsens when increasing either OMP or MPI. With 1 thread and 1 MPI process the simulation time is comparable with my laptop's; increasing even one of them (or both) makes the code really slow.
Could it be somehow related to the warning?