SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

Huge performance loss on CINECA Marconi-knl #101

Closed: titoiride closed this issue 5 years ago

titoiride commented 5 years ago

When running on CINECA (Marconi, KNL architecture), I noticed that the code runs but loses performance whenever it is parallelized. In particular, I tried the beam_2d.py tutorial both on Marconi and on my laptop (as a benchmark). While on my laptop increasing the OMP threads and/or the MPI processes, as explained in the tutorial, works, on Marconi I get

OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined

and the performance worsens when increasing either OMP or MPI. For 1 thread and 1 MPI process the simulation time is comparable with my laptop's; increasing even one of them (or both) results in the code being really slow.

Could it somehow be related to the warning?

mccoys commented 5 years ago

The warning is usual, but it might indicate a wrong setup.
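For context, that warning comes from the Intel OpenMP runtime: when KMP_AFFINITY is defined, the standard OMP_PROC_BIND / OMP_PLACES variables are ignored. A minimal sketch of the two ways to control thread binding in the job script, assuming bash and the Intel runtime (the affinity string below is only an example, not a recommended setting):

   # Option 1: remove the Intel-specific variable so the standard ones take effect
   unset KMP_AFFINITY
   export OMP_PROC_BIND=true        # pin threads
   export OMP_PLACES=cores          # one thread per physical core

   # Option 2: keep KMP_AFFINITY and set it explicitly instead
   # export KMP_AFFINITY=granularity=fine,compact,1,0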

We will need more details about the configuration. Could you provide a job submission file? Or copy-paste the commands you use?

jderouillat commented 5 years ago

There could be many reasons for a lack of performance on Marconi:

titoiride commented 5 years ago

Sorry for the late reply; here I try to give you some information. I hope it's the right information, otherwise just let me know.

jderouillat commented 5 years ago

Hi. For the first question, I think that you are looking at the procinfo of an interactive node, which may not be the same as the compute ones.

For the behavior with 16 threads, I think that there is some confusion in the resource usage. Specifying the resource usage for Slurm directly with the ad hoc parameters, as below, should make srun work correctly:

   #SBATCH --account=IscrC_ENV-LWFA    # put the name of your project
   #SBATCH --time=00:05:00               # 5 minutes
   #SBATCH -N 1                  # 1 node
   #SBATCH -n 1                   # 1 MPI
   #SBATCH -c 16                 # 16 threads per MPI
   #SBATCH --error  job.err
   #SBATCH --output job.out
   #SBATCH --partition=knl_usr_prod 

   srun /marconi/home/userexternal/dterzani/Pic/Smilei/smilei beam_2d.py >> opic2.log 2>> epic2.log
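The number of OpenMP threads still has to reach the runtime itself; a common pattern (a sketch, an assumption here rather than something the Marconi modules necessarily set for you) is to export it from the Slurm allocation before the srun line:

   export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # matches the -c value above
   export OMP_SCHEDULE=dynamic                   # dynamic scheduling often helps with PIC load imbalance
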
jderouillat commented 5 years ago

In any case, beam_2d.py is very cheap regarding parallelism (very small and very local).
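If one wants to check that the extra threads are really being used before moving to a larger case, a quick sanity check is to time the same namelist with different thread counts inside one allocation. A sketch only, assuming bash and the 16-core job script above; the ./smilei path is a placeholder:

   for t in 1 2 4 8 16 ; do
       export OMP_NUM_THREADS=$t
       echo "== $t threads ==" >> timings.txt
       { time srun -n 1 -c $t ./smilei beam_2d.py > scaling_${t}.log 2>&1 ; } 2>> timings.txt
   done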

titoiride commented 5 years ago

Thank you, apparently the problem was passing the -c flag correctly. I've now been doing some tests and it seems that the problem is solved and that both OpenMP and MPI boost the simulation speed!
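
For reference, once -c is passed correctly the same pattern extends to a hybrid multi-node run. A sketch only: the 4 ranks x 17 threads layout assumes the 68-core KNL nodes of the Marconi A2 partition, and my_namelist.py is a placeholder for a case large enough to distribute:

   #SBATCH --account=IscrC_ENV-LWFA      # put the name of your project
   #SBATCH --time=01:00:00
   #SBATCH -N 2                          # 2 nodes
   #SBATCH --ntasks-per-node=4           # 4 MPI ranks per node
   #SBATCH -c 17                         # 17 threads per rank (4 x 17 = 68 cores per node)
   #SBATCH --partition=knl_usr_prod

   export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
   srun ./smilei my_namelist.py > opic.log 2> epic.log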