SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

dump_minutes failing to correctly dump restart files #673

Closed DoubleAgentDave closed 10 months ago

DoubleAgentDave commented 10 months ago

dump_minutes works inconsistently. On several occasions the simulation dumps from only the first 15 MPI tasks, and then powers of 2 afterwards, ignoring all the other tasks. Obviously the code cannot restart from this situation.

[Image: dump times]

No errors are reported by the code or SLURM.
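One way to spot an incomplete dump before attempting a restart is to compare the rank indices present in the checkpoint directory with the expected number of MPI tasks. The sketch below assumes one file per task named like `dump-<dump number>-<rank>.h5`; that pattern, the helper name `missing_ranks`, and the example file names are assumptions for illustration, so adjust the regular expression to whatever your run actually writes.

```python
import re

# Assumed file-name pattern (hypothetical): dump-<dump number>-<rank>.h5
DUMP_RE = re.compile(r"dump-(\d+)-(\d+)\.h5$")

def missing_ranks(filenames, n_ranks, dump_number):
    """Return the sorted list of ranks that have no file for the given dump."""
    found = set()
    for name in filenames:
        m = DUMP_RE.search(name)
        if m and int(m.group(1)) == dump_number:
            found.add(int(m.group(2)))
    return sorted(set(range(n_ranks)) - found)

# Made-up example: only ranks 0-14, 16 and 32 wrote a file for dump 0.
present = list(range(15)) + [16, 32]
names = [f"dump-00000-{r:010d}.h5" for r in present]
gaps = missing_ranks(names, n_ranks=288, dump_number=0)
print(len(gaps), gaps[:3])
```

In practice the file names would come from something like `os.listdir()` on the checkpoint directory, and a non-empty result means the dump should not be used for a restart.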

An additional problem, also intermittent, is that even when the code seems to dump correctly, it can hang at the first timestep without quitting, just sitting there wasting cluster time :( Again, no errors are reported by SLURM or Smilei.

This behaviour happens with several versions of Smilei, up to the latest version. The libraries loaded on the cluster where this was tested are: 1) GCC/12.1.0 2) LAPACK/3.10.1 3) BLAS/3.10.0 4) BLIS/0.9 5) Python/3.9.0 6) UCX/1.10.0 7) OpenMPI/4.1.3 8) HDF5/1.12.0

I ran close to 100 of these, and this error happened around 15% of the time. The simulations were 1D and ran over 24 nodes with 12 tasks per node, for a total of 288 tasks with 2 threads per task. There are usually 8192 patches over the whole simulation, though this does not seem to change the outcome.

The code outputs (I deleted some boring bits):

 HDF5 version 1.12.0
 Python version 3.9.0
 Parsing pyinit.py
 Parsing ??-??
 Parsing pyprofiles.py
 Parsing broadband_high.py
 Parsing pycontrol.py
 Check for function preprocess()
 python preprocess function does not exist
 Calling python _smilei_check
 Calling python _prepare_checkpoint_dir
 Calling python _keep_python_running() :
 CAREFUL: Patches distribution: hilbertian
 WARNING src/Params/Params.cpp:1122 (compute) simulation_time has been redefined from 24167.605015 to 24167.560517 to match timestep.

Geometry: 1Dcartesian

 Interpolation order : 2
 Maxwell solver : Yee
 simulation duration = 24167.560517,   total number of iterations = 518250
 timestep = 0.046633 = 0.950000 x CFL,   time resolution = 21.444034
 Grid length: 9650.97
 Cell length: 0.0490874, 0, 0
 Number of cells: 196608
 Spatial resolution: 20.3718
 Cell sorting: activated

Electromagnetic boundary conditions

 xmin silver-muller, absorbing vector [1]
 xmax silver-muller, absorbing vector [-1]

Load Balancing:

 Computational load is initially balanced between MPI ranks. (initial_balance = true) 
 Happens: every 50 iterations
 Cell load coefficient = 1.000000
 Frozen particle load coefficient = 0.100000

Vectorization:

 Mode: adaptive
 Default mode: off
 Time selection: never

 WARNING src/Params/Params.cpp:1245 (check_consistency) Vectorized and scalar algorithms are the same in 1D Cartesian geometry.
 Calling python writeInfo

Initializing MPI

 MPI_THREAD_MULTIPLE enabled
 Number of MPI processes: 288
 Number of threads per MPI process : 2
 OpenMP task parallelization not activated

 Number of patches: 8192
 Number of cells in one patch: 24
 Dynamic load balancing: every 50 iterations

Initializing the restart environment

 Code will stop after 1380.000000 minutes
 Code will dump every 1380 min, keeping 2 dumps at maximum

Initializing species

 Creating Species #0: eon
     > Pusher: boris
     > Boundary conditions: remove remove
     > Density profile: 1D built-in profile `trapezoidal` (value: 0.050000, xvacuum: 942.477796, xplateau: 7451.857774, xslope1: 628.318531, xslope2: 628.318531)

 Creating Species #1: ion
     > Pusher: boris
     > Boundary conditions: remove remove
     > Density profile: 1D built-in profile `trapezoidal` (value: 0.050000, xvacuum: 942.477796, xplateau: 7451.857774, xslope1: 628.318531, xslope2: 628.318531)

Initializing laser parameters

 WARNING src/ElectroMagn/Laser.cpp:92 (Laser) Laser #0: space-time profile defined, dismissing time_envelope space_envelope omega chirp_profile phase
 Laser #0: space-time profile
     first component : 1D user-defined function
     second component : 1D user-defined function

 Binary processes #0 within species (0)
     1. Collisions with Coulomb logarithm: auto

 Binary processes #1 within species (1)
     1. Collisions with Coulomb logarithm: auto

 Binary processes #2 between species (0) and (1)
     1. Collisions with Coulomb logarithm: auto

Creating Diagnostics, antennas, and external fields

 Created ParticleBinning #0: species ion
     Axis x from 0 to 9650.97 in 49152 steps
     Axis vx from -0.01 to 0.01 in 1000 steps
 Created ParticleBinning #1: species eon
     Axis x from 0 to 9650.97 in 49152 steps
     Axis vx from -0.5 to 0.5 in 1000 steps
 Created ParticleBinning #2: species ion
     Axis vx from -0.01 to 0.01 in 1000 steps
 Created ParticleBinning #3: species eon
     Axis vx from -0.5 to 0.5 in 1000 steps
 Created ParticleBinning #4: species ion
     Axis ekin from 0.0001 to 1 in 2500 steps [LOGSCALE] 
 Created ParticleBinning #5: species eon
     Axis ekin from 0.0001 to 1 in 2500 steps [LOGSCALE] 
 Diagnostic Fields #0  :
     Ex Ey Ez Bx By Bz Rho_ion Rho_eon Jx Jy Jz 
 Probe diagnostic #0
     10 points
     origin : 0.245437
     corner 0 : 9650.73

Minimum memory consumption (does not include all temporary buffers)

          Particles: Master 96 MB;   Max 99 MB;   Global 27.8 GB
             Fields: Master 3 MB;   Max 3 MB;   Global 0.0372 GB
        scalars.txt: Master 0 MB;   Max 0 MB;   Global 0 GB
ParticleBinning0.h5: Master 375 MB;   Max 375 MB;   Global 105 GB
ParticleBinning1.h5: Master 375 MB;   Max 375 MB;   Global 105 GB
ParticleBinning2.h5: Master 0 MB;   Max 0 MB;   Global 0.00215 GB
ParticleBinning3.h5: Master 0 MB;   Max 0 MB;   Global 0.00215 GB
ParticleBinning4.h5: Master 0 MB;   Max 0 MB;   Global 0.00536 GB
ParticleBinning5.h5: Master 0 MB;   Max 0 MB;   Global 0.00536 GB
         Fields0.h5: Master 0 MB;   Max 0 MB;   Global 0 GB
         Probes0.h5: Master 0 MB;   Max 0 MB;   Global 1.73e-06 GB

Species creation summary

     Species 0 (eon) created with 177408000 particles
     Species 1 (ion) created with 177408000 particles

Expected disk usage (approximate)

 WARNING: disk usage by non-uniform particles maybe strongly underestimated,
    especially when particles are created at runtime (ionization, pair generation, etc.)

 Expected disk usage for diagnostics:
     File Fields0.h5: 3.35 G
     File Probes0.h5: 239.96 M
     File scalars.txt: 19.77 M
     File ParticleBinning0.h5: 76.17 G
     File ParticleBinning1.h5: 76.17 G
     File ParticleBinning2.h5: 1.72 M
     File ParticleBinning3.h5: 1.72 M
     File ParticleBinning4.h5: 4.10 M
     File ParticleBinning5.h5: 4.10 M
 Total disk usage for diagnostics: 155.96 G

 Expected disk usage for each checkpoint:
     For fields: 16.31 M
     For particles: 13.88 G
     For diagnostics: 0 bytes
 Total disk usage for one checkpoint: 13.90 G

Keeping or closing the python runtime environment

 Checking for cleanup() function:
 python cleanup function does not exist
 Keeping Python interpreter alive

DoubleAgentDave commented 10 months ago

When I use dump_step the code behaves correctly and can be restarted as expected.
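As a workaround, a time-based dump interval can be approximated with an iteration-based one. The following is a rough sketch, assuming you have measured the average wall-clock time per iteration from an earlier run; the 0.25 s figure and the helper name `dump_step_for_minutes` are made up for illustration.

```python
def dump_step_for_minutes(minutes, seconds_per_iteration):
    """Number of iterations that fit in `minutes` of wall-clock time.

    Rounded down, and never less than 1.
    """
    if seconds_per_iteration <= 0:
        raise ValueError("seconds_per_iteration must be positive")
    return max(1, int(minutes * 60 / seconds_per_iteration))

# Made-up timing: 0.25 s per iteration, with the 1380-minute interval from the log.
step = dump_step_for_minutes(1380, 0.25)
print(step)

# In the namelist, the result would take the place of dump_minutes, e.g.:
# Checkpoints(
#     dump_step = step,
#     keep_n_dumps = 2,
# )
```

Because the time per iteration drifts with load balancing and diagnostics output, the resulting interval is only approximate; choosing a somewhat smaller `dump_step` leaves a margin before the job's wall-time limit.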

mccoys commented 10 months ago

I have run some tests showing that dump_minutes is not working properly. We had tried a fancy communication pattern, but it fails. I tried a much more basic approach and it works, so I will push that soon.

In the meantime, do not use dump_minutes
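The thread does not say what the failing pattern or its replacement looks like. As a generic illustration only (this is not Smilei's code), the toy model below shows why a wall-clock trigger needs a collective decision: if every task compares its own clock against the deadline, tasks whose clocks lag slightly skip the dump, whereas a single decision made by one task and shared with all the others keeps them consistent. The elapsed times and function names are invented.

```python
def independent_decisions(clocks, deadline):
    """Each task decides from its own clock, so results may differ between tasks."""
    return [t >= deadline for t in clocks]

def shared_decision(clocks, deadline, root=0):
    """One task decides and every task adopts it (a stand-in for a broadcast)."""
    decision = clocks[root] >= deadline
    return [decision for _ in clocks]

# Invented elapsed times (minutes) seen by four tasks at the same iteration.
clocks = [1380.01, 1379.99, 1380.02, 1379.98]
print(independent_decisions(clocks, 1380))  # tasks disagree
print(shared_decision(clocks, 1380))        # all tasks agree
```

In an MPI code, the shared decision would typically be a broadcast or reduction of a single flag, so that all tasks dump at the same iteration or none do.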

mccoys commented 10 months ago

Fixed in the develop branch.