Closed DoubleAgentDave closed 10 months ago
When I use dump_step the code behaves correctly and can be restarted as expected.
I ran some tests showing that dump_minutes is not working properly. We tried a fancy communication pattern, but it fails. I tried a much more basic approach and it works, so I will push that soon.
In the meantime, do not use dump_minutes.
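For anyone who needs time-based checkpoints until the fix lands, one possible workaround is to convert the desired wall-clock interval into a step count and use dump_step instead. This is only a rough sketch: `dump_step_for_interval` is a hypothetical helper, not part of Smilei, and `seconds_per_step` is assumed to come from your own timing of a previous run.

```python
import math


def dump_step_for_interval(minutes_between_dumps, seconds_per_step):
    """Return a step count that roughly spans the requested wall-clock interval.

    Hypothetical helper: seconds_per_step must be measured by the user.
    """
    if minutes_between_dumps <= 0 or seconds_per_step <= 0:
        raise ValueError("both arguments must be positive")
    # Round down so dumps come slightly early rather than late; never below 1.
    return max(1, math.floor(minutes_between_dumps * 60.0 / seconds_per_step))


# Example with assumed numbers: 0.5 s per step, one dump every 240 minutes.
dump_step = dump_step_for_interval(240, 0.5)
print(dump_step)
```

The resulting value would then be passed as dump_step in the Checkpoints block of the namelist. Because the time per step can drift during a run, the actual interval between dumps will only approximate the requested one.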
Fixed in branch develop
dump_minutes works inconsistently. On several occasions the simulation dumps from only the first 15 MPI tasks, and then powers of 2 afterwards, ignoring all the other tasks. Obviously the code cannot restart from this situation.
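Since nothing reports the incomplete dump, a quick check of the checkpoint directory before restarting may help. The sketch below lists, for each dump number, the MPI ranks that have no file. It assumes one file per rank named like `dump-<dump number>-<rank>.h5`; that pattern is an assumption, so adjust `DUMP_PATTERN` to whatever your checkpoint directory actually contains. `missing_ranks` is a hypothetical helper, not a Smilei function.

```python
import re

# Assumed file-name pattern: dump-<dump number>-<MPI rank>.h5
DUMP_PATTERN = re.compile(r"dump-(\d+)-(\d+)\.h5$")


def missing_ranks(filenames, n_ranks, pattern=DUMP_PATTERN):
    """Map each dump number found to the sorted list of ranks with no file."""
    seen = {}
    for name in filenames:
        match = pattern.search(name)
        if match:
            dump, rank = int(match.group(1)), int(match.group(2))
            seen.setdefault(dump, set()).add(rank)
    expected = set(range(n_ranks))
    return {dump: sorted(expected - ranks) for dump, ranks in seen.items()}


# Made-up example: 4 ranks expected, rank 2 never wrote its file for dump 0.
files = [
    "dump-00000-0000000000.h5",
    "dump-00000-0000000001.h5",
    "dump-00000-0000000003.h5",
]
print(missing_ranks(files, n_ranks=4))
```

In practice the file list would come from something like `os.listdir` on the checkpoint directory, with `n_ranks=288` for the runs described here. A dump with a non-empty list of missing ranks should not be used for a restart.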
No errors are reported by the code or SLURM.
An additional problem, also inconsistent, is that even when the code seems to dump correctly, it can hang at the first timestep and never quit, just sitting there wasting cluster time :( Again, no errors are reported by SLURM or Smilei.
This behaviour happens with several versions of Smilei, up to the latest version. The libraries loaded on the cluster where this was tested are:
1) GCC/12.1.0
2) LAPACK/3.10.1
3) BLAS/3.10.0
4) BLIS/0.9
5) Python/3.9.0
6) UCX/1.10.0
7) OpenMPI/4.1.3
8) HDF5/1.12.0
I ran close to 100 of these simulations, and this error happened in around 15% of them. The simulations were 1D and ran on 24 nodes with 12 tasks per node, totaling 288 tasks with 2 threads per task. There are usually 8192 patches over the whole simulation, though this does not seem to change the outcome.
The code outputs (I deleted some boring bits):
HDF5 version 1.12.0
Python version 3.9.0
Parsing pyinit.py
Parsing ??-??
Parsing pyprofiles.py
Parsing broadband_high.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist
Calling python _smilei_check
Calling python _prepare_checkpoint_dir
Calling python _keep_python_running() :
CAREFUL: Patches distribution: hilbertian
WARNING src/Params/Params.cpp:1122 (compute) simulation_time has been redefined from 24167.605015 to 24167.560517 to match timestep.
Geometry: 1Dcartesian
Electromagnetic boundary conditions
Load Balancing:
Vectorization:
WARNING src/Params/Params.cpp:1245 (check_consistency) Vectorized and scalar algorithms are the same in 1D Cartesian geometry.
Calling python writeInfo
Initializing MPI
Initializing the restart environment
Initializing species
Initializing laser parameters
WARNING src/ElectroMagn/Laser.cpp:92 (Laser) Laser #0: space-time profile defined, dismissing time_envelope space_envelope omega chirp_profile phase
Laser #0: space-time profile
first component : 1D user-defined function
second component : 1D user-defined function
Creating Diagnostics, antennas, and external fields
Minimum memory consumption (does not include all temporary buffers)
Species creation summary
Expected disk usage (approximate)
Keeping or closing the python runtime environment