Is it possible to have an example input file? Do you reproduce this bug on small cases?
Just checked that the same problem exists when running the benchmark (e.g., tst2d_01_plasma_mirror.py):
finalize MPI
--------------------------------------------------------------------------------
Done creating diagnostics, antennas, and external fields
Minimum memory consumption (does not include all temporary buffers)
--------------------------------------------------------------------------------
MPICH Notice [Rank 0] [job id 225372.0] [Thu Jan 6 16:54:38 2022] [nid001000] - Abort(604642563) (rank 0 in comm 0): Fatal error in PMPI_Reduce: Invalid datatype, error stack:
PMPI_Reduce(541): MPI_Reduce(sbuf=MPI_IN_PLACE, rbuf=0x7fffcc9d9410, count=1, datatype=MPI_DATATYPE_NULL, op=MPI_SUM, root=0, comm=MPI_COMM_WORLD) failed
PMPI_Reduce(445): Datatype for argument datatype is a null datatype
My input deck is a bit complex; I will post it once I have simplified it, or I can send it through email if you like.
With Cray, surprisingly, it seems that there is an issue with MPI_DATATYPE_NULL.
The error with the Intel environment looks like a time out issue. The code does not seem to crash but Slurm kills the job. Can you send me your compilation workflow and flags?
Thank you.
On that example, the datatype is MPI_LONG_DOUBLE in the source code but MPICH detects it as MPI_DATATYPE_NULL. I checked the MPICH documentation
MPI_LONG_DOUBLE long double (some systems may not implement this)
This is probably why it does not work in your situation.
I do not know what may have caused the first error you reported, as I don't know where the bug happened. Anyway, I don't see why we use long double there; it probably could be changed to double.
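To illustrate the kind of change I mean (a sketch only, not the exact code in src/Patch/VectorPatch.cpp; the reduceOnRoot helper below is hypothetical):

#include <mpi.h>

// Hypothetical helper, for illustration only: reduce a scalar onto rank 0.
// Before: a long double reduced with MPI_LONG_DOUBLE, which some MPI
// implementations do not support (hence the MPI_DATATYPE_NULL error above).
// After: the same reduction done with double / MPI_DOUBLE.
void reduceOnRoot( double &value, MPI_Comm comm, int rank )
{
    if( rank == 0 ) {
        // Root uses MPI_IN_PLACE, as in the error stack above.
        MPI_Reduce( MPI_IN_PLACE, &value, 1, MPI_DOUBLE, MPI_SUM, 0, comm );
    } else {
        MPI_Reduce( &value, NULL, 1, MPI_DOUBLE, MPI_SUM, 0, comm );
    }
}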
Here is the compilation workflow. For Intel and AOCC, the procedure is the same; I just load the corresponding PrgEnv module.
#!/bin/bash
make clean
module purge
module load python/3.8-anaconda-2020.07
module load PrgEnv-cray
module load cray-hdf5-parallel
module list
export HDF5_ROOT_DIR=$HDF5_ROOT
export SMILEICXX=CC
make config="verbose" -j 10
make happi
From what I can tell, it does not use any flags other than those provided by the Cray compiler wrapper:
CC -D__VERSION=\"4.6-642-g22cdba541-master\" -D_VECTO -std=c++11 -Wall -I/opt/cray/pe/hdf5-parallel/1.12.0.7/CRAYCLANG/10.0/include -Isrc -Isrc/Checkpoint -Isrc/Collisions -Isrc/Diagnostic -Isrc/DomainDecomposition -Isrc/ElectroMagn -Isrc/ElectroMagnBC -Isrc/ElectroMagnSolver -Isrc/Field -Isrc/Interpolator -Isrc/Ionization -Isrc/Merging -Isrc/MovWindow -Isrc/MultiphotonBreitWheeler -Isrc/Params -Isrc/Particles -Isrc/ParticleInjector -Isrc/Patch -Isrc/Profiles -Isrc/Projector -Isrc/Pusher -Isrc/Python -Isrc/Radiation -Isrc/SmileiMPI -Isrc/Species -Isrc/Tools -Isrc/picsar_interface -Isrc/ParticleBC -Ibuild/src/Python -I/usr/projects/hpcsoft/common/x86_64/anaconda/2020.07-python-3.8/include/python3.8 -I/usr/projects/hpcsoft/common/x86_64/anaconda/2020.07-python-3.8/include/python3.8 -I/usr/projects/hpcsoft/common/x86_64/anaconda/2020.07-python-3.8/lib/python3.8/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g -fopenmp -D_OMP -c src/ElectroMagnSolver/MA_Solver3D_norm.cpp -o build/src/ElectroMagnSolver/MA_Solver3D_norm.o
MPI_LONG_DOUBLE does seem to be the cause of the crash with the Cray compiler. I changed it to MPI_DOUBLE in src/Patch/VectorPatch.cpp and now the benchmark runs fine.
I will test my input deck with the Cray compiler and report back. Those jobs are not killed due to the time limit: they die within one or two hours every time I try, while the wallclock limit is 16 hours, and the step at which the job is killed is random. I also don't see the usual Slurm message about running out of allocation time (sometimes there is a message about node failure, but not always, which is why I also suspect a cluster issue).
Thank you for testing. We will put the correction in the master version soon. For the Intel workflow, you can try compiling in funneled mode by adding no_mpi_tm to the config. For instance:
make config="verbose no_mpi_tm"…
We have seen some issues with the MPI thread multiple mode on Rome systems in the past; the error may be related.
This page can help you: https://smileipic.github.io/Smilei/installation.html
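For background, the two configurations differ only in the thread-support level requested from MPI at startup. Roughly (a simplified sketch, not Smilei's actual initialization code):

#include <mpi.h>
#include <cstdio>

int main( int argc, char **argv )
{
    int provided;

    // Default build: MPI_THREAD_MULTIPLE, i.e. OpenMP threads may call MPI concurrently.
    // MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );

    // "no_mpi_tm" build: funneled mode, only the master thread makes MPI calls,
    // which avoids the thread-multiple code paths of the MPI library.
    MPI_Init_thread( &argc, &argv, MPI_THREAD_FUNNELED, &provided );

    if( provided < MPI_THREAD_FUNNELED ) {
        std::printf( "Requested thread support not available (provided=%d)\n", provided );
    }

    MPI_Finalize();
    return 0;
}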
A quick update: with MPI_LONG_DOUBLE fixed, the code compiled with the Intel compiler can finish a few 6-7 hour runs without issue, which is good. But larger runs still experience the same problem, and the same run that works with Intel still crashes with the Cray compiler (after 2-3 hours) with the following message:
37300/85401 4.3676e+01 1.1086e+04 ( 2.8292e+00 ) 560
37310/85401 4.3688e+01 1.1089e+04 ( 2.8121e+00 ) 556
Stack trace (most recent call last):
Stack trace (most recent call last):
#10 Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x4e65c9, in _start
#9 Object "/lib64/libc.so.6", at 0x7fdb67029349, in __libc_start_main
#8 Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x83811d, in main
#7 Object "/opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcraymp.so.1", at 0x7fdb674185df, in _cray$mt_kmpc_fork_call_with_flags
#10 Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x4e65c9, in _start
#9 Object "/lib64/libc.so.6", at 0x7f5baf5dc349, in __libc_start_main
#8 Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x83811d, in main
#7 Object "/opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcraymp.so.1", at 0x7f5baf9cb5df, in _cray$mt_kmpc_fork_call_with_flags
#6 Object "/opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcraymp.so.1", at 0x7f5baf98f6d8, in
#5 Object "/opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcraymp.so.1", at 0x7f5baf9d0646, in
#4 Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x83acff, in
#3 Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x773627, in VectorPatch::dynamics(Params&, SmileiMPI*, SimWindow*, RadiationTables&, MultiphotonBreitWheelerTables&, double, Timers&, int)
#2 Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x859e76, in SpeciesV::dynamics(double, unsigned int, ElectroMagn*, Params&, bool, PartWalls*, Patch*, SmileiMPI*, RadiationTables&, MultiphotonBreitWheelerTables&, std::vector<Diagnostic*, std::allocator<Diagnostic*> >&)
#1 Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x7caad9, in Projector2D2OrderV::currentsAndDensityWrapper(ElectroMagn*, Particles&, SmileiMPI*, int, int, int, bool, bool, int, int, int)
#0 Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x7cadbd, in Projector2D2OrderV::currents(double*, double*, double*, Particles&, unsigned int, unsigned int, double*, int*, double*, unsigned int, int)
Segmentation fault (Address not mapped to object [0x7ffe230bcff8])
Is this a problem with MPI_THREAD_MULTIPLE?
This looks like a different issue. Do you have a warning about a CFL problem?
No warning about CFL; the time step is always set to 0.98 of the CFL limit.
Does it work without vectorisation?
I agree that you have 2 different issues here: the second one is the srun: launch/slurm: _step_signal error, right? Is it with the code compiled in funneled mode (make config="verbose no_mpi_tm" -j 10) or the default mode (make config="verbose" -j 10)? Note that the initial issue has been fixed. The new segfault remains.
Sorry for the long delay. I found that this is still an issue on Cray. Below is a simple 1D input deck that triggers the same problem for both the AOCC- and GNU-compiled versions (the latest one on GitHub).
import numpy as np
twopi = 2*np.pi
twopi2 = twopi**2.0
# ---------- pre-processing -------------------
## global info
box = [800]
res = [200] # resolution in space: cell number per unit length
number_of_patches = [2048]
tmax = 1000 # max sim. time
## processing global info
dr =[ 1./i for i in res ]
dt_CFL = 1./np.sqrt(sum(i**2 for i in res)) # Courant condition
res_t = (1./dt_CFL)/0.98 # resolution in t
dt =1./res_t
Nt = int(res_t) # steps needed to finish one time unit
## adjust box size (cells) to be divisible by number_of_patches
## ** do not change **
for i in range(len(box)):
fac = float(box[i])*res[i]/number_of_patches[i]
if fac!=fac//1: fac = int(fac)+1
box[i] = float(fac)*number_of_patches[i]/res[i]
cells = list( np.array(box)*np.array(res) ) # total number of cells
# ------------ global settings --------------------
Main(
geometry = "1Dcartesian",
interpolation_order = 2,
grid_length = box,
cell_length = dr,
number_of_patches = number_of_patches, # each must be a power of 2; total must be greater than the number of MPI ranks
timestep = dt,
simulation_time = tmax,
EM_boundary_conditions = [
['silver-muller'],
],
solve_poisson = False, # False to have neutralized background
reference_angular_frequency_SI = 3.e8/1.0e-6, # speed of light / laser wavelength
print_every = 10,
)
#--------------------- plasma -------------
time_frozen = 20
Species(name = 'ion1', position_initialization = 'regular', momentum_initialization = 'maxwell-juettner', temperature = [0.01/511]*3, particles_per_cell = 100, atomic_number = 6.0, mass = 12*1836.0, charge = 2.0, number_density = 2.0, boundary_conditions = [ ["remove", 'remove'],], time_frozen=time_frozen,)
Species(name = 'ion2', position_initialization = 'regular', momentum_initialization = 'maxwell-juettner', temperature = [0.01/511]*3, particles_per_cell = 100, atomic_number=1.0, mass = 2.0*1836.0, charge = 1.0, number_density = 1.0, boundary_conditions = [ ["remove", 'remove'],], time_frozen=time_frozen,)
Species(name = 'elec',position_initialization = 'regular',momentum_initialization = 'maxwell-juettner', temperature = [0.01/511]*3, particles_per_cell = 100, mass = 1.0, charge = -1.0, charge_density = 3.0, boundary_conditions = [ ["remove", 'remove'],], time_frozen=time_frozen,)
The error message is below
106450/204081 5.2161e+02 1.1579e+03 ( 2.8561e-01 ) 301
106460/204081 5.2166e+02 1.1582e+03 ( 2.4863e-01 ) 262
106470/204081 5.2171e+02 1.2497e+03 ( 9.1497e+01 ) 96522
srun: error: nid001040: tasks 1-4,6-8,10-16,18-63,65-70,72-76,78-127: Killed
srun: launch/slurm: _step_signal: Terminating StepId=629478.0
slurmstepd: error: *** STEP 629478.0 ON nid001040 CANCELLED AT 2022-08-28T18:09:18 ***
srun: error: nid001040: tasks 5,9,17,64,71,77: Killed
srun: error: nid001186: tasks 256-383: Terminated
srun: error: nid001185: tasks 128-255: Terminated
srun: error: nid001187: tasks 384-511: Terminated
srun: error: nid001040: task 0: Killed
srun: Force Terminated StepId=629478.0
I would appreciate any ideas on how to fix this.
Do you have any more error messages? This looks like you ran out of memory. Look at the time for the last iteration: it is much higher than the previous ones.
There is no other message. It should not run out of memory, as I am running this on 4 nodes (each with 512GB memory) using 512 MPI ranks.
I just found out that the code, compiled and run exactly the same way on another Cray EX (Perlmutter), has no issue with either the GNU or AOCC compiler. So the problem may be related to the machine, but I wonder if there is a good way to find out.
Unfortunately we cannot investigate for you whether this is a Smilei issue or not. It might be related to the filesystem too, or even to details of the MPI installation.
Hello again. I have now run Smilei under Valgrind on the Cray EX and also on older clusters where I had not encountered this issue. There are many outputs from Valgrind like the ones below:
==35830==
==35829== Invalid read of size 8
==35829== at 0x84022E: Particles::initialize(unsigned int, Particles&) (Particles.cpp:103)
==35829== by 0x844EC6: Patch::endNbrOfParticles(SmileiMPI*, int, Params&, int, VectorPatch*) (Patch.cpp:921)
==35829== by 0x88F64F: SyncVectorPatch::finalizeExchangeParticles(VectorPatch&, int, int, Params&, SmileiMPI*, Timers&, int) (SyncVectorPatch.cpp:115)
==35829== by 0x88F75C: SyncVectorPatch::finalizeAndSortParticles(VectorPatch&, int, Params&, SmileiMPI*, Timers&, int) (SyncVectorPatch.cpp:52)
==35829== by 0x8A5D4F: VectorPatch::finalizeAndSortParticles(Params&, SmileiMPI*, SimWindow*, double, Timers&, int) (VectorPatch.cpp:474)
==35829== by 0x93163B: main._omp_fn.3 (Smilei.cpp:538)
==35829== by 0x532D4F: main (Smilei.cpp:534)
==35829== Address 0x10888840 is 16 bytes after a block of size 32 in arena "client"
==35829==
4090/204081 2.0043e+01 1.9318e+02 ( 1.2959e+02 ) 2135 ==35833== Invalid read of size 8
==35833== at 0x840223: Particles::initialize(unsigned int, Particles&) (Particles.cpp:102)
==35833== by 0x844EC6: Patch::endNbrOfParticles(SmileiMPI*, int, Params&, int, VectorPatch*) (Patch.cpp:921)
==35833== by 0x88F64F: SyncVectorPatch::finalizeExchangeParticles(VectorPatch&, int, int, Params&, SmileiMPI*, Timers&, int) (SyncVectorPatch.cpp:115)
==35833== by 0x88F75C: SyncVectorPatch::finalizeAndSortParticles(VectorPatch&, int, Params&, SmileiMPI*, Timers&, int) (SyncVectorPatch.cpp:52)
==35833== by 0x8A5D4F: VectorPatch::finalizeAndSortParticles(Params&, SmileiMPI*, SimWindow*, double, Timers&, int) (VectorPatch.cpp:474)
==35833== by 0x93163B: main._omp_fn.3 (Smilei.cpp:538)
==35833== by 0x532D4F: main (Smilei.cpp:534)
==35833== Address 0x10888898 is 0 bytes after a block of size 24 alloc'd
==35833== at 0x4C2B734: operator new(unsigned long) (vg_replace_malloc.c:417)
==35833== by 0x60CE98: allocate (new_allocator.h:104)
==35833== by 0x60CE98: _M_allocate (stl_vector.h:168)
==35833== by 0x60CE98: std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >::_M_default_append(unsigned long) (vector.tcc:549)
==35833== by 0x83F55B: resize (stl_vector.h:667)
==35833== by 0x83F55B: Particles::resize(unsigned int, unsigned int, bool) (Particles.cpp:179)
==35833== by 0x840200: initialize (Particles.cpp:56)
==35833== by 0x840200: Particles::initialize(unsigned int, Particles&) (Particles.cpp:125)
==35833== by 0x9470DB: Species::initOperators(Params&, Patch*) (Species.cpp:277)
==35833== by 0x8875B7: SpeciesFactory::createVector(Params&, Patch*) (SpeciesFactory.h:1294)
==35833== by 0x8480A6: Patch::finishCreation(Params&, SmileiMPI*, DomainDecomposition*) (Patch.cpp:160)
==35833== by 0x889111: Patch1D::Patch1D(Params&, SmileiMPI*, DomainDecomposition*, unsigned int, unsigned int) (Patch1D.cpp:26)
==35833== by 0x932B83: create (PatchesFactory.h:27)
==35833== by 0x932B83: PatchesFactory::createVector(VectorPatch&, Params&, SmileiMPI*, OpenPMDparams&, RadiationTables*, unsigned int, unsigned int) (PatchesFactory.h:76)
==35833== by 0x533007: main (Smilei.cpp:201)
==35833==
Are these potential memory leaks?
To run Valgrind on Smilei, I followed the instructions here:
https://pythondev.readthedocs.io/debug_tools.html
exporting PYTHONMALLOC=malloc and using the suppression file below (also uncommenting the suppressions for _PyObject_Free and _PyObject_Realloc):
https://raw.githubusercontent.com/python/cpython/main/Misc/valgrind-python.supp
Then I ran valgrind --tool=memcheck --suppressions=valgrind-python.supp smilei inp.py under srun or mpirun.
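For reference, my understanding is that an "Invalid read ... after a block" report indicates an out-of-bounds access into a heap allocation rather than a leak. A minimal, hypothetical snippet (not Smilei code) that triggers the same kind of report under Valgrind:

#include <vector>
#include <iostream>

int main()
{
    // Three doubles -> a 24-byte heap block, like the
    // "block of size 24 alloc'd" in the report above.
    std::vector<double> v( 3, 1.0 );

    // Reading one element past the end is what Valgrind typically reports
    // as "Invalid read of size 8 ... 0 bytes after a block of size 24".
    double x = v.data()[3];   // out-of-bounds read, undefined behavior

    std::cout << x << std::endl;
    return 0;
}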
Hi,
I am running Smilei on an HPE Cray EX with AMD EPYC processors. If I compile with the Cray compiler in the PrgEnv-cray environment, the run crashes quickly with the following message.
For information, here are the only places where MPI_Reduce is called with the MPI_SUM operation:
If I compile with the Intel or AOCC compiler in the PrgEnv-intel or PrgEnv-aocc environment, there is no such error; however, the code is randomly killed after running for a while. In that case, there is not much useful information in the log.
I noticed that the last step before the job was killed usually took much longer than earlier steps, but there should be no intensive I/O at that step (the problem is the same even without diagnostic I/O).
However, I verified that the same input deck runs fine on other clusters with older Intel Broadwell processors.
The Smilei version used is the latest master on GitHub, but I also tried the earlier 4.6 version and observed the same issue.
I wonder whether this is an issue with Smilei, the compiler, or the cluster. To me, it sounds more like the latter two, but the crash with the Cray compiler due to a null datatype seems concerning, and I have no issues with other codes on this cluster.
Thanks.