SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

Problem on Cray EX #492

Closed iplasma closed 2 years ago

iplasma commented 2 years ago

Hi,

I am running Smilei on an HPE Cray EX with AMD EPYC processors. If I compile with the Cray compiler in the PrgEnv-cray environment, the run crashes quickly with the following message:

aborting job:
Fatal error in PMPI_Reduce: Invalid datatype, error stack:
PMPI_Reduce(541): MPI_Reduce(sbuf=0x7ffdfc77f010, rbuf=0x7ffdfc77f010, count=1, datatype=MPI_DATATYPE_NULL, op=MPI_SUM, root=0, comm=MPI_COMM_WORLD) failed
PMPI_Reduce(445): Datatype for argument datatype is a null datatype
MPICH Notice [Rank 0] [job id 225340.0] [Thu Jan  6 14:24:47 2022] [nid001579] - Abort(134880515) (rank 0 in comm 0): Fatal error in PMPI_Reduce: Invalid datatype, error stack:
PMPI_Reduce(541): MPI_Reduce(sbuf=MPI_IN_PLACE, rbuf=0x7ffde3bde810, count=1, datatype=MPI_DATATYPE_NULL, op=MPI_SUM, root=0, comm=MPI_COMM_WORLD) failed
PMPI_Reduce(445): Datatype for argument datatype is a null datatype

For reference, here are the only places where MPI_Reduce is called with the MPI_SUM operation:

src/ElectroMagn/LaserPropagator.cpp:318:        MPI_Reduce( &local_spectrum[0], &spectrum[0], lmax, MPI_DOUBLE, MPI_SUM, 0, comm_ );
src/Patch/VectorPatch.h:352:            MPI_Reduce( &( nParticles[ispec] ), &tmp, 1, MPI_UINT64_T, MPI_SUM, 0, smpi->world() );
src/Patch/VectorPatch.cpp:68:        MPI_Reduce( &diag_timers[idiag]->time_acc_, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
src/Patch/VectorPatch.cpp:3898:    MPI_Reduce( smpi->isMaster()?MPI_IN_PLACE:&globalData, &globalData, 1, MPI_LONG_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
src/SmileiMPI/SmileiMPI.h:251:        MPI_Reduce( &locNbrParticles, &nParticles, 1, MPI_INT, MPI_SUM, 0, world_ );
src/SmileiMPI/SmileiMPI.cpp:324:    MPI_Reduce( &total_load, &Tload, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
src/SmileiMPI/SmileiMPI.cpp:1601:    MPI_Reduce( isMaster()?MPI_IN_PLACE:d_sum, d_sum, n_sum, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
src/SmileiMPI/SmileiMPI.cpp:1684:        MPI_Reduce( diagParticles->filename.size()?MPI_IN_PLACE:&diagParticles->data_sum[0], &diagParticles->data_sum[0], diagParticles->output_size, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
src/SmileiMPI/SmileiMPI.cpp:1698:        MPI_Reduce( diagScreen->filename.size()?MPI_IN_PLACE:&diagScreen->data_sum[0], &diagScreen->data_sum[0], diagScreen->output_size, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
src/SmileiMPI/SmileiMPI.cpp:1712:        MPI_Reduce( diagRad->filename.size()?MPI_IN_PLACE:&diagRad->data_sum[0], &diagRad->data_sum[0], diagRad->output_size, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );

If I compile with the Intel or AOCC compiler in the PrgEnv-intel or PrgEnv-aocc environment, there is no such error; however, the code is randomly killed after running for a while. In that case, there is not much useful information in the log:

69800/213512     3.2691e+01     1.1102e+04   (  4.0116e+00 )            1290
69810/213512     3.2696e+01     1.1160e+04   (  5.8392e+01 )           18786
srun: error: nid001166: tasks 1-127: Killed
srun: launch/slurm: _step_signal: Terminating StepId=225257.0
slurmstepd: error: *** STEP 225257.0 ON nid001166 CANCELLED AT 2022-01-06T13:50:26 ***

I noticed that the last step before the job is killed usually takes much longer than earlier steps, but there should be no intensive I/O at that step (the problem is the same even without diagnostic I/O).

However, I verified that the same input deck runs fine on other clusters with older Intel Broadwell processors.

The Smilei version used is the latest master on GitHub, but I also tried the earlier 4.6 version and observed the same issue.

I wonder whether this is an issue with Smilei, the compiler, or the cluster. To me, it sounds more like the latter two, but the crash with the Cray compiler due to the null datatype seems concerning, and I have had no issues with other codes on this cluster.

Thanks.

mccoys commented 2 years ago

Is it possible to have an example input file? Do you reproduce this bug on small cases?

iplasma commented 2 years ago

I just checked that the same problem exists when running the benchmark (e.g., tst2d_01_plasma_mirror.py):

 finalize MPI
 --------------------------------------------------------------------------------
         Done creating diagnostics, antennas, and external fields

 Minimum memory consumption (does not include all temporary buffers)
 --------------------------------------------------------------------------------
MPICH Notice [Rank 0] [job id 225372.0] [Thu Jan  6 16:54:38 2022] [nid001000] - Abort(604642563) (rank 0 in comm 0): Fatal error in PMPI_Reduce: Invalid datatype, error stack:
PMPI_Reduce(541): MPI_Reduce(sbuf=MPI_IN_PLACE, rbuf=0x7fffcc9d9410, count=1, datatype=MPI_DATATYPE_NULL, op=MPI_SUM, root=0, comm=MPI_COMM_WORLD) failed
PMPI_Reduce(445): Datatype for argument datatype is a null datatype

My input deck is a bit complex; I will post it once I simplify it, or I can send it by email if you like.

xxirii commented 2 years ago

With Cray, surprisingly, it seems that there is an issue with MPI_DATATYPE_NULL.

The error with the Intel environment looks like a timeout issue. The code does not seem to crash, but Slurm kills the job. Can you send me your compilation workflow and flags?

Thank you.

mccoys commented 2 years ago

In that example, the datatype is MPI_LONG_DOUBLE in the source code, but MPICH detects it as MPI_DATATYPE_NULL. I checked the MPICH documentation:

MPI_LONG_DOUBLE long double (some systems may not implement this)

This is probably why it does not work in your situation.
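
A quick way to confirm this on that machine is a standalone check like the sketch below (not part of Smilei): on systems where the MPI library does not support long double, the MPI_LONG_DOUBLE handle can simply be MPI_DATATYPE_NULL, which is exactly what the error above shows.

// Standalone sketch: check whether this MPI library provides a usable MPI_LONG_DOUBLE.
#include <mpi.h>
#include <cstdio>

int main( int argc, char **argv )
{
    MPI_Init( &argc, &argv );
    if( MPI_LONG_DOUBLE == MPI_DATATYPE_NULL ) {
        std::printf( "MPI_LONG_DOUBLE is MPI_DATATYPE_NULL on this system\n" );
    } else {
        int size = 0;
        MPI_Type_size( MPI_LONG_DOUBLE, &size );
        std::printf( "MPI_LONG_DOUBLE is available, size = %d bytes\n", size );
    }
    MPI_Finalize();
    return 0;
}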

I do not know what the problem may have been in the first error you reported, as I don't know where the bug happened.

Anyway, I don't see why we use long double there. It could probably be changed to double.

iplasma commented 2 years ago

Here is the compilation workflow. For Intel and AOCC, it is the same procedure; I just load the corresponding PrgEnv module.

#!/bin/bash

make clean
module purge
module load python/3.8-anaconda-2020.07
module load PrgEnv-cray
module load cray-hdf5-parallel
module list
export HDF5_ROOT_DIR=$HDF5_ROOT
export SMILEICXX=CC

make config="verbose" -j 10
make happi

From what I can tell, it does not use any flags other than those from the Cray compiler wrapper:

CC -D__VERSION=\"4.6-642-g22cdba541-master\" -D_VECTO -std=c++11 -Wall -I/opt/cray/pe/hdf5-parallel/1.12.0.7/CRAYCLANG/10.0/include -Isrc -Isrc/Checkpoint -Isrc/Collisions -Isrc/Diagnostic -Isrc/DomainDecomposition -Isrc/ElectroMagn -Isrc/ElectroMagnBC -Isrc/ElectroMagnSolver -Isrc/Field -Isrc/Interpolator -Isrc/Ionization -Isrc/Merging -Isrc/MovWindow -Isrc/MultiphotonBreitWheeler -Isrc/Params -Isrc/Particles -Isrc/ParticleInjector -Isrc/Patch -Isrc/Profiles -Isrc/Projector -Isrc/Pusher -Isrc/Python -Isrc/Radiation -Isrc/SmileiMPI -Isrc/Species -Isrc/Tools -Isrc/picsar_interface -Isrc/ParticleBC -Ibuild/src/Python -I/usr/projects/hpcsoft/common/x86_64/anaconda/2020.07-python-3.8/include/python3.8 -I/usr/projects/hpcsoft/common/x86_64/anaconda/2020.07-python-3.8/include/python3.8 -I/usr/projects/hpcsoft/common/x86_64/anaconda/2020.07-python-3.8/lib/python3.8/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g -fopenmp -D_OMP -c src/ElectroMagnSolver/MA_Solver3D_norm.cpp -o build/src/ElectroMagnSolver/MA_Solver3D_norm.o

MPI_LONG_DOUBLE does seem to be the cause of the crash with the Cray compiler. I changed it to MPI_DOUBLE in src/Patch/VectorPatch.cpp and the benchmark run is now fine.
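
For reference, here is a self-contained sketch (not Smilei's actual code) that mirrors the reduction pattern used at VectorPatch.cpp:3898, with MPI_DOUBLE in place of MPI_LONG_DOUBLE:

// Sketch: root reduces in place, like the call in VectorPatch.cpp, but with a
// datatype that Cray MPICH accepts.
#include <mpi.h>
#include <cstdio>

int main( int argc, char **argv )
{
    MPI_Init( &argc, &argv );
    int rank = 0;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    double globalData = 1.0;   // per-rank contribution (was accumulated as long double)
    MPI_Reduce( rank == 0 ? MPI_IN_PLACE : &globalData, &globalData, 1,
                MPI_DOUBLE,    // was MPI_LONG_DOUBLE, which is MPI_DATATYPE_NULL here
                MPI_SUM, 0, MPI_COMM_WORLD );

    if( rank == 0 ) {
        std::printf( "sum = %f\n", globalData );
    }
    MPI_Finalize();
    return 0;
}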

I will be testing my input deck with the Cray compiler and report back. Those jobs are not killed due to the time limit: they die after one or two hours every time I try, while the wallclock limit is 16 hours; only the step at which the job is killed is random. Also, I don't see the usual Slurm message about running out of allocation time (sometimes there is also a message about node failure, but not always, which is why I also suspect a cluster issue).

xxirii commented 2 years ago

Thank you for testing. We will put the correction in the master version soon. For the Intel workflow, you can try to compile in funneled mode by adding no_mpi_tm to the config. For instance:

make config="verbose no_mpi_tm"…

We have seen some issues with the MPI THREAD_MULTIPLE mode on Rome systems in the past. The error may be related.

This page can help you: https://smileipic.github.io/Smilei/installation.html
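
For context, the difference between the two modes at the MPI level is roughly the following; this is a minimal sketch, not Smilei's actual initialization code. The default build requests MPI_THREAD_MULTIPLE (any OpenMP thread may call MPI), while a no_mpi_tm build requests the weaker MPI_THREAD_FUNNELED level (only the master thread calls MPI).

// Minimal sketch (not Smilei's code) of requesting a thread-support level and
// checking what the MPI library actually provides.
#include <mpi.h>
#include <cstdio>

int main( int argc, char **argv )
{
    int provided = 0;
    // Default mode requests MPI_THREAD_MULTIPLE; a funneled build would request
    // MPI_THREAD_FUNNELED instead.
    MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );
    if( provided < MPI_THREAD_MULTIPLE ) {
        std::printf( "MPI library downgraded thread support to level %d\n", provided );
    }
    MPI_Finalize();
    return 0;
}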

iplasma commented 2 years ago

A quick update: with the MPI_LONG_DOUBLE fix, the code compiled by the Intel compiler can finish a few runs of 6-7 hours without issue, so this is good. But larger runs still experience the same problem, and the same run that works with Intel still crashes with the Cray compiler (after 2-3 hours) with the following message:

   37300/85401     4.3676e+01     1.1086e+04   (  2.8292e+00 )             560
   37310/85401     4.3688e+01     1.1089e+04   (  2.8121e+00 )             556
Stack trace (most recent call last):
Stack trace (most recent call last):
#10   Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x4e65c9, in _start
#9    Object "/lib64/libc.so.6", at 0x7fdb67029349, in __libc_start_main
#8    Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x83811d, in main
#7    Object "/opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcraymp.so.1", at 0x7fdb674185df, in _cray$mt_kmpc_fork_call_with_flags
#10   Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x4e65c9, in _start
#9    Object "/lib64/libc.so.6", at 0x7f5baf5dc349, in __libc_start_main
#8    Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x83811d, in main
#7    Object "/opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcraymp.so.1", at 0x7f5baf9cb5df, in _cray$mt_kmpc_fork_call_with_flags
#6    Object "/opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcraymp.so.1", at 0x7f5baf98f6d8, in
#5    Object "/opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcraymp.so.1", at 0x7f5baf9d0646, in
#4    Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x83acff, in
#3    Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x773627, in VectorPatch::dynamics(Params&, SmileiMPI*, SimWindow*, RadiationTables&, MultiphotonBreitWheelerTables&, double, Timers&, int)
#2    Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x859e76, in SpeciesV::dynamics(double, unsigned int, ElectroMagn*, Params&, bool, PartWalls*, Patch*, SmileiMPI*, RadiationTables&, MultiphotonBreitWheelerTables&, std::vector<Diagnostic*, std::allocator<Diagnostic*> >&)
#1    Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x7caad9, in Projector2D2OrderV::currentsAndDensityWrapper(ElectroMagn*, Particles&, SmileiMPI*, int, int, int, bool, bool, int, int, int)
#0    Object "/users/pic/src/smilei/github/Smilei/smilei-4.6-ch-cray-py38", at 0x7cadbd, in Projector2D2OrderV::currents(double*, double*, double*, Particles&, unsigned int, unsigned int, double*, int*, double*, unsigned int, int)
Segmentation fault (Address not mapped to object [0x7ffe230bcff8])

Is this a problem with MPI THREAD_MULTIPLE?

mccoys commented 2 years ago

This looks like a different issue. Do you have a warning about a CFL problem?

iplasma commented 2 years ago

No warning about CFL; the time step is always set to 0.98 of the CFL limit.

mccoys commented 2 years ago

Does it work without vectorisation?

xxirii commented 2 years ago

I agree that you have 2 different issues here:

  1. There is a possible issue with the Cray compiler in vector mode. We must admit that the code is not often tested with the Cray compiler. This error is not related to the MPI THREAD_MULTIPLE mode.
  2. For large runs, you still have the srun: launch/slurm: _step_signal error, right? Is it with the code compiled in funneled mode (make config="verbose no_mpi_tm" -j 10) or the default mode (make config="verbose" -j 10)?

mccoys commented 2 years ago

Note that the initial issue has been fixed. The new segfault remains.

iplasma commented 2 years ago

Sorry for the long delay. I found that this is still an issue on Cray. Below is a simple 1D input deck that triggers the same problem with both the AOCC- and GNU-compiled versions (the latest master on GitHub).

import numpy as np
twopi = 2*np.pi
twopi2 = twopi**2.0

# ---------- pre-processing -------------------
## global info
box = [800]
res = [200]   # resolution in space: cell number per unit length
number_of_patches = [2048]
tmax = 1000  # max sim. time
## processing global info
dr = [ 1./i for i in res ]
dt_CFL = 1./np.sqrt(sum(i**2 for i in res))  # Courant condition
res_t = (1./dt_CFL)/0.98  # resolution in t
dt = 1./res_t
Nt = int(res_t)  # steps needed to finish one time unit

## adjust box size (cells) so it is divisible by number_of_patches
##   ** do not change **
for i in range(len(box)):
    fac = float(box[i])*res[i]/number_of_patches[i]
    if fac != fac//1: fac = int(fac)+1
    box[i] = float(fac)*number_of_patches[i]/res[i]
cells = list( np.array(box)*np.array(res) )  # total number of cells

# ------------ global settings --------------------
Main(
    geometry = "1Dcartesian",
    interpolation_order = 2,
    grid_length = box,
    cell_length = dr,
    number_of_patches = number_of_patches,  # each must be a power of 2, total must be greater than the number of MPI ranks
    timestep = dt,
    simulation_time = tmax,
    EM_boundary_conditions = [
        ['silver-muller'],
    ],
    solve_poisson = False,  # False to have a neutralized background
    reference_angular_frequency_SI = 3.e8/1.0e-6,  # speed of light / laser wavelength
    print_every = 10,
)

#--------------------- plasma -------------
time_frozen = 20
Species(name = 'ion1', position_initialization = 'regular', momentum_initialization = 'maxwell-juettner', temperature = [0.01/511]*3, particles_per_cell = 100, atomic_number = 6.0, mass = 12*1836.0, charge = 2.0, number_density = 2.0, boundary_conditions = [['remove', 'remove']], time_frozen = time_frozen)

Species(name = 'ion2', position_initialization = 'regular', momentum_initialization = 'maxwell-juettner', temperature = [0.01/511]*3, particles_per_cell = 100, atomic_number = 1.0, mass = 2.0*1836.0, charge = 1.0, number_density = 1.0, boundary_conditions = [['remove', 'remove']], time_frozen = time_frozen)

Species(name = 'elec', position_initialization = 'regular', momentum_initialization = 'maxwell-juettner', temperature = [0.01/511]*3, particles_per_cell = 100, mass = 1.0, charge = -1.0, charge_density = 3.0, boundary_conditions = [['remove', 'remove']], time_frozen = time_frozen)

The error message is below:

   106450/204081     5.2161e+02     1.1579e+03   (  2.8561e-01 )             301
   106460/204081     5.2166e+02     1.1582e+03   (  2.4863e-01 )             262
   106470/204081     5.2171e+02     1.2497e+03   (  9.1497e+01 )           96522
srun: error: nid001040: tasks 1-4,6-8,10-16,18-63,65-70,72-76,78-127: Killed
srun: launch/slurm: _step_signal: Terminating StepId=629478.0
slurmstepd: error: *** STEP 629478.0 ON nid001040 CANCELLED AT 2022-08-28T18:09:18 ***
srun: error: nid001040: tasks 5,9,17,64,71,77: Killed
srun: error: nid001186: tasks 256-383: Terminated
srun: error: nid001185: tasks 128-255: Terminated
srun: error: nid001187: tasks 384-511: Terminated
srun: error: nid001040: task 0: Killed
srun: Force Terminated StepId=629478.0

I would appreciate any ideas on how to fix this.

mccoys commented 2 years ago

Do you have any more error messages? This looks like you are running out of memory. Look at the time for the last iteration: it is much higher than for the previous ones.

iplasma commented 2 years ago

There is no other message. It should not be running out of memory, as I am running this on 4 nodes (each with 512 GB of memory) using 512 MPI ranks.

I just found out that the code, compiled and run in exactly the same way on another Cray EX (Perlmutter), has no issue with either the GNU or the AOCC compiler. So the problem may be related to the machine, but I wonder if there is a good way to find out.

mccoys commented 2 years ago

Unfortunately, we cannot investigate for you whether this is a Smilei issue or not. It might be related to the filesystem too, or even to details of the MPI installation.

iplasma commented 2 years ago

Hello again. I have now run Smilei under Valgrind on the Cray EX and also on the older clusters where I had not encountered this issue. There are many outputs from Valgrind like the ones below:

==35830==
==35829== Invalid read of size 8
==35829==    at 0x84022E: Particles::initialize(unsigned int, Particles&) (Particles.cpp:103)
==35829==    by 0x844EC6: Patch::endNbrOfParticles(SmileiMPI*, int, Params&, int, VectorPatch*) (Patch.cpp:921)
==35829==    by 0x88F64F: SyncVectorPatch::finalizeExchangeParticles(VectorPatch&, int, int, Params&, SmileiMPI*, Timers&, int) (SyncVectorPatch.cpp:115)
==35829==    by 0x88F75C: SyncVectorPatch::finalizeAndSortParticles(VectorPatch&, int, Params&, SmileiMPI*, Timers&, int) (SyncVectorPatch.cpp:52)
==35829==    by 0x8A5D4F: VectorPatch::finalizeAndSortParticles(Params&, SmileiMPI*, SimWindow*, double, Timers&, int) (VectorPatch.cpp:474)
==35829==    by 0x93163B: main._omp_fn.3 (Smilei.cpp:538)
==35829==    by 0x532D4F: main (Smilei.cpp:534)
==35829==  Address 0x10888840 is 16 bytes after a block of size 32 in arena "client"
==35829==
     4090/204081     2.0043e+01     1.9318e+02   (  1.2959e+02 )            2135
==35833== Invalid read of size 8
==35833==    at 0x840223: Particles::initialize(unsigned int, Particles&) (Particles.cpp:102)
==35833==    by 0x844EC6: Patch::endNbrOfParticles(SmileiMPI*, int, Params&, int, VectorPatch*) (Patch.cpp:921)
==35833==    by 0x88F64F: SyncVectorPatch::finalizeExchangeParticles(VectorPatch&, int, int, Params&, SmileiMPI*, Timers&, int) (SyncVectorPatch.cpp:115)
==35833==    by 0x88F75C: SyncVectorPatch::finalizeAndSortParticles(VectorPatch&, int, Params&, SmileiMPI*, Timers&, int) (SyncVectorPatch.cpp:52)
==35833==    by 0x8A5D4F: VectorPatch::finalizeAndSortParticles(Params&, SmileiMPI*, SimWindow*, double, Timers&, int) (VectorPatch.cpp:474)
==35833==    by 0x93163B: main._omp_fn.3 (Smilei.cpp:538)
==35833==    by 0x532D4F: main (Smilei.cpp:534)
==35833==  Address 0x10888898 is 0 bytes after a block of size 24 alloc'd
==35833==    at 0x4C2B734: operator new(unsigned long) (vg_replace_malloc.c:417)
==35833==    by 0x60CE98: allocate (new_allocator.h:104)
==35833==    by 0x60CE98: _M_allocate (stl_vector.h:168)
==35833==    by 0x60CE98: std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >::_M_default_append(unsigned long) (vector.tcc:549)
==35833==    by 0x83F55B: resize (stl_vector.h:667)
==35833==    by 0x83F55B: Particles::resize(unsigned int, unsigned int, bool) (Particles.cpp:179)
==35833==    by 0x840200: initialize (Particles.cpp:56)
==35833==    by 0x840200: Particles::initialize(unsigned int, Particles&) (Particles.cpp:125)
==35833==    by 0x9470DB: Species::initOperators(Params&, Patch*) (Species.cpp:277)
==35833==    by 0x8875B7: SpeciesFactory::createVector(Params&, Patch*) (SpeciesFactory.h:1294)
==35833==    by 0x8480A6: Patch::finishCreation(Params&, SmileiMPI*, DomainDecomposition*) (Patch.cpp:160)
==35833==    by 0x889111: Patch1D::Patch1D(Params&, SmileiMPI*, DomainDecomposition*, unsigned int, unsigned int) (Patch1D.cpp:26)
==35833==    by 0x932B83: create (PatchesFactory.h:27)
==35833==    by 0x932B83: PatchesFactory::createVector(VectorPatch&, Params&, SmileiMPI*, OpenPMDparams&, RadiationTables*, unsigned int, unsigned int) (PatchesFactory.h:76)
==35833==    by 0x533007: main (Smilei.cpp:201)
==35833==

Are these potential memory leaks?
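
For reference, this kind of report ("Invalid read ... N bytes after a block of size ... alloc'd") is the signature of reading just past the end of a heap allocation rather than of a leak; the snippet below is only a minimal illustration of the pattern and is unrelated to Smilei's actual Particles code.

// Minimal illustration (not Smilei's code): reading one element past the end of a
// heap block of size 24 (three doubles) produces exactly this kind of Valgrind report.
#include <vector>
#include <cstdio>

int main()
{
    std::vector<double> v( 3, 1.0 );   // 24-byte heap allocation
    double x = v.data()[3];            // invalid read of size 8, 0 bytes after that block
    std::printf( "%f\n", x );
    return 0;
}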

To run Valgrind on Smilei, I followed the instructions here:

https://pythondev.readthedocs.io/debug_tools.html

by exporting PYTHONMALLOC=malloc and using the suppression file here (with the suppressions for _PyObject_Free and _PyObject_Realloc also uncommented):

https://raw.githubusercontent.com/python/cpython/main/Misc/valgrind-python.supp

Then I ran valgrind --tool=memcheck --suppressions=valgrind-python.supp smilei inp.py under srun or mpirun.