SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

Segmentation faults related to injector's position copy #625

Closed Tissot11 closed 1 year ago

Tissot11 commented 1 year ago

Hi,

I am again having segmentation faults similar to what I reported in #611 for 1D simulations. That issue was fixed for 1D simulations, but now I use the same parameters in a 2D setup with periodic boundary conditions in the y-direction for both particles and EM fields. I also tried PML boundary conditions, but again got segmentation faults. I paste part of the .out file below:

Stack trace (most recent call last):
12 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
11 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x477c28, in _start
10 Object "/lib64/libc.so.6", at 0x2b6629efeac4, in __libc_start_main
9 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x9f0984, in main
8 Object "/mpcdf/soft/SLE_12/packages/x86_64/intel_oneapi/2022.3/compiler/latest/linux/compiler/lib/intel64_lin/libiomp5.so", at 0x2b6629b22564, in __kmpc_fork_call
7 Object "/mpcdf/soft/SLE_12/packages/x86_64/intel_oneapi/2022.3/compiler/latest/linux/compiler/lib/intel64_lin/libiomp5.so", at 0x2b6629b6773c, in __kmp_fork_call
6 Object "/mpcdf/soft/SLE_12/packages/x86_64/intel_oneapi/2022.3/compiler/latest/linux/compiler/lib/intel64_lin/libiomp5.so", at 0x2b6629b66472, in
5 Object "/mpcdf/soft/SLE_12/packages/x86_64/intel_oneapi/2022.3/compiler/latest/linux/compiler/lib/intel64_lin/libiomp5.so", at 0x2b6629bf6b12, in __kmp_invoke_microtask
4 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x9ef2b0, in main
3 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x8ac14b, in VectorPatch::dynamics(Params&, SmileiMPI, SimWindow, RadiationTables&, MultiphotonBreitW$
2 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x8ac6dc, in VectorPatch::dynamicsWithoutTasks(Params&, SmileiMPI, SimWindow, RadiationTables&, Multi$
1 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0xa2ccac, in Species::dynamics(double, unsigned int, ElectroMagn, Params&, bool, PartWalls, Patch*, S$
0 Object "/u/nkumar/CodeRepositeCobra/Smilei-v4.7-current/./smilei", at 0x9c7dd9, in Projector2D2Order::currentsAndDensityWrapper(ElectroMagn, Particles&, SmileiMPI, int, in$

Segmentation fault (Address not mapped to object [0x801afadf8])

Any suggestions on how to proceed further?

mccoys commented 1 year ago

Are you using the master branch or the develop branch? Did you update recently?

The bug seems different to me. Does it happen in one configuration only? Only with injectors?

Tissot11 commented 1 year ago

I fetched it in March from the develop branch where you had pushed the fix. But I have also downloaded it from the usual git clone link, so it must be the master branch. This bug seems to happen only with particle injectors. Another 2D simulation (a different plasma physics problem) without injectors seems to run fine.

mccoys commented 1 year ago

Do you have a minimal input file for reproducing the bug?

Tissot11 commented 1 year ago

This is the 2D version of the same input file that is crashing.

d15_th75_mi25.py.txt

Tissot11 commented 1 year ago

Just to point out that this namelist also works fine if I use an older version of Smilei (v4.6), downloaded last year, with older versions of the Intel compiler and Intel MPI.

mccoys commented 1 year ago

I made a much faster input file that reproduces the bug:

# ----------------------------------------------------------------------------------------
#                     SIMULATION PARAMETERS
# ----------------------------------------------------------------------------------------
import math, random, os
import numpy as np

l0 = 1.0                       # reference length
Lx = 512.0*l0                  # box length in x
Ly = 8.0*l0                    # box length in y
tsim = 1.0*10**2               # simulation time
loc = 256.0*l0                 # x-position of the upstream density transition
dx = l0/5.                     # cell size in x
dy = l0/2.                     # cell size in y
mi = 25.0                      # ion mass (in units of the electron mass)
#mAlfven = 63.
#mSonic = 70.
vUPC = 0.15
angleDegrees = 75.0

# upstream parameters
nUP = 1.0
u1x = 0.15
u1y = 0.0
TeUP = 0.001175
TiUP = 0.001175
B1x = 0.009859
B1y = 0.03679
E1z = ( - u1x*B1y + u1y*B1x )  # z-component of the motional field E = -u x B
ppcu = 16                      # particles per cell (upstream / injectors)

# downstream parameters
nDown = 3.99
u2x = 0.0375
u2y = 0.0001886
TeDown = 0.2156
TiDown = 0.8627
B2x = 0.00985
B2y = 0.14700
E2z = ( - u2x*B2y + u2y*B2x )  # z-component of the motional field E = -u x B
ppcd = 16                      # particles per cell (downstream)

xin = -6*dx
yin = -6*dy

slope1 = 50.0                  # scale length of the upstream density ramp
slope2 = 100.0

dt = float(0.95/np.sqrt( dx**-2 + dy**-2))    # timestep at 95% of the 2D CFL limit

Main(
    geometry = "2Dcartesian",

    interpolation_order = 2,

    timestep = dt,
    simulation_time = tsim,

    cell_length = [dx, dy],
    grid_length  = [Lx, Ly],
    number_of_patches = [ 16, 2 ],

    EM_boundary_conditions = [ ['silver-muller','silver-muller'], ["periodic","periodic"] ] ,

)

def upStreamDens(x,y):
    # upstream density profile: ~nUP for x < loc, dropping to zero over a scale ~slope1
    return nUP* 0.5* ( 1 + np.tanh( - ( x - loc ) / slope1 ) )

Species(
    name = 'eon1',
    position_initialization = 'random',
    momentum_initialization = 'maxwell-juettner',
    particles_per_cell = 0,        # species starts empty; particles enter through the injector
    mass = 1.0,
    charge = -1.0,
    number_density = upStreamDens,
    mean_velocity = [u1x,u1y,0.0],
    temperature = [TeUP],
    boundary_conditions = [
       ["remove", "remove"], ["periodic","periodic"] ],
)

Species(
    name = 'ion1',
    position_initialization = 'random',
    momentum_initialization = 'mj',
    particles_per_cell = 0,        # species starts empty; particles enter through the injector
    mass = mi, 
    charge = 1.0,
    number_density = upStreamDens,
    mean_velocity = [u1x,u1y,0.0],
    temperature = [TiUP],
    boundary_conditions = [
        ["remove", "remove"], ["periodic","periodic"] ],
)

ParticleInjector(
    name      = "Inj_eon1",
    species   = "eon1",
    box_side  = "xmin",
    position_initialization = "random",
    mean_velocity = [u1x,u1y,0.0],
    number_density = nUP,
    particles_per_cell = ppcu,
)
ParticleInjector(
    name      = "Inj_ion1",
    species   = "ion1",
    box_side  = "xmin",
    position_initialization = "Inj_eon1",
    mean_velocity = [u1x,u1y,0.0],
    number_density = nUP,
    particles_per_cell = ppcu,
)

mccoys commented 1 year ago

OK, so the issue lies in the position copy from the 1st to the 2nd injector. Are you sure this worked in 1D? I have not checked, but it seems the error would be the same.

Tissot11 commented 1 year ago

It did work in 1D last time (in March) and still does now. But there is one strange thing that I didn't mention before: on one machine (with older Intel compiler, MPI and HDF5 versions), I could run the 1D simulation even without your fix from March. On that machine the current 1D version (from GitHub) is still working, but I haven't checked the other machines. The 2D version crashes anyway.

mccoys commented 1 year ago

I just pushed a fix to develop. Basic tests are passing, but I did not check that the physical picture is correct. Could you please tell me whether everything works as you expect?

Tissot11 commented 1 year ago

I have checked it on two systems and the simulations seem to be running fine. I'll analyse the results and let you know if there is any concern.

Tissot11 commented 1 year ago

An update: simulations using the Intel compiler + Intel MPI are working fine. However, on the Juwels supercomputer, using GCC/11.3.0 + OpenMPI/4.1.4 + HDF5/1.12.2, and also using Intel/2022.1.0 + ParaStationMPI/5.8.0-1-mt + HDF5/1.12.2, I either see segmentation faults or the simulation gets stuck (not at a fixed simulation time) with the same namelist. Which compilers and MPI versions do you recommend?

mccoys commented 1 year ago

Are these the same segfaults as before? Are you sure you have the same Smilei version on both machines?

These compilers should be OK.

Tissot11 commented 1 year ago

No, they are different. I attach the .err file of one crashed simulation; the other one just got stuck during the computation. I use the same Smilei version, fetched from the develop branch last week, on each machine. I'm not sure whether this has something to do with the modules installed on Juwels.

tjob_hybrid.err.7685072.txt

mccoys commented 1 year ago

These might be related to memory limitations, I guess. If the processors are different, you may need to adapt the box decomposition.

Tissot11 commented 1 year ago

Thanks. I'll try it. I attach the stderr file for the simulation that crashed with the Intel/2022.1.0 + ParaStationMPI/5.8.0-1-mt + HDF5/1.12.2 modules. This message is different from the one before. The same simulation with Intel MPI is running fine.

tjob_hybrid.err.7713303.txt

mccoys commented 1 year ago

We have never tested with ParaStation MPI, nor even with MPICH. I recommend you keep using Intel MPI or Open MPI.

Tissot11 commented 1 year ago

I would always prefer to use the Intel compiler and Intel MPI, as I get better performance than with other combinations. However, on Juwels they plan to drop support for Intel MPI soon and recommend ParaStationMPI or OpenMPI with GCC. I had trouble with both of these combinations. I have yet to try the Intel + OpenMPI combination. Is this a combination you have already tested Smilei with?

mccoys commented 1 year ago

Yes, we have used that combination in the past, but things are never simple, and subtle compiler settings may change things.

For instance, make sure that your MPI library was compiled with support for MPI_THREAD_MULTIPLE.
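A quick way to verify this independently of Smilei is a tiny test built with the same MPI compiler wrapper. The sketch below (the file name and compile command are only illustrative) requests MPI_THREAD_MULTIPLE and prints the level the library actually provides; if it reports less than MPI_THREAD_MULTIPLE, the problem is in the MPI build rather than in Smilei.

// check_thread_level.cpp -- minimal sketch: request MPI_THREAD_MULTIPLE and
// report the thread support level the MPI library actually provides.
// Build with the same wrapper used for Smilei, e.g. mpicxx check_thread_level.cpp,
// and run it with a single rank.
#include <mpi.h>
#include <cstdio>

int main( int argc, char **argv )
{
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );

    int rank = 0;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if( rank == 0 ) {
        if( provided >= MPI_THREAD_MULTIPLE ) {
            std::printf( "MPI_THREAD_MULTIPLE is supported\n" );
        } else {
            std::printf( "MPI_THREAD_MULTIPLE is NOT provided (level %d)\n", provided );
        }
    }

    MPI_Finalize();
    return 0;
}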

Tissot11 commented 1 year ago

I'm having trouble on one machine where Smilei compiles fine but gets stuck while solving the Poisson equation at t=0. I don't see any other errors, and it is the same namelist as pasted before; it never gets stuck at the Poisson solver at t=0 anywhere else. On this cluster, user support and documentation are sub-par. Any suggestions on what to try? I used the following modules for compiling and running:

module load compiler/intel/2022.0.2 numlib/mkl/2022.0.2 mpi/impi/2021.5.1 lib/hdf5/1.12

I had to pass --mpi=pmi2 to the srun command on this machine for the simulation to start.

mccoys commented 1 year ago

Have you tried to compile with make config=no_mpi_tm?

Tissot11 commented 1 year ago

No. With this option, should I use only MPI processes and not OpenMP threads when running the simulations?

mccoys commented 1 year ago

You can use OpenMP as usual. This option simply disables one capability of your MPI library, but Smilei can still run with MPI + OpenMP.

Tissot11 commented 1 year ago

OK, thanks. I'll try it now and let you know this evening.

Tissot11 commented 1 year ago

It doesn't help. It still gets stuck at the Poisson solver at t=0.

mccoys commented 1 year ago

This looks like a problem with MPI. Maybe try running with 1 thread only, just to check whether it is due to OpenMP instead.

You should try running MPI test applications on this machine. Check the Intel MPI Benchmarks, for instance.

Other than the configuration above, Smilei does not have specific MPI requirements. I don't think we can be of much help here.
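If you want something smaller than the full benchmark suite, a sketch along the lines below (file name and iteration count are arbitrary) just repeats MPI_Allreduce, the kind of global reduction an iterative field solver performs. Build it with the same wrapper and launch it with the same srun options as Smilei; if it also hangs or crashes, the issue is in the MPI setup rather than in Smilei.

// allreduce_test.cpp -- minimal MPI sanity check: repeated global reductions.
// Build with the same wrapper used for Smilei, e.g. mpicxx allreduce_test.cpp,
// and launch with the same srun/mpirun options as the real run.
#include <mpi.h>
#include <cstdio>

int main( int argc, char **argv )
{
    MPI_Init( &argc, &argv );

    int rank = 0, size = 1;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    const int iterations = 1000;                  // arbitrary
    double local = rank + 1.0, global = 0.0;

    double t0 = MPI_Wtime();
    for( int i = 0; i < iterations; i++ ) {
        MPI_Allreduce( &local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
    }
    double t1 = MPI_Wtime();

    if( rank == 0 ) {
        std::printf( "%d ranks, %d allreduces, sum = %g, elapsed = %g s\n",
                     size, iterations, global, t1 - t0 );
    }

    MPI_Finalize();
    return 0;
}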

Tissot11 commented 1 year ago

Thanks. I'll try these suggestions. Since the original issue has already been resolved, you can now close this ticket.