Closed: prkkumar closed this issue 2 years ago.
I tried to print out some info while tracking this issue. I'm running this on Summit (8 nodes, 48 GPUs).
Referring to the function `DampFieldsInGuards` defined here
https://github.com/ECP-WarpX/WarpX/blob/cae924a5b7d3c0d21871c294d518496776dc1961/Source/FieldSolver/WarpXPushFieldsEM.cpp#L706
as well as the function `constrain_tilebox_to_guards` defined here
https://github.com/ECP-WarpX/WarpX/blob/cae924a5b7d3c0d21871c294d518496776dc1961/Source/FieldSolver/WarpXPushFieldsEM_K.H#L27
I printed out the boxes `tex` and `tex_guard` from within `DampFieldsInGuards`, as well as the parameters `n_guard` and `upper_bound` (so, for the lower guard) from within `constrain_tilebox_to_guards`. Here is an example of what I get, which I want to make sure I understand correctly:
```
tex = ((116,52,3316) (204,140,3404) (1,1,1))
n_guard = -3316
upper_bound = 3316
tex_guard = ((116,52,3316) (204,140,3315) (1,1,1))
```
@atmyers @dpgrote To start with, do you understand the `lo` and `hi` bounds along z of the resulting `tex_guard` in this case? Is this a way to return an "empty" box in z (by having `hi` < `lo`), given that this isn't the box storing "outer" guard cells?
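For reference, here is a minimal standalone sketch (plain C++, not the actual WarpX implementation) of how I read those numbers, assuming the lower-guard box keeps the tile's lower bound and caps its upper bound at `upper_bound - 1`, which matches the printout above:

```cpp
#include <algorithm>
#include <iostream>

int main ()
{
    // z-direction numbers from the printout above
    const int tex_lo      = 3316;  // lower z bound of tex
    const int tex_hi      = 3404;  // upper z bound of tex
    const int upper_bound = 3316;  // upper_bound passed for the lower guard

    // Assumed clipping: keep the tile's lower bound, cap the upper bound
    // just below upper_bound.
    const int guard_lo = tex_lo;
    const int guard_hi = std::min(tex_hi, upper_bound - 1);

    std::cout << "tex_guard z-range: [" << guard_lo << ", " << guard_hi << "]\n";
    // Prints [3316, 3315]: hi < lo, i.e. an effectively empty box, consistent
    // with this tile not containing any lower-guard cells.
    return 0;
}
```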
Another small update: it's possible to reproduce the issue with 1024 cells along z (instead of 3072), by setting `warpx.numprocs = 1 1 2` and running on 2 GPUs only (which results in 2 grids on MR level 0 and 18 grids on MR level 1). Reducing the size of the simulation and simplifying the domain decomposition a bit should make the debugging a little easier.
This is a situation that causes an out-of-bounds access in the last setup described in my previous comment:
```
tex = ((102,102,1090) (154,154,1144) (1,1,1))
tex_guard = ((102,102,1084) (154,154,1144) (1,1,1))
lbound(Ex_arr) = (102,102,1090)
ubound(Ex_arr) = (154,154,1144)
```
When we damp the field `Ex` in the guard cells, we loop over the cells of `tex_guard`, whose lower bound in z is 1084, but access data from `Ex_arr`, whose lower bound in z is 1090, as for `tex`.
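To make the mismatch explicit, here is a small hedged sketch (assuming a 3D AMReX build; this is not the actual damping kernel) that intersects the two index boxes and shows which z indices the loop visits without backing data:

```cpp
#include <AMReX_Box.H>
#include <AMReX_IntVect.H>
#include <iostream>

int main ()
{
    using amrex::Box;
    using amrex::IntVect;

    // Boxes from the printout above
    const Box loop_box (IntVect(102,102,1084), IntVect(154,154,1144)); // tex_guard
    const Box ex_box   (IntVect(102,102,1090), IntVect(154,154,1144)); // Ex_arr index bounds

    const Box overlap = loop_box & ex_box;  // cells actually backed by Ex_arr
    std::cout << "loop box: " << loop_box << "\n"
              << "Ex box:   " << ex_box   << "\n"
              << "overlap:  " << overlap  << "\n";
    // The overlap starts at z = 1090, while the loop starts at z = 1084:
    // the cells z = 1084..1089 are accessed out of bounds.
    return 0;
}
```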
I'm thinking that the fact that we always extract the domain information from `Geom(0)` (so, I guess, always from level 0),
https://github.com/ECP-WarpX/WarpX/blob/53f590c804ea39f040bcbecb5ad7bb6ab939e06e/Source/FieldSolver/WarpXPushFieldsEM.cpp#L756-L757
might be an issue. I will test this hypothesis soon and report here if the fix is relevant.
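As a sketch of the hypothesis (hypothetical helper name, not the actual change that ended up in the code), the idea is that the guard-cell bounds should be computed from the geometry of the level being damped:

```cpp
#include <AMReX_Box.H>
#include <AMReX_Geometry.H>
#include <AMReX_Vector.H>

// Hedged sketch: hypothetical helper returning the domain box used to compute
// the guard-cell bounds for level `lev`. The hypothesis is that the level's
// own geometry should be used (geom[lev].Domain()) instead of the level-0 one
// (geom[0].Domain()), so that the bounds match the index space of the
// fine-level fields passed to DampFieldsInGuards.
amrex::Box DampingDomainBox (const amrex::Vector<amrex::Geometry>& geom, int lev)
{
    return geom[lev].Domain();
}
```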
@prkkumar I tried the fix above on the smaller simulation I had set up for debugging and it seemed to fix the illegal memory access. Would you be able to try running your original test on the branch of #2809 in order to see if you can confirm independently that the bug fix in that PR actually fixes the issue reported here? Thank you!
@prkkumar please feel free to reopen if the problem persists. As noted in #2809, we should add CI to cover that functionality.
@EZoni I tested the input script posted above on the branch of #2809 on Perlmutter. I am still getting an illegal memory access error, but at a much later time (time step 4886). The original issue, before your fix, caused an error at the very first time step. The std output is:
amrex::Abort::10::CUDA error 700 in file /global/homes/p/prkumar1/src_damped_mr/src/warpx/build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 642: an illegal memory access was encountered !!!
SIGABRT
See Backtrace.10 file for details
MPICH Notice [Rank 10] [job id 1354302.0] [Wed Feb 2 09:07:39 2022] [nid001088] - Abort(6) (rank 10 in comm 480): application called MPI_Abort(comm=0x84000003, 6) - process 10
srun: error: nid001084: tasks 0,2: Exited with exit code 6
srun: launch/slurm: _step_signal: Terminating StepId=1354302.0
slurmstepd: error: *** STEP 1354302.0 ON nid001084 CANCELLED AT 2022-02-02T17:07:41 ***
srun: error: nid001084: task 1: Exited with exit code 6
srun: error: nid001084: task 3: Terminated
srun: error: nid001088: tasks 8-9: Terminated
srun: error: nid001089: tasks 12,15: Terminated
srun: error: nid001085: tasks 4,7: Terminated
srun: error: nid001089: task 14: Terminated
srun: error: nid001085: task 6: Terminated
srun: error: nid001088: task 11: Terminated
srun: error: nid001085: task 5: Terminated
srun: error: nid001089: task 13: Terminated
srun: error: nid001088: task 10: Terminated
srun: Force Terminated StepId=1354302.0
The Backtrace:
=== If no file names and line numbers are shown below, one can run
addr2line -Cpfie my_exefile my_line_address
to convert `my_line_address` (e.g., 0x4a6b) into file name and line number.
Or one can use amrex/Tools/Backtrace/parse_bt.py.
=== Please note that the line number reported by addr2line may not be accurate.
One can use
readelf -wl my_exefile | grep my_line_address'
to find out the offset for that line.
0: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x928db6]
_ZN5amrex11BLBackTrace20print_backtrace_infoEP8_IO_FILE
??:0
1: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x92af24]
_ZN5amrex11BLBackTrace7handlerEi
??:0
2: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x910cfd]
_ZN5amrex3Gpu6Device17streamSynchronizeEv
??:0
3: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x7bdb9b]
_Z15stablePartitionIPlET_S1_S1_RKN5amrex9PODVectorIiNS2_14ArenaAllocatorIiEEEE.isra.0
??:0
4: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x7be4c8]
_ZN25PhysicalParticleContainer27PartitionParticlesInBuffersERlS0_lR12WarpXParIteriPKN5amrex9iMultiFabES6_RNS3_9PODVectorIdNS3_14ArenaAllocatorIdEEEESB_SB_SB_
??:0
5: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x76ae9c]
_ZN25PhysicalParticleContainer6EvolveEiRKN5amrex8MultiFabES3_S3_S3_S3_S3_RS1_S4_S4_PS1_S5_S5_S5_S5_PS2_S6_S6_S6_S6_S6_dd6DtTypeb
??:0
6: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x73cd0b]
_ZN22MultiParticleContainer6EvolveEiRKN5amrex8MultiFabES3_S3_S3_S3_S3_RS1_S4_S4_PS1_S5_S5_S5_S5_PS2_S6_S6_S6_S6_S6_dd6DtTypeb
??:0
7: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x66750a]
_ZN5WarpX22PushParticlesandDeposeEid6DtTypeb
??:0
8: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x6689db]
_ZN5WarpX13OneStep_nosubEd
??:0
9: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x66ae88]
_ZN5WarpX6EvolveEi
??:0
10: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x48bc6f]
main
??:0
11: /lib64/libc.so.6(__libc_start_main+0xea) [0x7f0725cfd34a]
__libc_start_main
??:0
12: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x4c8c8a]
_start
../sysdeps/x86_64/start.S:122
===== TinyProfilers ======
main()
WarpX::Evolve()
WarpX::Evolve::step
WarpX::OneStep_nosub()
PhysicalParticleContainer::Evolve()
PhysicalParticleContainer::PartitionParticlesInBuffers
Thanks @prkkumar, let's reopen the issue then. If the illegal memory access is now occurring after thousands of iterations, it might be related to a part of the code different from the one I fixed in #2809.
A couple of points:
- Would it be possible to rebuild with `AMReX_BOUND_CHECK` turned ON? That might tell us more about where the illegal memory access is occurring. Ideally we would want to run in DEBUG mode to get the clearest backtraces, but it would probably be too expensive.
- Turning ON `AMReX_ASSERTIONS` might be useful, too.
Corresponding results with pml look reasonable.
One more thing: if I visualize the data only on level zero in the case of damped bc, then the results look fine. The high value (1e38) is on level 1. The following results are with level-zero data only.
OK, thank you for the tests.
The illegal memory access could be due to some particles being too fast and trying to deposit their charge and currents outside the regions allocated for deposition. This could be caused by unstable fields that produce electromagnetic forces that are too strong. We have ASSERTs checking this in the code, and we should see such ASSERTs being triggered if we turn ON all assertions in the run (by setting `AMReX_ASSERTIONS` ON in the CMake build configuration).
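For reference, here is a hedged sketch of the kind of check meant here (the exact macro and call site in WarpX may differ), built on AMReX's `numParticlesOutOfRange` utility:

```cpp
#include <AMReX_BLassert.H>
#include <AMReX_ParticleUtil.H>

// Hedged sketch, not the actual WarpX call site: abort as soon as any
// particle's deposition stencil would reach outside the tile (CPU) or the
// guard cells (GPU) allocated for current deposition.
template <class ParIter>
void CheckDepositionRange (const ParIter& pti, int range /* allowed reach in cells */)
{
    AMREX_ALWAYS_ASSERT_WITH_MESSAGE(
        amrex::numParticlesOutOfRange(pti, range) == 0,
        "Particles shape does not fit within tile (CPU) or guard cells (GPU) "
        "used for current deposition");
}
```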
One cause of such an instability could again be the number of guard cells, which might play a bigger role with damped BCs than with PMLs. This is just a hypothesis, but I think it'd be worth re-running the simulation with an appropriate number of guard cells, as we did in #2798. If the setup is the same, you could again use 20 guard cells transversally instead of 12. Otherwise, could you post `dx`, `dy`, `dz` and `dt` here again, so that I can quickly check how many guard cells would be appropriate?
Thanks for the suggestions, @EZoni. I tried 20 guard cells transversally, instead of 12, and ran with `AMREX_ASSERTIONS` turned ON. The code crashes with the following assertion:
4::Assertion `amrex::numParticlesOutOfRange(pti, range) == 0' failed, file "/global/homes/p/prkumar1/src_damped_mr/src/warpx/Source/Particles/WarpXParticleContainer.cpp", line 349, Msg: "Particles shape does not fit within tile (CPU) or guard cells (GPU) used for current deposition" !!!
SIGABRT
See Backtrace.4 file for details
MPICH Notice [Rank 4] [job id 1403612.0] [Wed Feb 9 07:33:46 2022] [nid003029] - Abort(6) (rank 4 in comm 480): application called MPI_Abort(comm=0x84000005, 6) - process 4
srun: error: nid003028: tasks 0-1: Exited with exit code 6
srun: launch/slurm: _step_signal: Terminating StepId=1403612.0
slurmstepd: error: *** STEP 1403612.0 ON nid003028 CANCELLED AT 2022-02-09T15:33:46 ***
srun: error: nid003032: task 9: Exited with exit code 6
srun: error: nid003029: tasks 5-6: Exited with exit code 6
srun: error: nid003028: task 2: Exited with exit code 6
srun: error: nid003032: tasks 10-11: Exited with exit code 6
srun: error: nid003033: tasks 14-15: Exited with exit code 6
srun: error: nid003029: task 7: Exited with exit code 6
srun: error: nid003028: task 3: Exited with exit code 6
srun: error: nid003033: task 13: Exited with exit code 6
srun: error: nid003029: task 4: Exited with exit code 6
srun: error: nid003032: task 8: Exited with exit code 6
srun: error: nid003033: task 12: Exited with exit code 6
Backtrace.4 contains
=== If no file names and line numbers are shown below, one can run
addr2line -Cpfie my_exefile my_line_address
to convert `my_line_address` (e.g., 0x4a6b) into file name and line number.
Or one can use amrex/Tools/Backtrace/parse_bt.py.
=== Please note that the line number reported by addr2line may not be accurate.
One can use
readelf -wl my_exefile | grep my_line_address'
to find out the offset for that line.
0: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x9a7606]
_ZN5amrex11BLBackTrace20print_backtrace_infoEP8_IO_FILE
??:0
1: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x9a9781]
_ZN5amrex11BLBackTrace7handlerEi
??:0
2: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x85eb80]
_ZN5amrex11Assert_hostEPKcS1_iS1_
??:0
3: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x7d2ce9]
_ZN22WarpXParticleContainer14DepositCurrentER12WarpXParIterRKN5amrex9PODVectorIdNS2_14ArenaAllocatorIdEEEES8_S8_S8_PKiPNS2_8MultiFabESC_SC_lliiidd
??:0
4: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x7c128c]
_ZN25PhysicalParticleContainer6EvolveEiRKN5amrex8MultiFabES3_S3_S3_S3_S3_RS1_S4_S4_PS1_S5_S5_S5_S5_PS2_S6_S6_S6_S6_S6_dd6DtTypeb
??:0
5: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x79016b]
_ZN22MultiParticleContainer6EvolveEiRKN5amrex8MultiFabES3_S3_S3_S3_S3_RS1_S4_S4_PS1_S5_S5_S5_S5_PS2_S6_S6_S6_S6_S6_dd6DtTypeb
??:0
6: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x68e2b8]
_ZN5WarpX22PushParticlesandDeposeEid6DtTypeb
??:0
7: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x690d8b]
_ZN5WarpX13OneStep_nosubEd
??:0
8: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x693b31]
_ZN5WarpX6EvolveEi
??:0
9: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x48ab2d]
main
??:0
10: /lib64/libc.so.6(__libc_start_main+0xea) [0x7fa96097734a]
__libc_start_main
??:0
11: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x4c7f7a]
_start
../sysdeps/x86_64/start.S:122
===== TinyProfilers ======
main()
WarpX::Evolve()
WarpX::Evolve::step
WarpX::OneStep_nosub()
PhysicalParticleContainer::Evolve()
@prkkumar Thank you. So, as we thought, the simulation is unstable for some reason (high fields and particles that move too fast, triggering that ASSERT and the illegal memory access; not an algorithmic bug in the strict sense anymore). Just checking: you increased only the number of guard cells (from 12 to 20), leaving the order of the spectral solver at 12, right?
One other possible issue here is that we might be using too many guard cells compared to the size of the subdomains used in the PML regions around the coarse and fine patches. I'm not sure if this could cause such an instability, but it could definitely be a problem (as it is when we have more guard cells than valid cells per subdomain in the regular, non-PML, grids). #2779 should introduce a check for this in the PML, but it hasn't been merged yet. If such a situation occurs, one might need adjustments similar to those described in this comment, but this is something that requires further discussion with other developers.
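For context, a hedged sketch of the kind of check meant here (hypothetical names; the actual PML check is the subject of #2779):

```cpp
#include <AMReX_BoxArray.H>
#include <AMReX_IntVect.H>

// Hedged sketch with hypothetical names: return true only if every grid in
// `ba` has at least as many valid cells as guard cells along each direction.
// Having more guard cells than valid cells per subdomain is already known to
// be a problem in the regular (non-PML) grids, and the same condition could
// be checked for the PML box arrays.
bool GuardCellsFitInGrids (const amrex::BoxArray& ba, const amrex::IntVect& nguard)
{
    for (int i = 0; i < static_cast<int>(ba.size()); ++i) {
        const amrex::Box b = ba[i];
        for (int dir = 0; dir < AMREX_SPACEDIM; ++dir) {
            if (b.length(dir) < nguard[dir]) { return false; }
        }
    }
    return true;
}
```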
@EZoni Yes, I used 20 guard cells with the order of the spectral solver kept at 12.
As discussed in a meeting with @EZoni and @RemiLehe, I tried
```cpp
// Damp the fields in the guard cells (restricted to level 0 only)
for (int lev = 0; lev <= 0; ++lev)
{
    DampFieldsInGuards(lev, Efield_fp[lev], Bfield_fp[lev]);
}
```
but I observe the same instability.
@prkkumar As discussed offline, one thing to try is running the simulation with `damped` boundary conditions on the branch of #2854, to see if the changes implemented there play any role in the instabilities reported here.
@EZoni #2854 resolves the instability issue I was seeing. I tested the above input script on the branch of #2854 and everything looks alright. Thank you!!
LWFA simulation with mesh refinement crashes with an illegal memory access when `damped` boundary conditions are used with the PSATD solver. When the `damped` bc is replaced by `pml`, the simulation runs fine. The input file and the Backtrace are attached below.
Input script:
Backtrace: