Closed: prkkumar closed this issue 2 years ago.
I tried to print out some info while tracking this issue. I'm running this on Summit (8 nodes, 48 GPUs).
Referring to the function `DampFieldsInGuards` defined here
https://github.com/ECP-WarpX/WarpX/blob/cae924a5b7d3c0d21871c294d518496776dc1961/Source/FieldSolver/WarpXPushFieldsEM.cpp#L706
as well as the function `constrain_tilebox_to_guards` defined here
https://github.com/ECP-WarpX/WarpX/blob/cae924a5b7d3c0d21871c294d518496776dc1961/Source/FieldSolver/WarpXPushFieldsEM_K.H#L27
I printed out the boxes `tex` and `tex_guard` from within `DampFieldsInGuards`, as well as the parameters `n_guard` and `upper_bound` (so, for the lower guard) from within `constrain_tilebox_to_guards`. Here is an example of what I get, which I want to make sure I understand correctly:
```
tex = ((116,52,3316) (204,140,3404) (1,1,1))
n_guard = -3316
upper_bound = 3316
tex_guard = ((116,52,3316) (204,140,3315) (1,1,1))
```
@atmyers @dpgrote To start with, do you understand the `lo` and `hi` bounds along z of the resulting `tex_guard` in this case? Is this a way to return an "empty" box in z (by having `hi` < `lo`), given that this isn't the box storing "outer" guard cells?
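For reference, here is a minimal standalone sketch (plain C++, not the actual WarpX implementation) of how I read those numbers, assuming the lower-guard box keeps the tile's lower bound and caps its upper bound at `upper_bound - 1`, which matches the printout above:

```cpp
#include <algorithm>
#include <iostream>

int main ()
{
    // z-direction numbers from the printout above
    const int tex_lo      = 3316;  // lower z bound of tex
    const int tex_hi      = 3404;  // upper z bound of tex
    const int upper_bound = 3316;  // upper_bound passed for the lower guard

    // Assumed clipping: keep the tile's lower bound, cap the upper bound
    // just below upper_bound.
    const int guard_lo = tex_lo;
    const int guard_hi = std::min(tex_hi, upper_bound - 1);

    std::cout << "tex_guard z-range: [" << guard_lo << ", " << guard_hi << "]\n";
    // Prints [3316, 3315]: hi < lo, i.e. an effectively empty box, consistent
    // with this tile not containing any lower-guard cells.
    return 0;
}
```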
Another small update: it's possible to reproduce the issue with 1024 cells along z (instead of 3072), by setting `warpx.numprocs = 1 1 2` and running on 2 GPUs only (which results in 2 grids on MR level 0 and 18 grids on MR level 1). Reducing the size of the simulation and simplifying the domain decomposition a bit should make the debugging a little easier.
This is a situation that causes an out-of-bounds access in the last setup described in my previous comment:
```
tex = ((102,102,1090) (154,154,1144) (1,1,1))
tex_guard = ((102,102,1084) (154,154,1144) (1,1,1))
lbound(Ex_arr) = (102,102,1090)
ubound(Ex_arr) = (154,154,1144)
```
When we damp the field `Ex` in the guard cells, we loop over the cells of `tex_guard`, whose lower bound in z is 1084, but access data from `Ex_arr`, whose lower bound in z is 1090, as for `tex`.
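To make the mismatch explicit, here is a small hedged sketch (assuming a 3D AMReX build; this is not the actual damping kernel) that intersects the two index boxes and shows which z indices the loop visits without backing data:

```cpp
#include <AMReX_Box.H>
#include <AMReX_IntVect.H>
#include <iostream>

int main ()
{
    using amrex::Box;
    using amrex::IntVect;

    // Boxes from the printout above
    const Box loop_box (IntVect(102,102,1084), IntVect(154,154,1144)); // tex_guard
    const Box ex_box   (IntVect(102,102,1090), IntVect(154,154,1144)); // Ex_arr index bounds

    const Box overlap = loop_box & ex_box;  // cells actually backed by Ex_arr
    std::cout << "loop box: " << loop_box << "\n"
              << "Ex box:   " << ex_box   << "\n"
              << "overlap:  " << overlap  << "\n";
    // The overlap starts at z = 1090, while the loop starts at z = 1084:
    // the cells z = 1084..1089 are accessed out of bounds.
    return 0;
}
```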
I'm thinking that the fact that we always extract the domain information from `Geom(0)` (so, I guess, always from level 0),
https://github.com/ECP-WarpX/WarpX/blob/53f590c804ea39f040bcbecb5ad7bb6ab939e06e/Source/FieldSolver/WarpXPushFieldsEM.cpp#L756-L757
might be an issue. I will test this hypothesis soon and report here if the fix is relevant.
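As a sketch of the hypothesis (hypothetical helper name, not the actual change that ended up in the code), the idea is that the guard-cell bounds should be computed from the geometry of the level being damped:

```cpp
#include <AMReX_Box.H>
#include <AMReX_Geometry.H>
#include <AMReX_Vector.H>

// Hedged sketch: hypothetical helper returning the domain box used to compute
// the guard-cell bounds for level `lev`. The hypothesis is that the level's
// own geometry should be used (geom[lev].Domain()) instead of the level-0 one
// (geom[0].Domain()), so that the bounds match the index space of the
// fine-level fields passed to DampFieldsInGuards.
amrex::Box DampingDomainBox (const amrex::Vector<amrex::Geometry>& geom, int lev)
{
    return geom[lev].Domain();
}
```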
@prkkumar I tried the fix above on the smaller simulation I had set up for debugging and it seemed to fix the illegal memory access. Would you be able to try running your original test on the branch of #2809 in order to see if you can confirm independently that the bug fix in that PR actually fixes the issue reported here? Thank you!
@prkkumar please feel free to reopen if the problem persists. As noted in #2809, we should add CI to cover that functionality.
@EZoni I tested the input script posted above on the branch of #2809 on Perlmutter. I am still getting an illegal memory access error, but at a much later time (time step 4886). The original issue, before your fix, caused an error at the very first time step. The std output is:
amrex::Abort::10::CUDA error 700 in file /global/homes/p/prkumar1/src_damped_mr/src/warpx/build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 642: an illegal memory access was encountered !!!
SIGABRT
See Backtrace.10 file for details
MPICH Notice [Rank 10] [job id 1354302.0] [Wed Feb 2 09:07:39 2022] [nid001088] - Abort(6) (rank 10 in comm 480): application called MPI_Abort(comm=0x84000003, 6) - process 10
srun: error: nid001084: tasks 0,2: Exited with exit code 6
srun: launch/slurm: _step_signal: Terminating StepId=1354302.0
slurmstepd: error: *** STEP 1354302.0 ON nid001084 CANCELLED AT 2022-02-02T17:07:41 ***
srun: error: nid001084: task 1: Exited with exit code 6
srun: error: nid001084: task 3: Terminated
srun: error: nid001088: tasks 8-9: Terminated
srun: error: nid001089: tasks 12,15: Terminated
srun: error: nid001085: tasks 4,7: Terminated
srun: error: nid001089: task 14: Terminated
srun: error: nid001085: task 6: Terminated
srun: error: nid001088: task 11: Terminated
srun: error: nid001085: task 5: Terminated
srun: error: nid001089: task 13: Terminated
srun: error: nid001088: task 10: Terminated
srun: Force Terminated StepId=1354302.0
The Backtrace:
=== If no file names and line numbers are shown below, one can run
addr2line -Cpfie my_exefile my_line_address
to convert `my_line_address` (e.g., 0x4a6b) into file name and line number.
Or one can use amrex/Tools/Backtrace/parse_bt.py.
=== Please note that the line number reported by addr2line may not be accurate.
One can use
readelf -wl my_exefile | grep my_line_address'
to find out the offset for that line.
0: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x928db6]
_ZN5amrex11BLBackTrace20print_backtrace_infoEP8_IO_FILE
??:0
1: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x92af24]
_ZN5amrex11BLBackTrace7handlerEi
??:0
2: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x910cfd]
_ZN5amrex3Gpu6Device17streamSynchronizeEv
??:0
3: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x7bdb9b]
_Z15stablePartitionIPlET_S1_S1_RKN5amrex9PODVectorIiNS2_14ArenaAllocatorIiEEEE.isra.0
??:0
4: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x7be4c8]
_ZN25PhysicalParticleContainer27PartitionParticlesInBuffersERlS0_lR12WarpXParIteriPKN5amrex9iMultiFabES6_RNS3_9PODVectorIdNS3_14ArenaAllocatorIdEEEESB_SB_SB_
??:0
5: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x76ae9c]
_ZN25PhysicalParticleContainer6EvolveEiRKN5amrex8MultiFabES3_S3_S3_S3_S3_RS1_S4_S4_PS1_S5_S5_S5_S5_PS2_S6_S6_S6_S6_S6_dd6DtTypeb
??:0
6: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x73cd0b]
_ZN22MultiParticleContainer6EvolveEiRKN5amrex8MultiFabES3_S3_S3_S3_S3_RS1_S4_S4_PS1_S5_S5_S5_S5_PS2_S6_S6_S6_S6_S6_dd6DtTypeb
??:0
7: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x66750a]
_ZN5WarpX22PushParticlesandDeposeEid6DtTypeb
??:0
8: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x6689db]
_ZN5WarpX13OneStep_nosubEd
??:0
9: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x66ae88]
_ZN5WarpX6EvolveEi
??:0
10: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x48bc6f]
main
??:0
11: /lib64/libc.so.6(__libc_start_main+0xea) [0x7f0725cfd34a]
__libc_start_main
??:0
12: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x4c8c8a]
_start
../sysdeps/x86_64/start.S:122
===== TinyProfilers ======
main()
WarpX::Evolve()
WarpX::Evolve::step
WarpX::OneStep_nosub()
PhysicalParticleContainer::Evolve()
PhysicalParticleContainer::PartitionParticlesInBuffers
Thanks @prkkumar, let's reopen the issue then. If the illegal memory access is now occurring after thousands of iterations, it might be related to a part of the code different from the one I fixed in #2809.
A couple of points:
- Would it be possible to rebuild with `AMReX_BOUND_CHECK` turned ON? That might tell us more about where the illegal memory access is occurring. Ideally we would want to run in DEBUG mode to get the clearest backtraces, but it would probably be too expensive.
- Turning ON `AMReX_ASSERTIONS` might be useful, too.
Corresponding results with pml look reasonable.
One more thing: if I visualize the data only on level zero in the case of damped bc, then the results look fine. The high value (1e38) is on level 1. The following results are with level-zero data only.
OK, thank you for the tests.
The illegal memory access could be due to some particles being too fast and trying to deposit their charge and currents outside the regions allocated for deposition. This could be caused by unstable fields that produce electromagnetic forces that are too strong. We have ASSERTs checking this in the code, and we should see such ASSERTs being triggered if we turn ON all assertions in the run (by setting `AMReX_ASSERTIONS` ON in the CMake build configuration).
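For reference, here is a hedged sketch of the kind of check meant here (the exact macro and call site in WarpX may differ), built on AMReX's `numParticlesOutOfRange` utility:

```cpp
#include <AMReX_BLassert.H>
#include <AMReX_ParticleUtil.H>

// Hedged sketch, not the actual WarpX call site: abort as soon as any
// particle's deposition stencil would reach outside the tile (CPU) or the
// guard cells (GPU) allocated for current deposition.
template <class ParIter>
void CheckDepositionRange (const ParIter& pti, int range /* allowed reach in cells */)
{
    AMREX_ALWAYS_ASSERT_WITH_MESSAGE(
        amrex::numParticlesOutOfRange(pti, range) == 0,
        "Particles shape does not fit within tile (CPU) or guard cells (GPU) "
        "used for current deposition");
}
```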
One cause of such an instability could again be the number of guard cells, which might play a bigger role with damped BCs than with PMLs. This is just a hypothesis, but I think it'd be worth re-running the simulation with an appropriate number of guard cells, as we did in #2798. If the setup is the same, you could again use 20 guard cells transversally instead of 12. Otherwise, could you post `dx`, `dy`, `dz` and `dt` here again, so that I can quickly check how many guard cells would be appropriate?
Thanks for the suggestions, @EZoni. I tried 20 guard cells transversally, instead of 12, and ran with `AMREX_ASSERTIONS` turned ON. The code crashes with the following assertion:
4::Assertion `amrex::numParticlesOutOfRange(pti, range) == 0' failed, file "/global/homes/p/prkumar1/src_damped_mr/src/warpx/Source/Particles/WarpXParticleContainer.cpp", line 349, Msg: "Particles shape does not fit within tile (CPU) or guard cells (GPU) used for current deposition" !!!
SIGABRT
See Backtrace.4 file for details
MPICH Notice [Rank 4] [job id 1403612.0] [Wed Feb 9 07:33:46 2022] [nid003029] - Abort(6) (rank 4 in comm 480): application called MPI_Abort(comm=0x84000005, 6) - process 4
srun: error: nid003028: tasks 0-1: Exited with exit code 6
srun: launch/slurm: _step_signal: Terminating StepId=1403612.0
slurmstepd: error: *** STEP 1403612.0 ON nid003028 CANCELLED AT 2022-02-09T15:33:46 ***
srun: error: nid003032: task 9: Exited with exit code 6
srun: error: nid003029: tasks 5-6: Exited with exit code 6
srun: error: nid003028: task 2: Exited with exit code 6
srun: error: nid003032: tasks 10-11: Exited with exit code 6
srun: error: nid003033: tasks 14-15: Exited with exit code 6
srun: error: nid003029: task 7: Exited with exit code 6
srun: error: nid003028: task 3: Exited with exit code 6
srun: error: nid003033: task 13: Exited with exit code 6
srun: error: nid003029: task 4: Exited with exit code 6
srun: error: nid003032: task 8: Exited with exit code 6
srun: error: nid003033: task 12: Exited with exit code 6
Backtrace.4 contains
=== If no file names and line numbers are shown below, one can run
addr2line -Cpfie my_exefile my_line_address
to convert `my_line_address` (e.g., 0x4a6b) into file name and line number.
Or one can use amrex/Tools/Backtrace/parse_bt.py.
=== Please note that the line number reported by addr2line may not be accurate.
One can use
readelf -wl my_exefile | grep my_line_address'
to find out the offset for that line.
0: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x9a7606]
_ZN5amrex11BLBackTrace20print_backtrace_infoEP8_IO_FILE
??:0
1: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x9a9781]
_ZN5amrex11BLBackTrace7handlerEi
??:0
2: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x85eb80]
_ZN5amrex11Assert_hostEPKcS1_iS1_
??:0
3: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x7d2ce9]
_ZN22WarpXParticleContainer14DepositCurrentER12WarpXParIterRKN5amrex9PODVectorIdNS2_14ArenaAllocatorIdEEEES8_S8_S8_PKiPNS2_8MultiFabESC_SC_lliiidd
??:0
4: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x7c128c]
_ZN25PhysicalParticleContainer6EvolveEiRKN5amrex8MultiFabES3_S3_S3_S3_S3_RS1_S4_S4_PS1_S5_S5_S5_S5_PS2_S6_S6_S6_S6_S6_dd6DtTypeb
??:0
5: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x79016b]
_ZN22MultiParticleContainer6EvolveEiRKN5amrex8MultiFabES3_S3_S3_S3_S3_RS1_S4_S4_PS1_S5_S5_S5_S5_PS2_S6_S6_S6_S6_S6_dd6DtTypeb
??:0
6: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x68e2b8]
_ZN5WarpX22PushParticlesandDeposeEid6DtTypeb
??:0
7: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x690d8b]
_ZN5WarpX13OneStep_nosubEd
??:0
8: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x693b31]
_ZN5WarpX6EvolveEi
??:0
9: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x48ab2d]
main
??:0
10: /lib64/libc.so.6(__libc_start_main+0xea) [0x7fa96097734a]
__libc_start_main
??:0
11: /pscratch/sd/p/prkumar1/ion_motion/mr_psatd/damped_bug/EZoni_fix/20guards/./warpx.3d.MPI.CUDA.DP.OPMD.PSATD.QED() [0x4c7f7a]
_start
../sysdeps/x86_64/start.S:122
===== TinyProfilers ======
main()
WarpX::Evolve()
WarpX::Evolve::step
WarpX::OneStep_nosub()
PhysicalParticleContainer::Evolve()
@prkkumar Thank you. So, as we thought, the simulation is unstable for some reason (high fields and particles that move too fast, triggering that ASSERT and the illegal memory access; not an algorithmic bug in the strict sense anymore). Just checking: you increased only the number of guard cells (from 12 to 20), leaving the order of the spectral solver at 12, right?
One other possible issue here is that we might be using too many guard cells compared to the size of the subdomains used in the PML regions around the coarse and fine patches. I'm not sure if this could cause such an instability, but it could definitely be a problem (as it is when we have more guard cells than valid cells per subdomain in the regular, non-PML, grids). #2779 should introduce a check for this in the PML, but it hasn't been merged yet. If such a situation occurs, one might need adjustments similar to those described in this comment, but this is something that requires further discussion with other developers.
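For context, a hedged sketch of the kind of check meant here (hypothetical names; the actual PML check is the subject of #2779):

```cpp
#include <AMReX_BoxArray.H>
#include <AMReX_IntVect.H>

// Hedged sketch with hypothetical names: return true only if every grid in
// `ba` has at least as many valid cells as guard cells along each direction.
// Having more guard cells than valid cells per subdomain is already known to
// be a problem in the regular (non-PML) grids, and the same condition could
// be checked for the PML box arrays.
bool GuardCellsFitInGrids (const amrex::BoxArray& ba, const amrex::IntVect& nguard)
{
    for (int i = 0; i < static_cast<int>(ba.size()); ++i) {
        const amrex::Box b = ba[i];
        for (int dir = 0; dir < AMREX_SPACEDIM; ++dir) {
            if (b.length(dir) < nguard[dir]) { return false; }
        }
    }
    return true;
}
```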
@EZoni Yes, I used 20 guard cells with the order of the spectral solver kept at 12.
As discussed in a meeting with @EZoni and @RemiLehe, I tried
```cpp
// Damp the fields in the guard cells (restricted to level 0 only)
for (int lev = 0; lev <= 0; ++lev)
{
    DampFieldsInGuards(lev, Efield_fp[lev], Bfield_fp[lev]);
}
```
but I observe the same instability.
@prkkumar As discussed offline, one thing to try is running the simulation with `damped` boundary conditions on the branch of #2854, to see if the changes implemented there play any role in the instabilities reported here.
@EZoni #2854 resolves the instability issue I was seeing. I tested the above input script on the branch of #2854 and everything looks alright. Thank you!!
LWFA simulation with mesh refinement crashes with an illegal memory access when `damped` boundary conditions are used with the PSATD solver. When the `damped` bc is replaced by `pml`, the simulation runs fine. The input file and the Backtrace are attached below.
Input script:
Backtrace: