ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

Cryptic error message when value_function parameter in Histogram2D not set #4540

Open n01r opened 9 months ago

n01r commented 9 months ago

I recently encountered a cryptic error message when working with the ParticleHistogram2D reduced diagnostics. I forgot to set the parameter value_function(t,x,y,z,ux,uy,uz,w). It would be good if WarpX printed a clearer message telling the user which parameter is missing and how to set it.

How to reproduce: Just run the example at Examples/Physical_applications/laser_ion but comment out PhaseSpaceElectrons.value_function(t,x,y,z,ux,uy,uz,w) = "w" from inputs_2d before running it.
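
For context, the reduced-diagnostics block in inputs_2d looks roughly like the sketch below; apart from the value_function line quoted above, the parameter names and values are paraphrased from the ParticleHistogram2D documentation and may not match the example file exactly:

```
# illustrative sketch, not a verbatim copy of the laser_ion example;
# only the value_function line is the one referenced in this report
warpx.reduced_diags_names = PhaseSpaceElectrons
PhaseSpaceElectrons.type = ParticleHistogram2D
PhaseSpaceElectrons.intervals = 100
PhaseSpaceElectrons.species = electrons
PhaseSpaceElectrons.histogram_function_abs(t,x,y,z,ux,uy,uz,w) = "z"
PhaseSpaceElectrons.bin_number_abs = 200
PhaseSpaceElectrons.bin_min_abs = 0.0
PhaseSpaceElectrons.bin_max_abs = 50.e-6
PhaseSpaceElectrons.histogram_function_ord(t,x,y,z,ux,uy,uz,w) = "uz"
PhaseSpaceElectrons.bin_number_ord = 200
PhaseSpaceElectrons.bin_min_ord = -0.5
PhaseSpaceElectrons.bin_max_ord = 0.5
# commenting out the following line is what triggers the crash
PhaseSpaceElectrons.value_function(t,x,y,z,ux,uy,uz,w) = "w"
```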

Tested on one node on Crusher (OLCF)

Error output

Memory access fault by GPU node-9 (Agent handle: 0x2ac0aa0) on address (nil). Reason: Unknown.
SIGABRT
Memory access fault by GPU node-6 (Agent handle: 0x2ac0aa0) on address (nil). Reason: Unknown.
SIGABRT
Memory access fault by GPU node-7 (Agent handle: 0x2ac0aa0) on address (nil). Reason: Unknown.
SIGABRT
Memory access fault by GPU node-8 (Agent handle: 0x2ac0aa0) on address (nil). Reason: Unknown.
SIGABRT
See Backtrace.0 file for details
See Backtrace.2 file for details
See Backtrace.1 file for details
See Backtrace.3 file for details
MPICH ERROR [Rank 3] [job id 424102.0] [Mon Dec 18 21:15:34 2023] [crusher020] - Abort(6) (rank 3 in comm 496): application called MPI_Abort(comm=0x84000001, 6) - process 3

Segfault
MPICH ERROR [Rank 0] [job id 424102.0] [Mon Dec 18 21:15:35 2023] [crusher020] - Abort(6) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 6) - process 0

Segfault
MPICH ERROR [Rank 2] [job id 424102.0] [Mon Dec 18 21:15:35 2023] [crusher020] - Abort(6) (rank 2 in comm 496): application called MPI_Abort(comm=0x84000001, 6) - process 2

Segfault
MPICH ERROR [Rank 1] [job id 424102.0] [Mon Dec 18 21:15:35 2023] [crusher020] - Abort(6) (rank 1 in comm 496): application called MPI_Abort(comm=0x84000001, 6) - process 1

Segfault
See Backtrace.3 file for details
See Backtrace.0 file for details
See Backtrace.2 file for details
See Backtrace.1 file for details
MPICH ERROR [Rank 3] [job id 424102.0] [Mon Dec 18 21:15:38 2023] [crusher020] - Abort(11) (rank 3 in comm 496): application called MPI_Abort(comm=0x84000001, 11) - process 3

MPICH ERROR [Rank 0] [job id 424102.0] [Mon Dec 18 21:15:38 2023] [crusher020] - Abort(11) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 11) - process 0

MPICH ERROR [Rank 2] [job id 424102.0] [Mon Dec 18 21:15:38 2023] [crusher020] - Abort(11) (rank 2 in comm 496): application called MPI_Abort(comm=0x84000001, 11) - process 2

MPICH ERROR [Rank 1] [job id 424102.0] [Mon Dec 18 21:15:38 2023] [crusher020] - Abort(11) (rank 1 in comm 496): application called MPI_Abort(comm=0x84000001, 11) - process 1

srun: error: crusher020: task 0: Segmentation fault
srun: Terminating StepId=424102.0
slurmstepd: error: *** STEP 424102.0 ON crusher020 CANCELLED AT 2023-12-18T21:15:38 ***
srun: error: crusher020: tasks 1-2: Segmentation fault
srun: error: crusher020: tasks 4-7: Terminated
srun: error: crusher020: task 3: Segmentation fault (core dumped)
srun: Force Terminated StepId=424102.0
ax3l commented 8 months ago

Hi @n01r, what was printed on stdout?

n01r commented 8 months ago

Hi @ax3l, this is how far the stdout got:

Initializing AMReX (23.12)...
MPI initialized with 8 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 8 devices.
AMReX (23.12) initialized
PICSAR (23.09)
WarpX (23.11-63-gb9b0748f1d00)

    __        __             __  __
    \ \      / /_ _ _ __ _ __\ \/ /
     \ \ /\ / / _` | '__| '_ \\  /
      \ V  V / (_| | |  | |_) /  \
       \_/\_/ \__,_|_|  | .__/_/\_\
                        |_|

Level 0: dt = 1.530214125e-17 ; dx = 5.580357143e-09 ; dz = 8.081896552e-09

Grids Summary:
  Level 0   8 grids  9977856 cells  100 % of domain
            smallest grid: 1344 x 928  biggest grid: 1344 x 928

-------------------------------------------------------------------------------
--------------------------- MAIN EM PIC PARAMETERS ----------------------------
-------------------------------------------------------------------------------
Precision:            | DOUBLE
Particle precision:   | DOUBLE
Geometry:             | 2D (XZ)
Operation mode:       | Electromagnetic
                      | - vacuum
-------------------------------------------------------------------------------
Current Deposition:   | Esirkepov
Particle Pusher:      | Boris
Charge Deposition:    | standard
Field Gathering:      | energy-conserving
Particle Shape Factor:| 3
-------------------------------------------------------------------------------
Maxwell Solver:       | Yee
                      | - staggered grid
Guard cells           | - ng_alloc_EB = (4,4)
 (allocated for E/B)  |
-------------------------------------------------------------------------------
For full input parameters, see the file: warpx_used_inputs

--- INFO    : Writing plotfile diags/diag1000000
--- INFO    : Writing openPMD file diags/openPMDfw000000
--- INFO    : Writing openPMD file diags/openPMDbw000000

And here is the submit script:

crusher_2D3V.sbatch

```
#!/usr/bin/env bash
#SBATCH -A aph114  # note: WarpX ECP members use aph114
#SBATCH -J WarpX
#SBATCH -o %x-%j.out
#SBATCH -t 01:00:00
#SBATCH -p batch
#SBATCH --ntasks-per-node=8
# Since 2022-12-29 Crusher is using a low-noise mode layout,
# making only 7 instead of 8 cores available per process
# https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#id6
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
#SBATCH -N 1

# From the documentation:
# Each Crusher compute node consists of [1x] 64-core AMD EPYC 7A53
# "Optimized 3rd Gen EPYC" CPU (with 2 hardware threads per physical core) with
# access to 512 GB of DDR4 memory.
# Each node also contains [4x] AMD MI250X, each with 2 Graphics Compute Dies
# (GCDs) for a total of 8 GCDs per node. The programmer can think of the 8 GCDs
# as 8 separate GPUs, each having 64 GB of high-bandwidth memory (HBM2E).

# note (5-16-22, OLCFHELP-6888)
# this environment setting is currently needed on Crusher to work-around a
# known issue with Libfabric
#export FI_MR_CACHE_MAX_COUNT=0  # libfabric disable caching
# or, less invasive:
export FI_MR_CACHE_MONITOR=memhooks  # alternative cache monitor

# Seen since August 2023 on Frontier, adapting the same for Crusher
# OLCFDEV-1597: OFI Poll Failed UNDELIVERABLE Errors
# https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#olcfdev-1597-ofi-poll-failed-undeliverable-errors
export MPICH_SMP_SINGLE_COPY_MODE=NONE
export FI_CXI_RX_MATCH_MODE=software

# note (9-2-22, OLCFDEV-1079)
# this environment setting is needed to avoid that rocFFT writes a cache in
# the home directory, which does not scale.
export ROCFFT_RTC_CACHE_PATH=/dev/null

export OMP_NUM_THREADS=1

export WARPX_NMPI_PER_NODE=8
export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${WARPX_NMPI_PER_NODE} ))
srun -N${SLURM_JOB_NUM_NODES} -n${TOTAL_NMPI} --ntasks-per-node=${WARPX_NMPI_PER_NODE} \
  ./warpx.2d inputs_2d > output_${SLURM_JOBID}.txt
```
pordyna commented 7 months ago

@ax3l I ran into the exact same problem today. It was somewhat hard to debug; would you consider making this a required parameter until the underlying issue is solved? That way, other users won't run into this error again.
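
A minimal sketch of what such a check could look like, using plain amrex::ParmParse (the actual WarpX parsing goes through its own parser helpers; the function name and message below are purely illustrative):

```cpp
#include <AMReX.H>
#include <AMReX_ParmParse.H>

#include <string>

// Hypothetical helper: abort with a readable message if the ParticleHistogram2D
// reduced diagnostic <rd_name> has no value_function set, instead of crashing
// much later with an illegal memory access on the GPU.
void CheckValueFunction (const std::string& rd_name)
{
    const amrex::ParmParse pp(rd_name);
    std::string value_function_string;
    if (!pp.query("value_function(t,x,y,z,ux,uy,uz,w)", value_function_string)) {
        amrex::Abort(
            rd_name + ".value_function(t,x,y,z,ux,uy,uz,w) is not set but is "
            "required for ParticleHistogram2D (e.g. use \"w\" to deposit the "
            "particle weight).");
    }
}
```

Hooked into the ParticleHistogram2D setup, a check along these lines would turn the late GPU memory fault into an immediate, self-explanatory error at startup.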

Btw., this is the stderr that I get on Perlmutter:

amrex::Abort::6::CUDA error 700 in file /global/homes/p/pordyna/src/warpx/build_pm_gpu/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 614: an illegal memory access was encountered !!!
SIGABRT
amrex::Abort::1::CUDA error 700 in file /global/homes/p/pordyna/src/warpx/build_pm_gpu/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 614: an illegal memory access was encountered !!!
SIGABRT
amrex::Abort::3::CUDA error 700 in file /global/homes/p/pordyna/src/warpx/build_pm_gpu/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 614: an illegal memory access was encountered !!!
SIGABRT