Closed n01r closed 2 years ago
Memo from our discussion:
Run `cuda-gdb` with the AMReX runtime options `amrex.throw_exception = 1 amrex.signal_handling = 0`.
With `cudatoolkit/11.5`, running `cuda-gdb` gives an error, but swapping it out for `cudatoolkit/11.0` lets me run the debugger.
I could not see any values for variables because the compiler optimizes them out in `Drift.H`.
69 p.pos(0) = x + m_ds * px;
(cuda-gdb) print px
$1 = <optimized out>
(cuda-gdb) print x
$2 = <optimized out>
(cuda-gdb) print p
$3 = (@local _ZN7impactx5Drift5PTypeE & @local) <error reading variable>
(cuda-gdb) break Drift.H:67
So I built again with the options `-g -O0` and hopefully I will see more.
Edit: ... I actually tried to build it again without optimization, but it still shows `<optimized out>`. Should I have deleted the build directory completely before?
The object `p` is a complicated struct, so I think the final line makes sense. I'm not sure if gdb will allow a `print p.pos(0)`, etc.
In the end, the current AMReX particle AoS object `p` is really just a
struct {
amrex::ParticleReal r[n];
int i[m];
};
You could check in cuda-gdb if the object `p` is valid memory (on the device) itself by printing its address and checking its range, and then printing its first member (which we interpret as position x).
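To make the "first member is position x" point concrete, here is a minimal host-only sketch. `ParticleSketch`, `real_from_bytes`, `make_example`, and the sizes `n = 3`, `m = 2` are hypothetical stand-ins chosen for illustration, and `ParticleReal` is assumed to be `double`; the real AMReX particle type is templated and has more machinery around it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

using ParticleReal = double;  // assumption: double-precision build

constexpr int n = 3;  // hypothetical number of real components
constexpr int m = 2;  // hypothetical number of integer components

// Simplified stand-in for the AoS particle described above.
struct ParticleSketch {
    ParticleReal r[n];  // r[0] is what we interpret as position x
    int i[m];
};

// The real components start at the object's own address.
static_assert(offsetof(ParticleSketch, r) == 0,
              "r must start at the object's address");

// Read the k-th real component from the raw bytes of the object,
// which is what examining memory at the object's address amounts to.
inline ParticleReal real_from_bytes(ParticleSketch const& p, int k)
{
    assert(k >= 0 && k < n);
    auto const* base = reinterpret_cast<unsigned char const*>(&p);
    ParticleReal value;
    std::memcpy(&value, base + k * sizeof(ParticleReal), sizeof(ParticleReal));
    return value;
}

// Build a particle with recognizable values for the check.
inline ParticleSketch make_example()
{
    ParticleSketch p{};
    p.r[0] = 1.5;
    p.r[1] = -2.0;
    p.r[2] = 0.25;
    p.i[0] = 7;
    p.i[1] = 0;
    return p;
}
```

Under these assumptions, `real_from_bytes(make_example(), 0)` returns the same value as `p.r[0]`. In the debugger, the analogous check would be something along the lines of `print &p` followed by gdb's memory-examine command (for example `x/3fg &p`); whether cuda-gdb can resolve the device-side reference in this particular kernel is not something this sketch can confirm.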
> ... I actually tried to build it again without optimization but it still shows `<optimized out>`. Should I have deleted the build directory completely before?
yes, you need to redo the configure step with a fresh build dir. CXXFLAGS are only added at the first configure in a build directory (they change defaults for the configure step).
But deleting `build`, running `cmake -S . -B build` and then doing `ccmake build`, editing stuff, hitting `c` to configure and `g` to generate should work, no?
that should work in general... doing it with a single configure is the safest bet if you are unsure though.
You can configure with `-DCMAKE_VERBOSE_MAKEFILE=ON` if you are unsure what's ending up on the compiler line and want to see.
cc @WeiqunZhang @atmyers @kngott turns out this is in part a bug in AMReX init with GPU-aware MPI on Perlmutter.
If I set `export MPICH_GPU_SUPPORT_ENABLED=0`, the issue `Cuda API error detected: cuPointerGetAttribute returned (0x1)` vanishes. Backtrace:
The other issue above is an error when we try to access fundamental types (not even pointers) of lattice elements on device, e.g., the `amrex::ParticleReal m_ds` member: `CUDA Exception: Warp Illegal Address`. The problem is so weird that I am starting to think it's a compiler bug... and it probably is: #174
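For readers unfamiliar with the pattern, the kind of access that fails can be sketched on the host as follows. This is only an illustration, not the ImpactX implementation: `DriftSketch`, `push_x`, `push_with_copy`, and `usage_example` are hypothetical names, `ParticleReal` is assumed to be `double`, and the lambda merely mimics how an element is copied by value into a GPU kernel.

```cpp
#include <cassert>
#include <type_traits>

using ParticleReal = double;  // assumption: double-precision build

// Hypothetical, host-only stand-in for a lattice element.
struct DriftSketch {
    explicit DriftSketch(ParticleReal ds) : m_ds(ds) {}

    // Mirrors the quoted line `p.pos(0) = x + m_ds * px;` for x only.
    ParticleReal push_x(ParticleReal x, ParticleReal px) const
    {
        return x + m_ds * px;
    }

    ParticleReal m_ds;  // segment length: a plain value, not a pointer
};

// Copying an element by value into a kernel relies on it being
// trivially copyable; this holds for the sketch.
static_assert(std::is_trivially_copyable<DriftSketch>::value,
              "element must be trivially copyable");

// Mimic a kernel launch: the element is captured by value in the callable.
inline ParticleReal push_with_copy(DriftSketch element,
                                   ParticleReal x, ParticleReal px)
{
    auto kernel = [element](ParticleReal x0, ParticleReal px0) {
        return element.push_x(x0, px0);  // reads m_ds from the captured copy
    };
    return kernel(x, px);
}

// Usage example: 1 + 2 * 0.5 is exactly 2 in binary floating point.
inline void usage_example()
{
    assert(push_with_copy(DriftSketch(2.0), 1.0, 0.5) == 2.0);
}
```

On the host this read of `m_ds` is unremarkable, which is why a fault on such a member in device code points away from the element logic itself.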
Hi, I tried to run the FODO example without changes on the Perlmutter GPU partition and encountered the following error:
I first tried the example with the submit script that is provided by the docs (however, I had to change the naming a bit since it's still copied from WarpX). This configuration used 4 nodes, but I also tried a single node and then just a single GPU per node. All fail with the same error.
The backtrace reads as follows:
Backtrace.0
```
=== If no file names and line numbers are shown below, one can run
            addr2line -Cpfie my_exefile my_line_address
    to convert `my_line_address` (e.g., 0x4a6b) into file name and line number.
    Or one can use amrex/Tools/Backtrace/parse_bt.py.

=== Please note that the line number reported by addr2line may not be accurate.
    One can use
            readelf -wl my_exefile | grep my_line_address'
    to find out the offset for that line.

 0: /pscratch/sd/m/mgarten/impactx/001_FODO_single-GPU/./impactx() [0x5d63b6]
    amrex::BLBackTrace::print_backtrace_info(_IO_FILE*) at ??:?

 1: /pscratch/sd/m/mgarten/impactx/001_FODO_single-GPU/./impactx() [0x5d875c]
    amrex::BLBackTrace::handler(int) at ??:?

 2: /pscratch/sd/m/mgarten/impactx/001_FODO_single-GPU/./impactx() [0x5c13e9]
    amrex::Gpu::Device::streamSynchronizeAll() at ??:?

 3: /pscratch/sd/m/mgarten/impactx/001_FODO_single-GPU/./impactx() [0x5b6165]
    amrex::MFIter::~MFIter() at ??:?

 4: /pscratch/sd/m/mgarten/impactx/001_FODO_single-GPU/./impactx() [0x473e79]
    impactx::Push(impactx::ImpactXParticleContainer&, std::__cxx11::list
```
I am adding the `machine/system` label since an earlier Slack message from @ax3l said that it runs on Summit V100s without fail.