AMReX-Combustion / PeleC

An AMR code for compressible reacting flow simulations
https://amrex-combustion.github.io/PeleC

Encountered "named symbol not found" error when I tried to run PMF on RTX4060 #809

Open himcraft opened 3 months ago

himcraft commented 3 months ago

Hello. I recently tried to run the PMF case on my laptop GPU. I set USE_CUDA to TRUE in GNUmakefile and recompiled following the instructions in the documentation. However, when I ran ./PeleC3d.gnu.CUDA.ex pmf-lidryer-cvode.inp, it printed:

Initializing AMReX (23.12-8-g43d71da32fa4)...
Initializing CUDA...
CUDA initialized with 1 device.
amrex::Abort::0::GPU last error detected in file /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 885: named symbol not found !!!
SIGABRT
See Backtrace.0 file for details

The contents of Backtrace.0 are:

Host Name: himcraft
=== If no file names and line numbers are shown below, one can run
            addr2line -Cpfie my_exefile my_line_address
    to convert `my_line_address` (e.g., 0x4a6b) into file name and line number.
    Or one can use amrex/Tools/Backtrace/parse_bt.py.

=== Please note that the line number reported by addr2line may not be accurate.
    One can use
            readelf -wl my_exefile | grep my_line_address'
    to find out the offset for that line.

 0: ./PeleC3d.gnu.CUDA.ex(+0x27f0e0) [0x55603e1a90e0]
    amrex::BLBackTrace::print_backtrace_info(_IO_FILE*) at /usr/include/x86_64-linux-gnu/bits/unistd.h:349
 (inlined by) amrex::BLBackTrace::print_backtrace_info(_IO_FILE*) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_BLBackTrace.cpp:199

 1: ./PeleC3d.gnu.CUDA.ex(+0x280f25) [0x55603e1aaf25]
    amrex::BLBackTrace::handler(int) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_BLBackTrace.cpp:99

 2: ./PeleC3d.gnu.CUDA.ex(+0x141001) [0x55603e06b001]
    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_is_local() const at /usr/include/c++/9/bits/basic_string.h:226
 (inlined by) std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose() at /usr/include/c++/9/bits/basic_string.h:235
 (inlined by) std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /usr/include/c++/9/bits/basic_string.h:662
 (inlined by) amrex::Gpu::ErrorCheck(char const*, int) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuError.H:54
 (inlined by) std::enable_if<amrex::MaybeDeviceRunnable<__nv_dl_wrapper_t<__nv_dl_tag<void (*)(unsigned long), &(anonymous namespace)::ResizeRandomSeed, 1u>, unsigned long, curandStateXORWOW*>, void>::value, void>::type amrex::ParallelFor<256, int, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(unsigned long), &(anonymous namespace)::ResizeRandomSeed, 1u>, unsigned long, curandStateXORWOW*>, void>(amrex::Gpu::KernelInfo const&, int, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(unsigned long), &(anonymous namespace)::ResizeRandomSeed, 1u>, unsigned long, curandStateXORWOW*>&&) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:885
 (inlined by) void amrex::ParallelFor<int, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(unsigned long), &(anonymous namespace)::ResizeRandomSeed, 1u>, unsigned long, curandStateXORWOW*>, void>(int, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(unsigned long), &(anonymous namespace)::ResizeRandomSeed, 1u>, unsigned long, curandStateXORWOW*>&&) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:1457
 (inlined by) (anonymous namespace)::ResizeRandomSeed(unsigned long) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_Random.cpp:60

 3: ./PeleC3d.gnu.CUDA.ex(+0x105894) [0x55603e02f894]
    amrex::Initialize(int&, char**&, bool, int, std::function<void ()> const&, std::ostream&, std::ostream&, void (*)(char const*)) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX.cpp:628

 4: ./PeleC3d.gnu.CUDA.ex(+0x48492) [0x55603df72492]
    std::_Function_base::~_Function_base() at /usr/include/c++/9/bits/std_function.h:259
 (inlined by) std::function<void ()>::~function() at /usr/include/c++/9/bits/std_function.h:369
 (inlined by) main at /home/himcraft/PeleC/Exec/RegTests/PMF/../../../Source/main.cpp:58

 5: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f1dac95c083]

 6: ./PeleC3d.gnu.CUDA.ex(+0x4a13e) [0x55603df7413e]
    ?? ??:0

The CUDA version is 12.1. Could my CUDA driver be installed incorrectly? I can run my own .cu code without errors, though.

Thanks in advance.

SRkumar97 commented 2 months ago

Hi, I am facing a similar issue. I tried to run the EB-C14 compression ramp case on a dedicated GPU cluster with the CUDA and MPI flags set to TRUE. I used nprocs=16 and ngpu=1, so np=16. The case fails to start: AMReX_Arena.cpp reports an out-of-memory error, and the same AMReX_GpuLaunchFunctsG.H file reports an error at line 749:

Multiple GPUs are visible to each MPI rank, This may lead to incorrect or suboptimal rank-to-GPU mapping.!
There are more MPI processes than the number of GPUs.!
amrex::Abort::10::CUDA error 2 in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_Arena.cpp line 193: out of memory !!!
SIGABRT
amrex::Abort::9::CUDA error 2 in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_Arena.cpp line 193: out of memory !!!
SIGABRT
amrex::Abort::1::CUDA error 2 in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_Arena.cpp line 193: out of memory !!!
SIGABRT
amrex::Abort::4::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::12::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::7::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::14::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::8::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::6::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::15::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::0::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::13::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::5::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::2::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::3::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::11::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
See Backtrace.4 file for details
See Backtrace.12 file for details
See Backtrace.7 file for details
See Backtrace.15 file for details
See Backtrace.6 file for details
See Backtrace.14 file for details
See Backtrace.5 file for details
See Backtrace.13 file for details
See Backtrace.0 file for details
See Backtrace.8 file for details
See Backtrace.9 file for details
See Backtrace.1 file for details
See Backtrace.2 file for details
See Backtrace.10 file for details
See Backtrace.3 file for details
See Backtrace.11 file for details

MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD with errorcode 6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

[scn50-mn:1634243] 9 more processes have sent help message help-mpi-api.txt / mpi-abort
[scn50-mn:1634243] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The first issue, the out-of-memory error caused by the number of MPI processes, goes away once I set np equal to the number of GPUs. However, the second error, reported from AMReX_GpuLaunchFunctsG.H, is still there. Any help would be appreciated!
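
With 16 ranks writing to the same stream, it can be hard to see which ranks hit which error. The following is a minimal Python sketch, not part of PeleC or AMReX, that groups ranks by the abort they reported. The function name `tally_aborts` and the regular expression are assumptions based only on the message format visible in the log above.

```python
import re
from collections import defaultdict

# Assumed message shape, taken from the log in this thread:
#   amrex::Abort::<rank>::<text> in file <path> line <N>: <message> !!!
ABORT_RE = re.compile(
    r"amrex::Abort::(?P<rank>\d+)::.*?"
    r"in file (?P<path>\S+) line (?P<line>\d+): (?P<msg>.*?) !!!"
)


def tally_aborts(log_text):
    """Map (source file, line, message) to the sorted list of ranks that reported it."""
    groups = defaultdict(list)
    for match in ABORT_RE.finditer(log_text):
        # Keep only the file name so paths from different machines group together.
        source = match.group("path").rsplit("/", 1)[-1]
        key = (source, int(match.group("line")), match.group("msg"))
        groups[key].append(int(match.group("rank")))
    return {key: sorted(ranks) for key, ranks in groups.items()}


def print_tally(log_text):
    """Print one summary line per distinct abort message."""
    for (source, line, msg), ranks in tally_aborts(log_text).items():
        print(f"{source}:{line}: {msg} (ranks {ranks})")
```

Calling `print_tally(open("run.log").read())` on a saved copy of the output (the file name `run.log` is hypothetical) should separate the ranks that aborted in AMReX_Arena.cpp from those that aborted in AMReX_GpuLaunchFunctsG.H. In the log above that appears to be three ranks with "out of memory" and thirteen with "invalid device function".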

himcraft commented 6 days ago

In my opinion, it is probably due to the CUDA version.

When I switched to CUDA 12.6 instead of 12.2, the error disappeared. The same error occurred when I recently tested on another HPC system using CUDA 12.2.
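
For anyone comparing setups, here is a small Python sketch for recording which CUDA toolkit and driver a machine has. It only parses the text printed by `nvcc --version` and `nvidia-smi`. The helper names are made up for illustration, and a version mismatch found this way is a hint, not proof of the cause.

```python
import re
import subprocess


def parse_nvcc_release(text):
    """Return (major, minor) from `nvcc --version` output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_driver_cuda(text):
    """Return (major, minor) from the 'CUDA Version' field of `nvidia-smi`, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_at_least_toolkit(toolkit, driver):
    """True if the driver's reported CUDA version is >= the toolkit's.

    Returns None when either version is unknown. This is a rough check only:
    it does not account for CUDA's compatibility modes.
    """
    if toolkit is None or driver is None:
        return None
    return driver >= toolkit


def _run(cmd):
    """Run a command and return its stdout, or an empty string if it is missing."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
    except FileNotFoundError:
        return ""


def report():
    """Print the toolkit and driver CUDA versions found on this machine."""
    toolkit = parse_nvcc_release(_run(["nvcc", "--version"]))
    driver = parse_driver_cuda(_run(["nvidia-smi"]))
    print("toolkit (nvcc):", toolkit)
    print("driver supports up to:", driver)
    print("driver >= toolkit:", driver_at_least_toolkit(toolkit, driver))
```

Running `report()` on both the working and the failing machine gives a quick side-by-side of the versions discussed here.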