AMReX-Codes / amrex

AMReX: Software Framework for Block Structured AMR
https://amrex-codes.github.io/amrex

Spock: pthread issue #2311

Open ax3l opened 2 years ago

ax3l commented 2 years ago

Reported by @WeiqunZhang, based on a report from an <unknown user> on Spock (OLCF).

On a login node:

module load cmake/3.21.2-dev rocm/4.3.0
git clone git@github.com:AMReX-Codes/amrex-tutorials.git
cd amrex-tutorials
cmake -S . \
-B build/3d.gnu.float.hip \
-DAMReX_FORTRAN=OFF \
-DAMReX_GPU_BACKEND=HIP \
-DAMReX_AMD_ARCH=gfx908 \
-DAMReX_OMP=OFF \
-DAMReX_MPI=OFF \
-DAMReX_LINEAR_SOLVERS=OFF \
-DAMReX_PRECISION=SINGLE \
-DAMReX_SPACEDIM=3 \
-DCMAKE_CXX_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang++ \
-DCMAKE_CXX_STANDARD=17 \
-DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
-DAMReX_TINY_PROFILE=OFF -DAMReX_BASE_PROFILE=OFF \
-DAMReX_AMRLEVEL=OFF \
-DCMAKE_BUILD_TYPE=Release
cmake --build build/3d.gnu.float.hip -j 12

results in

[ 72%] Linking CXX executable Amr_Advection_AmrCore
cd /ccs/home/wqzhang/mygitrepo/amrex-tutorials/build/3d.gnu.float.hip/Amr/Advection_AmrCore && /autofs/nccs-svm1_sw/spock/spack-envs/base/opt/linux-sles15-x86_64/gcc-7.5.0/cmake-3.21.2-dev-ovcgpray6yyjz2n7wjuv6lv4qkgietzs/bin/cmake -E cmake_link_script CMakeFiles/Amr_Advection_AmrCore.dir/link.txt --verbose=1
/opt/rocm-4.3.0/llvm/bin/clang++ -O3 -DNDEBUG -fgpu-rdc CMakeFiles/Amr_Advection_AmrCore.dir/Source/AdvancePhiAllLevels.cpp.o CMakeFiles/Amr_Advection_AmrCore.dir/Source/AdvancePhiAtLevel.cpp.o CMakeFiles/Amr_Advection_AmrCore.dir/Source/AmrCoreAdv.cpp.o CMakeFiles/Amr_Advection_AmrCore.dir/Source/DefineVelocity.cpp.o CMakeFiles/Amr_Advection_AmrCore.dir/Source/main.cpp.o -o Amr_Advection_AmrCore  -Wl,-rpath,/opt/rocm-4.3.0/hip/lib:/opt/rocm-4.3.0/lib:/opt/rocm-4.3.0/hiprand/lib:/opt/rocm-4.3.0/rocrand/lib ../../_deps/amrex-build/Src/libamrex.a /opt/rocm-4.3.0/hip/lib/libamdhip64.so.4.3.40300 --hip-link --offload-arch=gfx908 -L"/opt/rocm-4.3.0/llvm/lib/clang/13.0.0/include/../lib/linux" -lclang_rt.builtins-x86_64 /opt/rocm-4.3.0/hiprand/lib/libhiprand.so.1.1.40300 /opt/rocm-4.3.0/rocrand/lib/librocrand.so.1.1.40300 -Wl,-rpath-link,/opt/rocm-4.3.0/lib
ld.lld: error: undefined symbol: pthread_create
>>> referenced by AMReX_BackgroundThread.cpp
>>>               AMReX_BackgroundThread.cpp.o:(amrex::BackgroundThread::BackgroundThread()) in archive ../../_deps/amrex-build/Src/libamrex.a
ax3l commented 2 years ago

Most likely issue: https://github.com/ROCmSoftwarePlatform/rocRAND/pull/29#issuecomment-912815457

$ ldd /opt/rocm-4.3.0/rocrand/lib/librocrand.so.1.1.40300
...
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0650e4e000)
...
ax3l commented 2 years ago

Actually, it looks like we are the ones missing the pthread link, since I cannot find any unresolved libpthread dependency in rocrand.

ld.lld: error: undefined symbol: pthread_create
>>> referenced by AMReX_BackgroundThread.cpp
>>>               AMReX_BackgroundThread.cpp.o:(amrex::BackgroundThread::BackgroundThread()) in archive ../../_deps/amrex-build/Src/libamrex.a

Interesting, since we search and link pthreads: https://github.com/AMReX-Codes/amrex/blob/168a690497396de4c6b89a36b6edb0430e51ef4c/Tools/CMake/AMReXParallelBackends.cmake#L1-L8
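
One quick way to confirm that the thread library never reaches the linker is to scan the link command for a pthread flag. The following is a minimal sketch, assuming the link line is available as text (for example copied from the verbose build log); the function name `link_has_pthread` is hypothetical and not part of AMReX or CMake.

```shell
# Hypothetical helper (not part of AMReX or CMake): succeeds if a link command
# passes a pthread flag (-pthread or -lpthread) or names libpthread directly.
link_has_pthread() {
    # $1: the full link command as a single string
    case " $1 " in
        *" -pthread "*|*" -lpthread "*|*libpthread.so*|*libpthread.a*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Applied to the link line in the report, this should return non-zero, since neither flag nor a libpthread path appears on it: `link_has_pthread "$(cat link_line.txt)" || echo "no pthread on link line"` (the file name is a placeholder).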

ax3l commented 2 years ago

The CMake output from this setup:

-- The C compiler identification is Clang 12.0.0
-- The CXX compiler identification is Clang 13.0.0
...
-- Check for working C compiler: /opt/cray/pe/craype/2.7.8/bin/cc - skipped
...
-- Check for working CXX compiler: /opt/rocm-4.3.0/llvm/bin/clang++ - skipped

is concerning. It looks like the Cray Clang and the AMD Clang are mixed.

One should add

-DCMAKE_C_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang

too for consistency.
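
To catch this kind of mix-up early, one can compare the two compilers that CMake recorded in its cache. This is a rough sketch, assuming both entries are present in `CMakeCache.txt` of the build directory; the helper name `check_compiler_consistency` is hypothetical, and comparing directories is only a crude heuristic.

```shell
# Hypothetical helper: print the C and CXX compilers recorded in a CMake cache
# and fail if they live in different directories (a hint of mixed toolchains).
check_compiler_consistency() {
    # $1: path to CMakeCache.txt
    cc=$(sed -n 's/^CMAKE_C_COMPILER:[A-Z]*=//p' "$1")
    cxx=$(sed -n 's/^CMAKE_CXX_COMPILER:[A-Z]*=//p' "$1")
    echo "C:   $cc"
    echo "CXX: $cxx"
    # Succeed only if both entries exist and share the same directory.
    [ -n "$cc" ] && [ -n "$cxx" ] && [ "$(dirname "$cc")" = "$(dirname "$cxx")" ]
}
```

Usage would be `check_compiler_consistency build/3d.gnu.float.hip/CMakeCache.txt || echo "compilers differ"`. Note that the directory check also flags legitimate pairs from one toolchain that are installed in different directories, so a failure is a prompt to look, not proof of a problem.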

WeiqunZhang commented 2 years ago

Yes, we should do that. That seems to fix the pthread issue.

WeiqunZhang commented 2 years ago

Let's ignore the errors in compiling tutorials that use AmrLevel. If I run amrex-tutorials/build/3d.gnu.float.hip/Basic/HelloWorld_C/Basic_HelloWorld_C, I get

Initializing HIP...
HIP initialized.
"Cannot find Symbol"
SIGABRT
See Backtrace.0 file for details

So now we have reproduced the symbol issue reported to us.

ax3l commented 2 years ago

Compiling now with

cmake -S . -B build/3d.gnu.float.hip -DAMReX_FORTRAN=OFF -DAMReX_GPU_BACKEND=HIP -DAMReX_AMD_ARCH=gfx908 -DAMReX_OMP=OFF -DAMReX_MPI=OFF -DAMReX_PRECISION=SINGLE -DAMReX_SPACEDIM=3 -DCMAKE_CXX_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang++ -DCMAKE_CXX_STANDARD=17 -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang
cmake --build build/3d.gnu.float.hip -j 12

to reproduce

ax3l commented 2 years ago

With cmake 3.20.2 we can use hipcc as CXX Compiler:

cmake -S . -B build/3d.gnu.float.hip -DAMReX_FORTRAN=OFF -DAMReX_GPU_BACKEND=HIP -DAMReX_AMD_ARCH=gfx908 -DAMReX_OMP=OFF -DAMReX_MPI=OFF -DAMReX_PRECISION=SINGLE -DAMReX_SPACEDIM=3 -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_CXX_STANDARD=17 -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang

So it is just some LLVM magic flags from hipcc that are missing.

ax3l commented 2 years ago

Doing the same with cmake/3.21.2-dev unravels hipcc to clang++.

Now we have to work around the bug, already fixed upstream, concerning the defaults in the -x cxx and -x hip front-ends: (ref)

export CXXFLAGS="-std=c++17"
cmake -S . -B build/3d.gnu.float.hip -DAMReX_FORTRAN=OFF -DAMReX_GPU_BACKEND=HIP -DAMReX_AMD_ARCH=gfx908 -DAMReX_OMP=OFF -DAMReX_MPI=OFF -DAMReX_PRECISION=SINGLE -DAMReX_SPACEDIM=3 -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_CXX_STANDARD=17 -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/opt/rocm-4.3.0/llvm/bin/clang

That still raises "Cannot find Symbol", though, so some LLVM flags are still being lost somewhere, maybe because ROCm 4.3.0 does not yet anticipate CMake 3.21-dev and thus hip::device misses some flags.

For now, users should not use a dev version of CMake on Spock, but just the latest stable release.
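
A build script could guard against picking up a development CMake by accident. The sketch below assumes a dev build is recognizable either by a `-dev` tag, as in the module name above, or by CMake's date-style patch number for development versions (YYYYMMDD); neither was verified on Spock, and the helper name `is_dev_cmake_version` is hypothetical.

```shell
# Hypothetical helper: succeeds if a CMake version string looks like a
# development build rather than a stable release.
is_dev_cmake_version() {
    # $1: version string, e.g. "3.20.2" or "3.21.2-dev"
    case "$1" in
        *-dev*) return 0 ;;
    esac
    # Development builds use a date (YYYYMMDD) as the patch level.
    patch=$(printf '%s\n' "$1" | sed -n 's/^[0-9]*\.[0-9]*\.\([0-9]*\).*/\1/p')
    [ -n "$patch" ] && [ "$patch" -ge 20000000 ]
}
```

The version string can be taken from the loaded module name or from the first line of `cmake --version`, for example `is_dev_cmake_version "3.21.2-dev" && echo "switch to a stable CMake module"`.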

ax3l commented 2 years ago

For the "Cannot find Symbol" issue, one can strace the application like this (Crusher example):

export proj=aphXYZ  # change this to your OLCF project
alias runNode="srun -A $proj -J warpx -t 00:30:00 -p batch -N 1 -c 8 --ntasks-per-node=8"

cd build/bin
runNode strace ./warpx ../../Examples/Physics_applications/laser_acceleration/inputs_3d 2>&1 | grep -E '^open(at)?\(.*\.so'
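
To narrow the strace output further, one could list only the shared libraries whose open call failed. This is a sketch under the assumption that a missing library is what is being hunted for; the filter name `failed_so_opens` is hypothetical, and it expects plain strace lines without a PID prefix, like the grep above.

```shell
# Hypothetical filter: read strace output on stdin and print the path of every
# shared library whose open()/openat() call failed with ENOENT.
failed_so_opens() {
    grep -E '^open(at)?\(.*\.so[^"]*".*= -1 ENOENT' |
        sed -E 's/^[^"]*"([^"]*)".*/\1/'
}
```

It would be used as `runNode strace ./warpx <inputs> 2>&1 | failed_so_opens | sort -u`. The dynamic loader probes several search directories, so many ENOENT lines are normal; only a library that is never opened successfully from any directory is suspicious.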

Note latest Crusher instructions in WarpX: https://warpx.readthedocs.io/en/latest/install/hpc/crusher.html

BenWibking commented 1 year ago

I am getting the "Cannot find Symbol" issue on NCSA Delta's MI100 node. Unfortunately, it doesn't have the Cray compilers installed, so I can't follow the WarpX build instructions. Is there another workaround?

WeiqunZhang commented 1 year ago

gnu make

BenWibking commented 1 year ago

gnu make

Weirdly, although it complains, it also works if I set CMAKE_CXX_COMPILER to hipcc. Is this a CMake bug?

WeiqunZhang commented 1 year ago

I don't know. GNU make uses the hipcc wrapper instead of AMD's clang.

BenWibking commented 1 year ago

FYI- the hipcc/amdclang++ issue has been passed along to AMD's ROCm dev team.

rhaas80 commented 1 year ago

I ran into this as well (ORNL Crusher this time). I used Cray's CC wrapper and cmake. Is there a solution other than using hipcc or GNU make? I am building for Cactus/CarpetX, which is itself a complex build system, so, given that it took me a couple of days to get things working with CC, I am hoping not to have to redo everything for hipcc ;-)