Open cyrush opened 10 months ago
Updates
WarpX on Frontier: Able to update to required modules from here https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html#frontier-olcf but with cce/16.0.1
NekRS on Frontier: hip flags being added to non hip compilations leading to the following error
[ 12%] Building C object blt/tests/smoke/CMakeFiles/blt_hip_runtime_c_smoke.dir/blt_hip_runtime_c_smoke.c.o
cd /autofs/nccs-svm1_sw/summit/ums/ums010/2023_01/frontier/ascent_nekrs/build/camp-2022.10.1/blt/tests/smoke && /opt/cray/pe/craype/2.7.19/bin/CC -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -isystem /opt/rocm-5.3.0/include -isystem /opt/rocm-5.3.0/llvm/lib/clang/15.0.0/include/.. -Wall -Wextra -O3 -DNDEBUG -fPIE --rocm-path=/opt/rocm-5.3.0 -x hip --offload-arch=gfx90a -std=c++17 -MD -MT blt/tests/smoke/CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o -MF CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o.d -o CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o -c /autofs/nccs-svm1_sw/summit/ums/ums010/2023_01/frontier/ascent_nekrs/camp-2022.10.1/extern/blt/tests/smoke/blt_hip_smoke.cpp
g++: error: unrecognized command-line option '--rocm-path=/opt/rocm-5.3.0'
g++: error: unrecognized command-line option '--offload-arch=gfx90a'
NekRS on Summit: Getting the following error, realized we need to use a newer cuda (cuda/11.1.1)
ptxas fatal : Unresolved extern function
from here https://github.com/LLNL/RAJA/blob/e78b1eb03cbcd9f954c9f54ea79b5f6f479bde45/include/RAJA/pattern/params/forall.hpp#L70
NekRS on Frontier building Camp:
[ 22%] Building CXX object blt/tests/smoke/CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o
cd /autofs/nccs-svm1_sw/summit/ums/ums010/2023_01/frontier/ascent_nekrs/build/camp-2022.10.1/blt/tests/smoke && /opt/cray/pe/craype/2.7.19/bin/CC -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -isystem /opt/rocm-5.3.0/include -isystem /opt/rocm-5.3.0/llvm/lib/clang/15.0.0/include/.. -Wall -Wextra -O3 -DNDEBUG -fPIE --rocm-path=/opt/rocm-5.3.0 -x hip --offload-arch=gfx90a -std=c++17 -MD -MT blt/tests/smoke/CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o -MF CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o.d -o CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o -c /autofs/nccs-svm1_sw/summit/ums/ums010/2023_01/frontier/ascent_nekrs/camp-2022.10.1/extern/blt/tests/smoke/blt_hip_smoke.cpp
g++: error: unrecognized command-line option '--rocm-path=/opt/rocm-5.3.0'
g++: error: unrecognized command-line option '--offload-arch=gfx90a'
make[2]: *** [blt/tests/smoke/CMakeFiles/blt_hip_smoke.dir/build.make:79: blt/tests/smoke/CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o] Error 1
Bad news for gnu + hip on Frontier. Helpful info from our friend Ryan at OLCF: ..."the CC compiler wrapper for PrgEnv-gnu doesn't support HIP, because gcc (unlike clang) doesn't have support for HIP yet."
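For anyone hitting the same errors: a quick way to see which compiler sits behind the CC wrapper is to look at its version banner, since g++ rejects `--rocm-path` and `--offload-arch` while clang accepts them. The helper below is only a sketch; the name `compiler_family` is made up for illustration, and the banner matching assumes typical gcc and clang `--version` output.

```shell
# Hypothetical helper (not part of any build system mentioned here):
# reads a compiler's --version banner on stdin and prints
# "clang", "gnu", or "unknown".
compiler_family() {
  banner=$(cat)
  case "$banner" in
    # Check clang first so Cray/AMD clang banners are not misclassified.
    *clang*) echo "clang" ;;
    # Typical gcc/g++ banners mention GCC or the Free Software Foundation.
    *GCC*|*"Free Software Foundation"*) echo "gnu" ;;
    *) echo "unknown" ;;
  esac
}

# Usage on a login node (assumes the CC wrapper is on PATH):
#   CC --version 2>&1 | compiler_family
```

If this prints "gnu" with PrgEnv-gnu loaded, HIP flags passed through CC will fail as in the log above.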
@mvictoras Unfortunately we haven't had the greatest success with these builds. We are roadblocked on Frontier because the PrgEnv-gnu compiler wrappers do not support HIP. On Summit, I was able to get a build with a newer cuda version, but the majority of my tests are failing with a cuda device error in vtkm.
@nicolemarsaglia I am able to run NekRS + Ascent on Frontier with the ascent module.
Here are the modules I use
module load PrgEnv-gnu
module load craype-accel-amd-gfx90a
module load cray-mpich
module load rocm
module load ascent/0.8.0
module unload cray-libsci
module list
export MPICH_GPU_SUPPORT_ENABLED=1
Currently Loaded Modules:
1) craype-x86-trento 10) PrgEnv-gnu/8.3.3
2) libfabric/1.15.2.0 11) darshan-runtime/3.4.0
3) craype-network-ofi 12) hsi/default
4) perftools-base/22.12.0 13) DefApps/default
5) xpmem/2.6.2-2.5_2.22__gd067c3f.shasta 14) craype-accel-amd-gfx90a
6) cray-pmi/6.1.8 15) cray-mpich/8.1.23
7) gcc/12.2.0 16) rocm/5.3.0
8) craype/2.7.19 17) ascent/0.8.0
9) cray-dsmml/0.2.2
I'm also using my own branch of NekRS which is based on our latest release, v23. Let me know if you need any further information.
@yslan thanks for the info! I'm shocked there is an ascent module on Frontier. Unfortunately, ascent/0.8.0 will not have HIP/GPU support, but ascent/0.9.0 does, though that version is missing some key performance fixes.
"ascent/0.8.0 will not have HIP/GPU support"
Hmm.... I have been running NekRS + Ascent on Frontier up to 75 Frontier nodes, and it runs pretty well.
NekRS is running on GPU for sure and I found from our interface that we pass the GPU pointer to Ascent. I have a hard time believing it can get the data if Ascent is running on the host.
Need @mvictoras to double-check what is actually happening.
On the other hand, do you happen to know which version of Ascent is in that module? I can find the path to the installed location but I can't find the source code.
/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/ascent-0.8.0-6j27g2kx4a3zpg5ojh27ffhqsuurodzy/
@yslan those are facility builds created with spack, so I think the spack source stage is probably gone.
CUDA vs HIP runtimes are different with respect to GPU vs host access pitfalls.
You could confirm by running a profiler to look at GPU work.
Note: We have only been using build_ascent for HIP builds. We want to have spack support for HIP, but it was changing so rapidly we had to have a stable way to build for Frontier.
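Besides a profiler, one cheap check is whether the installed Ascent libraries link the HIP runtime at all. The sketch below assumes shared libraries and that the HIP runtime shows up as `libamdhip64` in `ldd` output; the function name `links_hip_runtime` and the library path in the usage comment are placeholders, not taken from the module. A static link would not show up this way, so a "no-hip" result is a hint rather than proof.

```shell
# Hypothetical helper: reads `ldd` output on stdin and reports whether the
# HIP runtime (libamdhip64) appears among the linked libraries.
links_hip_runtime() {
  if grep -q 'libamdhip64'; then
    echo "hip"
  else
    echo "no-hip"
  fi
}

# Usage (the library path is a placeholder):
#   ldd /path/to/libascent.so | links_hip_runtime
```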
It looks like the one I was using is rendered with OpenMP Offload.
I see - I think it is using OpenMP on the CPU not GPU. GPU build should improve performance.
"I think it is using OpenMP on the CPU not GPU."
Is there any way to confirm this? On my end, I will try to set up a timer and build our own benchmark.
"GPU build should improve performance"
For HIP, Camp's build system seems to only support LLVM right now and we need GNU.
We only sent a GPU pointer to Ascent. Does OpenMP manage to use that to automatically run on CPU?
Frontier public install for WarpX
Compatible with their new process.
https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html
Also add build instructions to the WarpX docs
NekRS requests (2023/08/04)
gnu + cuda builds on Summit
Summit (mpicc/mpic++/mpif77)
module load gcc make cuda
gnu + hip builds on Frontier
Frontier (cc/CC/ftn)
module load PrgEnv-gnu
module load craype-accel-amd-gfx90a
module load cray-mpich
module load rocm
module unload cray-libsci