Closed tmiethlinger closed 8 months ago
Hello, can you show the output of "module list"?
The file jean_zay_gpu_A100 should be used as inspiration; modifications are probably needed for your system. For instance, we specify -L/gpfslocalsys/cuda/11.2/lib64/ : that would be useless on your cluster.
I see you have COMPILER_INFO : g++ , but it should be nvc++, which leads me to think you don't have an nvhpc module loaded; your hdf5 module is then probably not compiled with it either.
Thank you for your reply.
Here's the output of module list (here CUDA 11.8 is used):
Currently Loaded Modules:
1) release/23.04 (S)
2) GCCcore/11.3.0
3) zlib/1.2.12
4) binutils/2.38
5) GCC/11.3.0
6) numactl/2.0.14
7) XZ/5.2.5
8) libxml2/2.9.13
9) libpciaccess/0.16
10) hwloc/2.7.1
11) OpenSSL/1.1
12) libevent/2.1.12
13) UCX/1.12.1
14) libfabric/1.15.1
15) PMIx/4.1.2
16) UCC/1.0.0
17) OpenMPI/4.1.4
18) OpenBLAS/0.3.20
19) FlexiBLAS/3.2.0
20) FFTW/3.3.10
21) FFTW.MPI/3.3.10
22) ScaLAPACK/2.2.0-fb
23) foss/2022a
24) CUDA/11.8.0
25) ncurses/6.3
26) bzip2/1.0.8
27) cURL/7.83.0
28) libarchive/3.6.1
29) CMake/3.24.3
30) Szip/2.1.1
31) HDF5/1.13.2
As expected, you do not have an nvhpc module loaded (nvhpc includes the nvc++ compiler that is required to compile the code); the cuda module alone only contains the nvcc compiler, which is used to compile the cuda files but not the rest of the code. I recommend installing nvhpc 23.1, which comes with its own cuda and openmpi. You would then only need to compile an hdf5 module with it to be ready in terms of dependencies.
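For reference, building a parallel HDF5 against the nvhpc toolchain usually amounts to pointing configure at nvhpc's bundled MPI compiler wrappers. The sketch below is illustrative, not a tested recipe for this cluster; the nvhpc install path, HDF5 version, and install prefix are all assumptions to adapt:

```shell
# Illustrative sketch: compile a parallel HDF5 with the compilers shipped in nvhpc 23.1.
# All paths and versions here are assumptions; adapt them to your installation.
NVHPC=/opt/nvhpc/Linux_x86_64/23.1
export PATH=$NVHPC/compilers/bin:$NVHPC/comm_libs/mpi/bin:$PATH

tar xf hdf5-1.13.2.tar.gz
cd hdf5-1.13.2
# mpicc from nvhpc's comm_libs wraps nvc, so HDF5 is built with the nvhpc toolchain
CC=mpicc ./configure --enable-parallel --prefix=$HOME/lib/hdf5_nvhpc
make -j 8
make install
export HDF5_ROOT=$HOME/lib/hdf5_nvhpc   # Smilei's makefile reads HDF5_ROOT
```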
Hi,
So, I have now successfully installed nvhpc 23.11.
Which flags would I need to adjust in my machine file? This is what I have now as a machine file (tm_gpu_A100):
SMILEICXX.DEPS = nvcc
THRUSTCXX = nvcc
ACCELERATOR_GPU_FLAGS += -w
ACCELERATOR_GPU_FLAGS += -tp=zen3 -ta=tesla:cc80 -std=c++14 -lcurand -Mcudalib=curand
ACCELERATOR_GPU_KERNEL_FLAGS += -O3 --std c++14 $(DIRS:%=-I%)
ACCELERATOR_GPU_KERNEL_FLAGS += --expt-relaxed-constexpr
ACCELERATOR_GPU_KERNEL_FLAGS += $(shell $(PYTHONCONFIG) --includes)
ACCELERATOR_GPU_KERNEL_FLAGS += -arch=sm_80
ACCELERATOR_GPU_FLAGS += -Minfo=accel # what is offloaded/copied
ACCELERATOR_GPU_FLAGS += -DSMILEI_OPENACC_MODE
ACCELERATOR_GPU_KERNEL_FLAGS += -DSMILEI_OPENACC_MODE
LDFLAGS += -ta=tesla:cc80 -std=c++14 -Mcudalib=curand -lcudart -lcurand -lacccuda -L/home/myuser/lib/nvidia/hpc_sdk/Linux_x86_64/23.11/cuda/12.3/lib64/
CXXFLAGS += -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1
But using make machine="tm_gpu_A100" config="gpu_nvidia noopenmp verbose" -j1 I get:
Checking dependencies for src/Tools/tabulatedFunctions.cpp
if [ ! -d "build/src/Tools" ]; then mkdir -p "build/src/Tools"; fi;
nvcc -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 -D__VERSION=\"5.0-57-gc23dd350a-master\" -DOMPI_SKIP_MPICXX -std=c++14 -I/home/thmi817d/lib/hdf5_nvhpc/include -Isrc -Isrc/ElectroMagnBC -Isrc/SmileiMPI -Isrc/ParticleInjector -Isrc/DomainDecomposition -Isrc/Pusher -Isrc/Species -Isrc/Particles -Isrc/ElectroMagn -Isrc/Params -Isrc/picsar_interface -Isrc/Profiles -Isrc/Radiation -Isrc/Checkpoint -Isrc/ParticleBC -Isrc/Tools -Isrc/Field -Isrc/Collisions -Isrc/Interpolator -Isrc/ElectroMagnSolver -Isrc/MultiphotonBreitWheeler -Isrc/Ionization -Isrc/MovWindow -Isrc/Diagnostic -Isrc/Python -Isrc/Merging -Isrc/Projector -Isrc/Patch -Isrc/PartCompTime -Ibuild/src/Python -I/home/thmi817d/miniconda3/envs/smilei/include/python3.9 -I/home/thmi817d/miniconda3/envs/smilei/include/python3.9 -I/home/thmi817d/miniconda3/envs/smilei/lib/python3.9/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g -MF"build/src/Tools/tabulatedFunctions.d" -MM -MP -MT"build/src/Tools/tabulatedFunctions.d build/src/Tools/tabulatedFunctions.o" src/Tools/tabulatedFunctions.cpp
nvcc fatal : Unknown option '-MFbuild/src/Tools/tabulatedFunctions.d'
Checking dependencies for src/Tools/PyTools.cpp
...
My current Smilei profile looks like:
NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/home/myuser/lib/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/23.11/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/23.11/compilers/bin:$PATH; export PATH
export PATH=$NVCOMPILERS/$NVARCH/23.11/comm_libs/mpi/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/23.11/comm_libs/mpi/man
export HDF5_ROOT=$HOME/lib/hdf5_nvhpc
export LD_LIBRARY_PATH=$HDF5_ROOT/lib:$LD_LIBRARY_PATH
Do you see what the issue might be? The folders 23.11/compilers and 23.11/comm_libs exist, so that part should be correct, I think.
You installed nvhpc 23.11, which may contain cuda 11.8 and/or cuda 12.3. For cuda 12.3 there are currently known issues that we are working on. For cuda 11.8, modifications in the code might be needed, which is why I recommended nvhpc 23.1; you can get it here: https://developer.nvidia.com/nvidia-hpc-sdk-231-downloads.
To answer your questions:
Change SMILEICXX.DEPS to nvc++.
The -ta=tesla:cc80 option works with nvhpc 23.1 but not with nvhpc > 23.4; you would need different options, which is another reason to use the older nvhpc. (You can look at the machine file ruche_gpu2 as an example, where we compiled and executed with nvhpc 23.9 and cuda 11.8; it is possible, but some executables had issues, so I do not recommend it at this time.)
The "error" messages during the dependency check can be ignored; they are not an issue.
The rest should be fine.
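Putting these points together, the adjusted machine file might look roughly like the sketch below. This is untested: the only functional change from your file is SMILEICXX.DEPS, and the -L path is a placeholder that must point at the CUDA bundled with whichever nvhpc you end up using:

```makefile
# Sketch only: adapted from the tm_gpu_A100 file in this thread, assuming nvhpc 23.1.
SMILEICXX.DEPS = nvc++   # nvc++ accepts the gcc-style -MM/-MF dependency flags that nvcc rejects
THRUSTCXX = nvcc
ACCELERATOR_GPU_FLAGS += -w
ACCELERATOR_GPU_FLAGS += -tp=zen3 -ta=tesla:cc80 -std=c++14 -lcurand -Mcudalib=curand
ACCELERATOR_GPU_KERNEL_FLAGS += -O3 --std c++14 $(DIRS:%=-I%)
ACCELERATOR_GPU_KERNEL_FLAGS += --expt-relaxed-constexpr
ACCELERATOR_GPU_KERNEL_FLAGS += $(shell $(PYTHONCONFIG) --includes)
ACCELERATOR_GPU_KERNEL_FLAGS += -arch=sm_80   # sm_80 = A100
ACCELERATOR_GPU_FLAGS += -Minfo=accel # what is offloaded/copied
ACCELERATOR_GPU_FLAGS += -DSMILEI_OPENACC_MODE
ACCELERATOR_GPU_KERNEL_FLAGS += -DSMILEI_OPENACC_MODE
LDFLAGS += -ta=tesla:cc80 -std=c++14 -Mcudalib=curand -lcudart -lcurand -lacccuda \
           -L/path/to/nvhpc/Linux_x86_64/23.1/cuda/lib64/   # placeholder: adjust to your install
CXXFLAGS += -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1
```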
In the future, we ask that you use the chatroom for support: https://app.element.io/#/room/!LQrdVpOJEohPSWMlmf:matrix.org
If you need more space to describe your problem, use the discussions: https://github.com/SmileiPIC/Smilei/discussions/categories/q-a
Use the issues here when you want to report an actual bug or request a feature.
@tmiethlinger Note that the makefile has been modified to make GPU compilation easier. See this: https://smileipic.github.io/Smilei/Use/installation.html#setup-environment-variables-for-compilation and this: https://smileipic.github.io/Smilei/Use/installation.html#compilation-for-gpu-accelerated-nodes
Hello,
as the title says, I am trying to install Smilei for A100s on our cluster.
heads/master-0-g4f145b341
and many more errors/files with this message:
nvcc fatal : Unknown option '-Wno-reorder'
ldd: ./smilei: No such file or directory
I also want to note that I tried CUDA 12.0.0 as well, but in the chat I was encouraged (some time ago) to use either CUDA 11.2 or 11.6, which is why I tried 11.4 (since we didn't have the other versions installed).
Do you see where the problem could lie? Thank you.