SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

Problem installing Smilei for A100 GPUs #684

Closed: tmiethlinger closed this issue 8 months ago

tmiethlinger commented 8 months ago

Hello,

As the title says, I am trying to install Smilei for A100 GPUs on our cluster.

I also tried CUDA 12.0.0, but some time ago I was encouraged in the chat to use either CUDA 11.2 or 11.6, which is why I tried 11.4 (we did not have the other versions installed).

Do you see where the problem could lie? Thank you.

charlesprouveur commented 8 months ago

Hello, can you show the output of "module list"?

The file jean_zay_gpu_A100 should be used as inspiration; your system will probably need modifications. For instance, we specify -L/gpfslocalsys/cuda/11.2/lib64/, which would be useless on your cluster.

I see you have COMPILER_INFO : g++. It should be nvc++, which leads me to think that you do not have an nvhpc module loaded and that your hdf5 module was probably not compiled with it either.

tmiethlinger commented 8 months ago

Thank you for your reply.

Here's the output of module list (here CUDA 11.8 is used):

Currently Loaded Modules:
1) release/23.04 (S)
2) GCCcore/11.3.0
3) zlib/1.2.12
4) binutils/2.38
5) GCC/11.3.0
6) numactl/2.0.14
7) XZ/5.2.5
8) libxml2/2.9.13
9) libpciaccess/0.16
10) hwloc/2.7.1
11) OpenSSL/1.1
12) libevent/2.1.12
13) UCX/1.12.1
14) libfabric/1.15.1
15) PMIx/4.1.2
16) UCC/1.0.0
17) OpenMPI/4.1.4
18) OpenBLAS/0.3.20
19) FlexiBLAS/3.2.0
20) FFTW/3.3.10
21) FFTW.MPI/3.3.10
22) ScaLAPACK/2.2.0-fb
23) foss/2022a
24) CUDA/11.8.0
25) ncurses/6.3
26) bzip2/1.0.8
27) cURL/7.83.0
28) libarchive/3.6.1
29) CMake/3.24.3
30) Szip/2.1.1
31) HDF5/1.13.2

charlesprouveur commented 8 months ago

As expected, you do not have an nvhpc module loaded. That module includes the nvc++ compiler, which is required to compile the code; the cuda module alone only contains the nvcc compiler, which is used to compile the CUDA files but not the rest of the code. I recommend installing nvhpc 23.1, which comes with its own cuda and openmpi. You would then only need to compile an hdf5 module with it to have all dependencies ready.
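A quick way to see which of these compilers the current environment actually provides is sketched below. The helper name report_tool is hypothetical (it is not part of Smilei), and the last step assumes the MPI wrapper comes from Open MPI, whose --showme:command option prints the underlying compiler; other MPI implementations may not accept it.

```shell
# Hypothetical helper: print where a tool resolves on PATH, or "not found".
report_tool() {
    if location=$(command -v "$1" 2>/dev/null); then
        printf '%s: %s\n' "$1" "$location"
    else
        printf '%s: not found\n' "$1"
    fi
}

report_tool nvc++    # NVHPC C++ compiler, needed for the non-CUDA sources
report_tool nvcc     # CUDA compiler, also present with a cuda module alone
report_tool mpicxx   # MPI C++ wrapper

# Assumption: Open MPI wrapper. Shows which compiler mpicxx really calls.
if command -v mpicxx >/dev/null 2>&1; then
    mpicxx --showme:command 2>/dev/null || echo "mpicxx does not support --showme:command"
fi
```

If nvc++ is reported as not found, or the wrapper reports g++, the environment probably matches the situation described above.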

tmiethlinger commented 8 months ago

Hi, I have now successfully installed nvhpc 23.11. Which flags do I need to adjust in my machine file? This is my current machine file (tm_gpu_A100):

SMILEICXX.DEPS = nvcc
THRUSTCXX = nvcc
ACCELERATOR_GPU_FLAGS += -w
ACCELERATOR_GPU_FLAGS += -tp=zen3 -ta=tesla:cc80 -std=c++14  -lcurand -Mcudalib=curand
ACCELERATOR_GPU_KERNEL_FLAGS += -O3 --std c++14 $(DIRS:%=-I%)
ACCELERATOR_GPU_KERNEL_FLAGS += --expt-relaxed-constexpr
ACCELERATOR_GPU_KERNEL_FLAGS += $(shell $(PYTHONCONFIG) --includes)
ACCELERATOR_GPU_KERNEL_FLAGS += -arch=sm_80
ACCELERATOR_GPU_FLAGS        += -Minfo=accel # what is offloaded/copied
ACCELERATOR_GPU_FLAGS += -DSMILEI_OPENACC_MODE
ACCELERATOR_GPU_KERNEL_FLAGS += -DSMILEI_OPENACC_MODE
LDFLAGS += -ta=tesla:cc80 -std=c++14 -Mcudalib=curand -lcudart -lcurand -lacccuda -L/home/myuser/lib/nvidia/hpc_sdk/Linux_x86_64/23.11/cuda/12.3/lib64/
CXXFLAGS +=  -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1

However, running make machine="tm_gpu_A100" config="gpu_nvidia noopenmp verbose" -j1 gives:

Checking dependencies for src/Tools/tabulatedFunctions.cpp
if [ ! -d "build/src/Tools" ]; then mkdir -p "build/src/Tools"; fi;
nvcc -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 -D__VERSION=\"5.0-57-gc23dd350a-master\" -DOMPI_SKIP_MPICXX -std=c++14  -I/home/thmi817d/lib/hdf5_nvhpc/include -Isrc -Isrc/ElectroMagnBC -Isrc/SmileiMPI -Isrc/ParticleInjector -Isrc/DomainDecomposition -Isrc/Pusher -Isrc/Species -Isrc/Particles -Isrc/ElectroMagn -Isrc/Params -Isrc/picsar_interface -Isrc/Profiles -Isrc/Radiation -Isrc/Checkpoint -Isrc/ParticleBC -Isrc/Tools -Isrc/Field -Isrc/Collisions -Isrc/Interpolator -Isrc/ElectroMagnSolver -Isrc/MultiphotonBreitWheeler -Isrc/Ionization -Isrc/MovWindow -Isrc/Diagnostic -Isrc/Python -Isrc/Merging -Isrc/Projector -Isrc/Patch -Isrc/PartCompTime -Ibuild/src/Python -I/home/thmi817d/miniconda3/envs/smilei/include/python3.9 -I/home/thmi817d/miniconda3/envs/smilei/include/python3.9 -I/home/thmi817d/miniconda3/envs/smilei/lib/python3.9/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g -MF"build/src/Tools/tabulatedFunctions.d" -MM -MP -MT"build/src/Tools/tabulatedFunctions.d build/src/Tools/tabulatedFunctions.o" src/Tools/tabulatedFunctions.cpp
nvcc fatal   : Unknown option '-MFbuild/src/Tools/tabulatedFunctions.d'
Checking dependencies for src/Tools/PyTools.cpp
...

My current Smilei profile looks like:

NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/home/myuser/lib/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/23.11/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/23.11/compilers/bin:$PATH; export PATH

export PATH=$NVCOMPILERS/$NVARCH/23.11/comm_libs/mpi/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/23.11/comm_libs/mpi/man

export HDF5_ROOT=$HOME/lib/hdf5_nvhpc
export LD_LIBRARY_PATH=$HDF5_ROOT/lib:$LD_LIBRARY_PATH

Do you see what the issue might be? The folders 23.11/compilers and 23.11/comm_libs exist, so I think that part should be correct.
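The check that these folders exist can be made repeatable with a small sketch like the one below. The function name check_dirs is hypothetical, and the sub-directory names in the usage line are simply the ones that the profile above adds to PATH.

```shell
# Hypothetical helper: report which sub-directories exist under an install root.
# Returns 0 if all of them exist, 1 if at least one is missing.
check_dirs() {
    root=$1
    shift
    missing=0
    for sub in "$@"; do
        if [ -d "$root/$sub" ]; then
            printf 'ok:      %s\n' "$root/$sub"
        else
            printf 'missing: %s\n' "$root/$sub"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, after sourcing the profile (NVCOMPILERS and NVARCH must be set):
#   check_dirs "$NVCOMPILERS/$NVARCH/23.11" compilers/bin comm_libs/mpi/bin
```

This only confirms that the paths are present; it does not show which compiler the build actually invokes.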

charlesprouveur commented 8 months ago

You installed nvhpc 23.11, which might contain cuda 11.8 and/or cuda 12.3. For cuda 12.3, there are known issues that we are currently working on. For cuda 11.8, modifications in the code might be needed, which is why I recommended nvhpc 23.1. You can get it here: https://developer.nvidia.com/nvidia-hpc-sdk-231-downloads

To answer your questions:

mccoys commented 8 months ago

In the future, please use the chatroom for support: https://app.element.io/#/room/!LQrdVpOJEohPSWMlmf:matrix.org

If you need more space to describe your problem, use the discussions: https://github.com/SmileiPIC/Smilei/discussions/categories/q-a

Use issues here when you want to report an actual bug or make a feature request.

mccoys commented 7 months ago

@tmiethlinger Note that the makefile has been modified to make GPU compilation easier. See this: https://smileipic.github.io/Smilei/Use/installation.html#setup-environment-variables-for-compilation and this: https://smileipic.github.io/Smilei/Use/installation.html#compilation-for-gpu-accelerated-nodes