ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

CUDA compile error (Invalid memory reference) #4613

Open zhazhajust opened 9 months ago

zhazhajust commented 9 months ago

I configured the build with cmake -S . -B cuda_build -DWarpX_COMPUTE=CUDA -DWarpX_DIMS="1;2;RZ;3", but I get an error when building the code.

[  2%] Building CUDA object _deps/localamrex-build/Src/CMakeFiles/amrex_2d.dir/Base/AMReX_FilND_C.cpp.o
nvcc error   : 'cicc' died due to signal 11 (Invalid memory reference)
nvcc error   : 'cicc' core dumped
gmake[2]: *** [_deps/localamrex-build/Src/CMakeFiles/amrex_3d.dir/Base/AMReX_FabArrayBase.cpp.o] Error 139
gmake[2]: *** Waiting for unfinished jobs....
[  2%] Building CUDA object _deps/localamrex-build/Src/CMakeFiles/amrex_2d.dir/Base/AMReX_NonLocalBC.cpp.o
nvcc error   : 'cicc' died due to signal 11 (Invalid memory reference)
nvcc error   : 'cicc' core dumped
gmake[2]: *** [_deps/localamrex-build/Src/CMakeFiles/amrex_1d.dir/Base/AMReX_FabArrayBase.cpp.o] Error 139
gmake[2]: *** Waiting for unfinished jobs....
[  2%] Building CUDA object _deps/localamrex-build/Src/CMakeFiles/amrex_2d.dir/Base/AMReX_PlotFileUtil.cpp.o
nvcc error   : 'cicc' died due to signal 11 (Invalid memory reference)
nvcc error   : 'cicc' core dumped
gmake[2]: *** [_deps/localamrex-build/Src/CMakeFiles/amrex_2d.dir/Base/AMReX_FabArrayBase.cpp.o] Error 139
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [_deps/localamrex-build/Src/CMakeFiles/amrex_1d.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....

ax3l commented 8 months ago

Hi @zhazhajust,

what did you run exactly after cmake -S . -B ...? Which system is this?

The most likely cause is that you requested too much parallelism with the -j option of cmake --build build -j ... and the build oversubscribed the machine. Lower the number passed to -j so that you do not oversubscribe the CPUs available for compiling.

HPC systems / login nodes / CI machines then usually kill the compiler, which is what you see here.
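A minimal sketch of capping the build parallelism, assuming a Linux shell with nproc available. The helper name pick_jobs and the default cap of 4 are hypothetical choices for illustration, not WarpX or CMake options:

```shell
# Print a conservative job count: at most the requested cap,
# and never more than the number of CPUs reported by nproc.
# pick_jobs and its default cap of 4 are illustrative assumptions.
pick_jobs() {
    max_jobs="${1:-4}"
    # Fall back to a single job if nproc is not available.
    cpus="$(nproc 2>/dev/null || echo 1)"
    if [ "$cpus" -lt "$max_jobs" ]; then
        echo "$cpus"
    else
        echo "$max_jobs"
    fi
}
```

It could then be used as cmake --build cuda_build -j "$(pick_jobs 4)", lowering the cap further if the compiler is still killed.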

If this does not help, please report:

It is also possible that you encountered an internal bug in NVCC and just need to use another version of the CUDA toolkit (e.g., CUDA 11.7 or newer is what we recommend).
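To check the toolkit against the recommended version, something like the following sketch may help. It assumes GNU sort (for sort -V) and the usual "release X.Y" line printed by nvcc --version. The helper names version_ge and nvcc_release are hypothetical:

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort); this is an assumption about the system.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Read `nvcc --version` output on stdin and print the release number,
# e.g. "11.0" from a line containing "release 11.0,".
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}
```

For example, version_ge "$(nvcc --version | nvcc_release)" 11.7 || echo "CUDA toolkit older than 11.7" would flag the CUDA 11.0 setup described later in this thread.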

Let me know how that goes.

zhazhajust commented 8 months ago

Hello, thanks for the help.

I tried removing the build directory and compiling with just one thread, but the problem still occurs.

The system is an HPC cluster with the configuration below, followed by my compile script.

Linux version: CentOS Linux release 7.8.2003 (Core)

Linux kernel: Linux master 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

CPU: Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz

GPU: V100 X 4

CUDA version: 11.0

GCC version: 9.3.0

MPI version: OpenMPI 4.1.4

#!/usr/bin/env sh

module load anaconda3/3.10
module load gcc/9.3.0/warpx/23

cd /home/mypath

#YEE CUDA
# export CUDACXX=$(which nvcc)
cmake -S . -B cuda_build -DWarpX_COMPUTE=CUDA \
-DCMAKE_INSTALL_PREFIX=/my_warpx_path/warpx_cuda \
-DWarpX_DIMS="1;2;RZ;3" -DWarpX_PYTHON=ON \
-DWarpX_openpmd_internal=OFF

# cmake --build cuda_build -j 16
cmake --build cuda_build > compile.log 2>&1
cmake --install cuda_build

Maybe the problem is that the CUDA version is too old, but nvidia-smi shows that the V100 seems to support only up to version 11.0. Is there any way to install a newer CUDA version with the V100 GPU?

pordyna commented 7 months ago

Hey, I got the same error yesterday while compiling different CUDA software. I followed https://forums.developer.nvidia.com/t/cicc-compilation-error-and-debug-flag/27910/25 and removed the --source-in-ptx and --generate-line-info nvcc flags, which helped. Maybe WarpX keeps the source by default?
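One way to check whether those flags are actually passed in a given build is to search a verbose build log for them. This is only a sketch: the helper name has_line_info_flags is hypothetical, and whether WarpX or AMReX sets these flags by default is not confirmed in this thread. The patterns also cover nvcc's short spellings -src-in-ptx and -lineinfo:

```shell
# Read build output on stdin and succeed if any of the nvcc flags
# mentioned above (long or short spelling) appears in it.
# has_line_info_flags is an illustrative helper, not part of WarpX.
has_line_info_flags() {
    grep -q -e '--source-in-ptx' -e '-src-in-ptx' \
            -e '--generate-line-info' -e '-lineinfo'
}
```

A possible use is cmake --build cuda_build --verbose > compile.log 2>&1, followed by has_line_info_flags < compile.log && echo "line-info flags present".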