Closed denghuilu closed 10 months ago
I've run into a very unusual bug: so far it has appeared only with the debug build when running on 4090 GPU cards.
Stranger still, the program runs when compiled with Intel's oneAPI compiler but not when compiled with GCC. The CMake build with Intel works:
[denghui@LuDh-4090:build]$ CXX=icpc cmake -DUSE_CUDA=ON -DENABLE_FLOAT_FFTW=ON -DENABLE_DEEPKS=OFF -DENABLE_LIBXC=ON -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Debug -DUSE_ELPA=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_VERBOSE_MAKEFILE=OFF -DCMAKE_INSTALL_PREFIX=/home/denghui/soft/abacus-develop ..
-- The CXX compiler identification is Intel 2021.10.0.20230609
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/intel/oneapi/compiler/2023.2.1/linux/bin/intel64/icpc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.43.0")
-- Found git: attempting to get commit info...
-- Current commit hash: e135c71cb
-- Last commit date: Sat Jan 13 11:32:38 2024 +0000
-- Found Cereal: /usr/local/include
-- Found MPI_CXX: /opt/intel/oneapi/mpi/2021.10.0/lib/libmpicxx.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found OpenMP_CXX: -qopenmp (found version "5.0")
-- Found OpenMP: TRUE (found version "5.0")
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140")
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found IntelMKL: /opt/intel/oneapi/mkl/2023.2.0/lib/intel64/libmkl_intel_lp64.so
-- Found Libxc: version 5.1.7
-- Configuring done
-- Generating done
-- Build files have been written to: /home/denghui/abacus-develop/build
The build configured with GCC fails, although the CMake configuration step itself completes:
[denghui@LuDh-4090:build]$ cmake -DUSE_CUDA=ON -DENABLE_FLOAT_FFTW=ON -DENABLE_DEEPKS=OFF -DENABLE_LIBXC=OFF -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Debug -DUSE_ELPA=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_VERBOSE_MAKEFILE=OFF -DCMAKE_INSTALL_PREFIX=/home/denghui/soft/abacus-develop ..
-- The CXX compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.43.0")
-- Found git: attempting to get commit info...
-- Current commit hash: e135c71cb
-- Last commit date: Sat Jan 13 11:32:38 2024 +0000
-- Found Cereal: /usr/local/include
-- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140")
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found FFTW3: /usr/lib/x86_64-linux-gnu/libfftw3_omp.so
-- Found LAPACK: /usr/lib/x86_64-linux-gnu/libopenblas.so
-- Found ScaLAPACK: /usr/lib/x86_64-linux-gnu/libscalapack-openmpi.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/denghui/abacus-develop/build
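When narrowing down a compiler-dependent failure like this one, it can help to confirm exactly how the two configurations differ. The configure logs above show that the two builds also pick up different MPI and math libraries (Intel MPI and MKL versus OpenMPI, FFTW3, OpenBLAS and ScaLAPACK), so the compiler may not be the only variable. The following is a minimal sketch, not part of ABACUS, that compares the environment variables and `-D` options of the two cmake commands quoted above; the helper names `parse_cmake_command` and `diff_settings` are hypothetical and exist only for this illustration.

```python
import shlex

# The two command lines as quoted in this report (shell prompt removed).
INTEL_CMD = (
    "CXX=icpc cmake -DUSE_CUDA=ON -DENABLE_FLOAT_FFTW=ON -DENABLE_DEEPKS=OFF "
    "-DENABLE_LIBXC=ON -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Debug -DUSE_ELPA=OFF "
    "-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_VERBOSE_MAKEFILE=OFF "
    "-DCMAKE_INSTALL_PREFIX=/home/denghui/soft/abacus-develop .."
)
GCC_CMD = (
    "cmake -DUSE_CUDA=ON -DENABLE_FLOAT_FFTW=ON -DENABLE_DEEPKS=OFF "
    "-DENABLE_LIBXC=OFF -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Debug -DUSE_ELPA=OFF "
    "-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_VERBOSE_MAKEFILE=OFF "
    "-DCMAKE_INSTALL_PREFIX=/home/denghui/soft/abacus-develop .."
)


def parse_cmake_command(command):
    """Split a command line into (env assignments, -D definitions).

    Hypothetical helper: tokens before the literal word 'cmake' that contain
    '=' are treated as environment assignments; tokens after it that look
    like -DNAME=VALUE are treated as cache definitions.
    """
    tokens = shlex.split(command)
    cmake_index = tokens.index("cmake")
    env = dict(t.split("=", 1) for t in tokens[:cmake_index] if "=" in t)
    defines = {}
    for token in tokens[cmake_index + 1:]:
        if token.startswith("-D") and "=" in token:
            name, value = token[2:].split("=", 1)
            defines[name] = value
    return env, defines


def diff_settings(first, second):
    """Return {name: (value_in_first, value_in_second)} for differing keys."""
    names = sorted(set(first) | set(second))
    return {n: (first.get(n), second.get(n)) for n in names
            if first.get(n) != second.get(n)}


intel_env, intel_defines = parse_cmake_command(INTEL_CMD)
gcc_env, gcc_defines = parse_cmake_command(GCC_CMD)

env_diff = diff_settings(intel_env, gcc_env)
define_diff = diff_settings(intel_defines, gcc_defines)

print("environment differences:", env_diff)
print("-D option differences:", define_diff)
```

For the two commands in this report, the comparison flags `CXX` (set to `icpc` only in the Intel build) and `ENABLE_LIBXC` (`ON` for Intel, `OFF` for GCC). Aligning `ENABLE_LIBXC` between the two builds before re-testing would remove one unrelated variable from the comparison.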
Describe the bug
When running with k-point parallelism in a GPU environment, a segmentation fault may occur.
Expected behavior
Fix this issue.
To Reproduce
No response
Environment
No response
Additional Context
No response
Task list for Issue attackers (only for developers)