deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

The PW GPU code with K-point parallelism does not work well #3425

Status: closed by denghuilu 10 months ago

denghuilu commented 10 months ago

Describe the bug

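When running with K-point parallelism in a GPU environment, we may encounter a segmentation fault. In the run below, the job aborts with a cuBLAS assertion before the first SCF iteration is printed: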

[denghui@LuDh-4090:pw_Si2]$ /usr/bin/mpirun -n 2 /home/denghui/abacus-develop/build/abacus 
hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 2,OpenMP thread number: 2,Total thread number: 4,Local thread limit: 56

                              ABACUS v3.5.0

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: e135c71cb (Sat Jan 13 11:32:38 2024 +0000)

 Sun Jan 14 11:21:56 2024
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : GPU / NVIDIA GeForce RTX 4090
 UNIFORM GRID DIM        : 36 * 36 * 36
 UNIFORM GRID DIM(BIG)   : 36 * 36 * 36
 DONE(1.49074    SEC) : SETUP UNITCELL
 DONE(1.54515    SEC) : SYMMETRY
 DONE(1.7073     SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  
 1       8               2           
 ---------------------------------------------------------
 Use plane wave basis
 ---------------------------------------------------------
 ELEMENT NATOM       XC          
 Si      2           
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(1.71664    SEC) : INIT PLANEWAVE
 MEMORY FOR PSI (MB)  : 1.78162
 DONE(1.72678    SEC) : LOCAL POTENTIAL
 DONE(1.75068    SEC) : NON-LOCAL POTENTIAL
 DONE(1.7729     SEC) : INIT BASIS
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic
 DONE(1.80318    SEC) : INIT SCF
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
cuBLAS Assert: CUBLAS_STATUS_INVALID_VALUE /home/denghui/abacus-develop/source/module_hsolver/kernels/cuda/math_kernel_op.cu 855
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[53215,1],1]
  Exit code:    7
--------------------------------------------------------------------------
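The log reports `CUBLAS_STATUS_INVALID_VALUE` but does not show which cuBLAS routine at line 855 of `math_kernel_op.cu` rejected its arguments. cuBLAS returns this status when a parameter is out of range, for example a negative dimension or a leading dimension below the documented minimum. As a rough sketch, assuming the failing call is GEMM-like (an assumption, not something the log confirms), a host-side check could name the offending argument before the library call is made. `check_gemm_args` and the `Op` enum are hypothetical names used for illustration only; `Op` stands in for `cublasOperation_t`, and the rules follow the column-major constraints documented for `cublas<t>gemm`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical stand-in for cublasOperation_t (not the real cuBLAS enum).
enum class Op { N, T, C };

// Returns an empty string when the arguments satisfy the documented GEMM
// constraints, otherwise a short description of the first violation found.
// Matrices are assumed to be stored in column-major order, as in cuBLAS.
std::string check_gemm_args(Op transa, Op transb, int m, int n, int k,
                            int lda, int ldb, int ldc)
{
    if (m < 0 || n < 0 || k < 0) {
        return "negative dimension";
    }
    // A is stored as (m x k) when not transposed, (k x m) otherwise.
    const int rows_a = (transa == Op::N) ? m : k;
    // B is stored as (k x n) when not transposed, (n x k) otherwise.
    const int rows_b = (transb == Op::N) ? k : n;

    // Each leading dimension must be at least max(1, stored row count).
    if (lda < std::max(1, rows_a)) return "lda too small";
    if (ldb < std::max(1, rows_b)) return "ldb too small";
    if (ldc < std::max(1, m))      return "ldc too small";
    return "";
}
```

Note that a leading dimension of 0 is rejected even when the matrix has zero rows, because the minimum is `max(1, rows)`. Whether that is what happens on one of the two MPI processes here is not established by this thread.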

Expected behavior

The calculation should run to completion, without the cuBLAS assertion, when K-point parallelism is used on a GPU.

To Reproduce

No response

Environment

No response

Additional Context

No response

denghuilu commented 10 months ago

This is a very unusual issue: so far, the bug has appeared only with the Debug build, and only when running on RTX 4090 GPUs.

denghuilu commented 10 months ago

The situation has become even more peculiar: the code runs when compiled with Intel oneAPI, but not when compiled with GCC.

CMake configuration for the Intel build, which works:

[denghui@LuDh-4090:build]$ CXX=icpc cmake -DUSE_CUDA=ON -DENABLE_FLOAT_FFTW=ON -DENABLE_DEEPKS=OFF -DENABLE_LIBXC=ON -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Debug -DUSE_ELPA=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_VERBOSE_MAKEFILE=OFF -DCMAKE_INSTALL_PREFIX=/home/denghui/soft/abacus-develop .. 
-- The CXX compiler identification is Intel 2021.10.0.20230609
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/intel/oneapi/compiler/2023.2.1/linux/bin/intel64/icpc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.43.0") 
-- Found git: attempting to get commit info...
-- Current commit hash: e135c71cb
-- Last commit date: Sat Jan 13 11:32:38 2024 +0000
-- Found Cereal: /usr/local/include  
-- Found MPI_CXX: /opt/intel/oneapi/mpi/2021.10.0/lib/libmpicxx.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Found OpenMP_CXX: -qopenmp (found version "5.0") 
-- Found OpenMP: TRUE (found version "5.0")  
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140") 
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found IntelMKL: /opt/intel/oneapi/mkl/2023.2.0/lib/intel64/libmkl_intel_lp64.so  
-- Found Libxc: version 5.1.7
-- Configuring done
-- Generating done
-- Build files have been written to: /home/denghui/abacus-develop/build

CMake configuration for the GCC build, which fails at run time:

[denghui@LuDh-4090:build]$ cmake -DUSE_CUDA=ON -DENABLE_FLOAT_FFTW=ON -DENABLE_DEEPKS=OFF -DENABLE_LIBXC=OFF -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Debug -DUSE_ELPA=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_VERBOSE_MAKEFILE=OFF -DCMAKE_INSTALL_PREFIX=/home/denghui/soft/abacus-develop .. 
-- The CXX compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.43.0") 
-- Found git: attempting to get commit info...
-- Current commit hash: e135c71cb
-- Last commit date: Sat Jan 13 11:32:38 2024 +0000
-- Found Cereal: /usr/local/include  
-- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140") 
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found FFTW3: /usr/lib/x86_64-linux-gnu/libfftw3_omp.so  
-- Found LAPACK: /usr/lib/x86_64-linux-gnu/libopenblas.so  
-- Found ScaLAPACK: /usr/lib/x86_64-linux-gnu/libscalapack-openmpi.so  
-- Configuring done
-- Generating done
-- Build files have been written to: /home/denghui/abacus-develop/build
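The two configurations differ in more than the compiler: the Intel build picks up Intel MPI and MKL (with `-DENABLE_LIBXC=ON`), while the GCC build picks up Open MPI, FFTW3, OpenBLAS and ScaLAPACK (with `-DENABLE_LIBXC=OFF`). To list such differences mechanically, the `-- Found <name>: <value>` lines of the two CMake outputs can be compared. The following is only an illustrative sketch: `found_components` and `differing_components` are hypothetical helper names, and the parsing assumes the line format shown in the logs above.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Collects "-- Found <name>: <value>" lines from CMake configure output
// into a name -> value map (value trimmed of surrounding whitespace).
std::map<std::string, std::string> found_components(const std::string& log)
{
    std::map<std::string, std::string> out;
    const std::string prefix = "-- Found ";
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, prefix.size(), prefix) != 0) continue;
        const std::size_t colon = line.find(':', prefix.size());
        if (colon == std::string::npos) continue;
        const std::string name = line.substr(prefix.size(), colon - prefix.size());
        const std::string value = line.substr(colon + 1);
        const std::size_t first = value.find_first_not_of(" \t");
        const std::size_t last = value.find_last_not_of(" \t\r");
        out[name] = (first == std::string::npos)
                        ? std::string()
                        : value.substr(first, last - first + 1);
    }
    return out;
}

// Returns the component names whose value differs between the two logs,
// followed by the names that appear only in the second log.
std::vector<std::string> differing_components(const std::string& log_a,
                                              const std::string& log_b)
{
    const auto a = found_components(log_a);
    const auto b = found_components(log_b);
    std::vector<std::string> names;
    for (const auto& kv : a) {
        const auto it = b.find(kv.first);
        if (it == b.end() || it->second != kv.second) names.push_back(kv.first);
    }
    for (const auto& kv : b) {
        if (a.find(kv.first) == a.end()) names.push_back(kv.first);
    }
    return names;
}
```

Applied to the two outputs above, this would flag at least `MPI_CXX`, `OpenMP_CXX`, `IntelMKL`, `LAPACK`, `FFTW3` and `ScaLAPACK`, each of which is a candidate to vary one at a time when narrowing down the failure.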