The PW GPU code with K-point parallelism does not work well

denghuilu commented 10 months ago

Describe the bug

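When running with K-point parallelism in a GPU environment, the calculation can crash. In the run below, it aborts with a cuBLAS assertion (CUBLAS_STATUS_INVALID_VALUE) right after the SCF iteration header is printed.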

[denghui@LuDh-4090:pw_Si2]$ /usr/bin/mpirun -n 2 /home/denghui/abacus-develop/build/abacus 
hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 2,OpenMP thread number: 2,Total thread number: 4,Local thread limit: 56

                              ABACUS v3.5.0

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: e135c71cb (Sat Jan 13 11:32:38 2024 +0000)

 Sun Jan 14 11:21:56 2024
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : GPU / NVIDIA GeForce RTX 4090
 UNIFORM GRID DIM        : 36 * 36 * 36
 UNIFORM GRID DIM(BIG)   : 36 * 36 * 36
 DONE(1.49074    SEC) : SETUP UNITCELL
 DONE(1.54515    SEC) : SYMMETRY
 DONE(1.7073     SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  
 1       8               2           
 ---------------------------------------------------------
 Use plane wave basis
 ---------------------------------------------------------
 ELEMENT NATOM       XC          
 Si      2           
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(1.71664    SEC) : INIT PLANEWAVE
 MEMORY FOR PSI (MB)  : 1.78162
 DONE(1.72678    SEC) : LOCAL POTENTIAL
 DONE(1.75068    SEC) : NON-LOCAL POTENTIAL
 DONE(1.7729     SEC) : INIT BASIS
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic
 DONE(1.80318    SEC) : INIT SCF
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
cuBLAS Assert: CUBLAS_STATUS_INVALID_VALUE /home/denghui/abacus-develop/source/module_hsolver/kernels/cuda/math_kernel_op.cu 855
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[53215,1],1]
  Exit code:    7
--------------------------------------------------------------------------
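
The log does not show which cuBLAS routine sits at line 855 of math_kernel_op.cu, so the following is only a sketch of one way to narrow the failure down, under the assumption that the failing call is a GEMM. The cuBLAS documentation lists negative dimensions and too-small leading dimensions among the causes of CUBLAS_STATUS_INVALID_VALUE for GEMM; the helper below mirrors those conditions on the host so the arguments can be printed per MPI rank before the call. The name check_gemm_args and its return strings are hypothetical and are not part of ABACUS or cuBLAS.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (not part of ABACUS or cuBLAS): checks GEMM arguments on
// the host against the conditions the cuBLAS documentation gives for
// CUBLAS_STATUS_INVALID_VALUE. Matrices are column-major and the product is
// C (m x n) = op(A) (m x k) * op(B) (k x n).
// Returns an empty string when the arguments pass, otherwise a short reason.
std::string check_gemm_args(bool trans_a, bool trans_b,
                            int m, int n, int k,
                            int lda, int ldb, int ldc)
{
    if (m < 0 || n < 0 || k < 0) {
        return "negative dimension";
    }
    // Number of rows of A and B as stored, which depends on the transpose flag.
    const int rows_a = trans_a ? k : m;
    const int rows_b = trans_b ? n : k;
    if (lda < std::max(1, rows_a)) {
        return "lda too small";
    }
    if (ldb < std::max(1, rows_b)) {
        return "ldb too small";
    }
    if (ldc < std::max(1, m)) {
        return "ldc too small";
    }
    return "";
}
```

Note that a leading dimension of 0 is rejected even when the matching dimension is 0. If one rank ends up with an empty local block under K-point parallelism, that would be one way to trigger this status; this is a guess to check, not a diagnosis of the bug.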

Expected behavior

The calculation should run to completion with K-point parallelism on the GPU.

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

[X] Verify the issue is not a duplicate.
[X] Describe the bug.
[X] Steps to reproduce.
[X] Expected behavior.
[X] Error message.
[X] Environment details.
[X] Additional context.
[ ] Assign a priority level (low, medium, high, urgent).
[ ] Assign the issue to a team member.
[ ] Label the issue with relevant tags.
[ ] Identify possible related issues.
[ ] Create a unit test or automated test to reproduce the bug (if applicable).
[ ] Fix the bug.
[ ] Test the fix.
[ ] Update documentation (if necessary).
[ ] Close the issue and inform the reporter (if applicable).

denghuilu commented 10 months ago

I've run into something very unusual: so far, this bug has appeared only with the debug build when running on RTX 4090 GPUs.

denghuilu commented 10 months ago

The situation has become even more peculiar: the code runs when compiled with Intel's oneAPI compilers, but not with GCC. The CMake configuration with the Intel compiler, which works:

[denghui@LuDh-4090:build]$ CXX=icpc cmake -DUSE_CUDA=ON -DENABLE_FLOAT_FFTW=ON -DENABLE_DEEPKS=OFF -DENABLE_LIBXC=ON -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Debug -DUSE_ELPA=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_VERBOSE_MAKEFILE=OFF -DCMAKE_INSTALL_PREFIX=/home/denghui/soft/abacus-develop .. 
-- The CXX compiler identification is Intel 2021.10.0.20230609
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/intel/oneapi/compiler/2023.2.1/linux/bin/intel64/icpc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.43.0") 
-- Found git: attempting to get commit info...
-- Current commit hash: e135c71cb
-- Last commit date: Sat Jan 13 11:32:38 2024 +0000
-- Found Cereal: /usr/local/include  
-- Found MPI_CXX: /opt/intel/oneapi/mpi/2021.10.0/lib/libmpicxx.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Found OpenMP_CXX: -qopenmp (found version "5.0") 
-- Found OpenMP: TRUE (found version "5.0")  
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140") 
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found IntelMKL: /opt/intel/oneapi/mkl/2023.2.0/lib/intel64/libmkl_intel_lp64.so  
-- Found Libxc: version 5.1.7
-- Configuring done
-- Generating done
-- Build files have been written to: /home/denghui/abacus-develop/build

The CMake configuration with GCC, which fails:

[denghui@LuDh-4090:build]$ cmake -DUSE_CUDA=ON -DENABLE_FLOAT_FFTW=ON -DENABLE_DEEPKS=OFF -DENABLE_LIBXC=OFF -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Debug -DUSE_ELPA=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_VERBOSE_MAKEFILE=OFF -DCMAKE_INSTALL_PREFIX=/home/denghui/soft/abacus-develop .. 
-- The CXX compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.43.0") 
-- Found git: attempting to get commit info...
-- Current commit hash: e135c71cb
-- Last commit date: Sat Jan 13 11:32:38 2024 +0000
-- Found Cereal: /usr/local/include  
-- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140") 
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found FFTW3: /usr/lib/x86_64-linux-gnu/libfftw3_omp.so  
-- Found LAPACK: /usr/lib/x86_64-linux-gnu/libopenblas.so  
-- Found ScaLAPACK: /usr/lib/x86_64-linux-gnu/libscalapack-openmpi.so  
-- Configuring done
-- Generating done
-- Build files have been written to: /home/denghui/abacus-develop/build
