deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

DCU results are not consistent with CPU results (Device) #4017

Closed pxlxingliang closed 3 months ago

pxlxingliang commented 5 months ago

Describe the bug

The two examples below show a large energy difference between the DCU and CPU results.

             converge       energy device
cpu/075_NCe      True -1561.087565    cpu
cpu/084_PLa      True -1061.115456    cpu
dcu/075_NCe      True -1564.795349    gpu
dcu/084_PLa      True -1062.324460    gpu
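
To put a number on the discrepancy, here is a minimal Python sketch that pairs each CPU run with its DCU counterpart and flags differences above a tolerance. The energies are copied from the table above; the function name `compare_energies` and the 1e-4 eV tolerance are assumptions chosen for illustration and are not part of ABACUS.

```python
# Energies (eV) copied from the table above, keyed by "<device dir>/<example>".
energies = {
    "cpu/075_NCe": -1561.087565,
    "cpu/084_PLa": -1061.115456,
    "dcu/075_NCe": -1564.795349,
    "dcu/084_PLa": -1062.324460,
}


def compare_energies(energies, tol=1e-4):
    """Return {example: (delta_eV, within_tol)} for every cpu/dcu pair.

    tol is a hypothetical threshold in eV, chosen only for illustration.
    """
    report = {}
    for key, e_cpu in energies.items():
        device, example = key.split("/", 1)
        if device != "cpu":
            continue  # start from the CPU entry of each pair
        e_dcu = energies.get(f"dcu/{example}")
        if e_dcu is None:
            continue  # no DCU run recorded for this example
        delta = e_dcu - e_cpu
        report[example] = (delta, abs(delta) <= tol)
    return report


for example, (delta, ok) in compare_energies(energies).items():
    print(f"{example}: DCU - CPU = {delta:+.6f} eV, consistent = {ok}")
```

With the values reported here, both examples differ by more than 1 eV, far beyond any reasonable numerical-noise threshold.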

The DCU log for 075_NCe is:

 START CHARGE      : atomic
 DONE(1.6609     SEC) : INIT SCF
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)
 CG1    -1.549556e+03  0.000000e+00   1.271e+00  5.142e+01
 CG2    -1.529498e+03  2.005724e+01   2.598e+01  3.501e+01
 CG3    -1.564413e+03  -3.491479e+01  3.932e-01  2.706e+01
 CG4    -1.564398e+03  1.487653e-02   1.507e-01  2.179e+01
 CG5    -1.564778e+03  -3.793974e-01  6.994e-04  2.375e+01
 CG6    -1.564794e+03  -1.590618e-02  3.525e-03  4.718e+01
 CG7    -1.564795e+03  -1.010381e-03  7.883e-04  2.303e+01
 CG8    -1.564795e+03  -8.386105e-04  1.463e-04  2.177e+01
 CG9    -1.564795e+03  7.329665e-05   1.246e-04  1.963e+01
 CG10   -1.564796e+03  -3.411917e-04  3.197e-05  1.875e+01
 CG11   -1.564795e+03  1.926494e-04   3.556e-06  2.371e+01
 CG12   -1.564796e+03  -4.020385e-04  3.804e-06  2.491e+01
 CG13   -1.564796e+03  2.979026e-04   6.320e-06  2.481e+01
 CG14   -1.564795e+03  1.859469e-04   3.815e-07  1.978e+01
 CG15   -1.564795e+03  5.911349e-05   2.968e-09  2.591e+01

The CPU log for 075_NCe is:

 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)
 DA1    -1.551602e+03  0.000000e+00   1.181e+00  3.926e+01
 DA2    -1.525382e+03  2.622072e+01   2.602e+01  3.144e+01
 DA3    -1.561370e+03  -3.598852e+01  3.821e-01  2.922e+01
 DA4    -1.560817e+03  5.530934e-01   1.890e-01  1.882e+01
 DA5    -1.561091e+03  -2.739730e-01  7.426e-03  2.282e+01
 DA6    -1.561077e+03  1.384721e-02   4.199e-03  2.302e+01
 DA7    -1.561088e+03  -1.071650e-02  1.216e-04  2.289e+01
 DA8    -1.561087e+03  1.108855e-03   3.370e-04  2.999e+01
 DA9    -1.561088e+03  -7.190405e-04  2.464e-05  2.444e+01
 DA10   -1.561088e+03  -4.450049e-06  5.798e-06  1.907e+01
 DA11   -1.561088e+03  -1.286566e-05  1.279e-07  2.175e+01
 DA12   -1.561088e+03  -6.877493e-07  1.056e-07  2.921e+01
 DA13   -1.561088e+03  -5.208270e-08  2.526e-08  1.985e+01
 DA14   -1.561088e+03  -1.935690e-08  1.054e-09  1.932e+01
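
For comparing logs like the two above without reading them by eye, the following Python sketch extracts the SCF rows and reports the final total energy. The row layout (label, ETOT, EDIFF, DRHO, TIME) is taken from the logs in this issue; the helper name `parse_scf` and the regular expression are assumptions for illustration, and other ABACUS versions may print a different layout.

```python
import re

# One SCF row: a label such as CG15 or DA14, then ETOT, EDIFF, DRHO, TIME.
ROW = re.compile(r"^\s*([A-Z]+)(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$")


def parse_scf(log_text):
    """Return a list of dicts, one per SCF iteration found in log_text."""
    rows = []
    for line in log_text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header or unrelated line
        try:
            etot, ediff, drho, time_s = (float(m.group(i)) for i in range(3, 7))
        except ValueError:
            continue  # label matched but the columns were not numeric
        rows.append({
            "label": m.group(1),
            "iter": int(m.group(2)),
            "etot": etot,
            "ediff": ediff,
            "drho": drho,
            "time": time_s,
        })
    return rows


# Last rows of the DCU and CPU logs shown above.
DCU_TAIL = """
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)
 CG14   -1.564795e+03  1.859469e-04   3.815e-07  1.978e+01
 CG15   -1.564795e+03  5.911349e-05   2.968e-09  2.591e+01
"""
CPU_TAIL = """
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)
 DA13   -1.561088e+03  -5.208270e-08  2.526e-08  1.985e+01
 DA14   -1.561088e+03  -1.935690e-08  1.054e-09  1.932e+01
"""

dcu_last = parse_scf(DCU_TAIL)[-1]
cpu_last = parse_scf(CPU_TAIL)[-1]
print("final ETOT difference (eV):", dcu_last["etot"] - cpu_last["etot"])
```

The same helper can be pointed at complete log files to compare per-iteration energies between a CPU and a DCU run.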

d.zip

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

WHUweiqingzhou commented 5 months ago

@denghuilu could you have a look?

denghuilu commented 4 months ago

This issue may be related to improper use of the DTK environment: I used the latest version of DTK and found that the DCU results align with those from the CPU and GPU.

Below is my test environment:

[aisi@j18r1n12:084_PLa]$ module list 
Currently Loaded Modulefiles:
  1) compiler/devtoolset/7.3.1   2) compiler/rocm/dtk-23.10     3) compiler/cmake/3.23.3       4) mpi/hpcx/2.11.0/gcc-7.3.1

With the cmake command:

CC=clang CXX=clang++ cmake -DUSE_OPENMP=OFF -DENABLE_LCAO=OFF -DFFTW3_DIR=/public/home/aisi/users/denghui/abacus/soft/fftw-3.3.9 -DLAPACK_DIR=/public/home/aisi/users/denghui/abacus/soft/OpenBLAS -DCMAKE_VERBOSE_MAKEFILE=true -DUSE_ROCM=ON -DCOMMIT_INFO=OFF ..

And the corresponding executable file info:

[aisi@j18r1n12:084_PLa]$ ldd -r ../../abacus-develop-2024-04-26/abacus-develop/build/abacus_pw 
        linux-vdso.so.1 =>  (0x00002b562e8cc000)
        libfftw3.so.3 => /public/home/aisi/users/denghui/abacus/soft/fftw-3.3.9-shared/lib/libfftw3.so.3 (0x00002b562f40f000)
        libgfortran.so.4 => /lib64/libgfortran.so.4 (0x00002b562f725000)
        libm.so.6 => /lib64/libm.so.6 (0x00002b562fb01000)
        libmpi.so.40 => /opt/hpc/software/mpi/hpcx/v2.11.0/gcc-7.3.1/lib/libmpi.so.40 (0x00002b562fe03000)
        libopenblas.so.0 => /public/home/aisi/users/denghui/abacus/soft/OpenBLAS/lib/libopenblas.so.0 (0x00002b5630133000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b56310bb000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002b56312d7000)
        libgalaxyhip.so.5 => /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5 (0x00002b56314db000)
        libhipfft.so => /public/software/compiler/rocm/dtk-23.10/lib/libhipfft.so (0x00002b5639b8d000)
        libhipblas.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/libhipblas.so.0 (0x00002b5639e0d000)
        libhipsolver.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/libhipsolver.so.0 (0x00002b563a06b000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00002b563a2b1000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b563a5b8000)
        libc.so.6 => /lib64/libc.so.6 (0x00002b563a7ce000)
        libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00002b563ab9b000)
        /lib64/ld-linux-x86-64.so.2 (0x00002b562e8aa000)
        libopen-rte.so.40 => /opt/hpc/software/mpi/hpcx/v2.11.0/gcc-7.3.1/lib/libopen-rte.so.40 (0x00002b563add7000)
        libopen-pal.so.40 => /opt/hpc/software/mpi/hpcx/v2.11.0/gcc-7.3.1/lib/libopen-pal.so.40 (0x00002b563b08d000)
        librt.so.1 => /lib64/librt.so.1 (0x00002b563b342000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002b563b54a000)
        libz.so.1 => /lib64/libz.so.1 (0x00002b563b74d000)
        libhwloc.so.15 => /opt/hpc/software/mpi/hwloc/lib/libhwloc.so.15 (0x00002b563b963000)
        libudev.so.1 => /lib64/libudev.so.1 (0x00002b563bbad000)
        libxml2.so.2 => /lib64/libxml2.so.2 (0x00002b563bdc3000)
        libevent_core-2.0.so.5 => /lib64/libevent_core-2.0.so.5 (0x00002b563c12d000)
        libevent_pthreads-2.0.so.5 => /lib64/libevent_pthreads-2.0.so.5 (0x00002b563c358000)
        libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00002b563c55b000)
        libelf.so.1 => /lib64/libelf.so.1 (0x00002b563c87d000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00002b563ca95000)
        libdrm.so.2 => /lib64/libdrm.so.2 (0x00002b563cca1000)
        libdrm_amdgpu.so.1 => /lib64/libdrm_amdgpu.so.1 (0x00002b563ceb3000)
        libhsa-runtime64.so.1 => /public/software/compiler/rocm/dtk-23.10/lib/libhsa-runtime64.so.1 (0x00002b563d0bd000)
        libtinfo.so.5 => /lib64/libtinfo.so.5 (0x00002b563d4fb000)
        librocfft.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocfft.so.0 (0x00002b563d725000)
        librocsolver.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocsolver.so.0 (0x00002b563dc62000)
        librocblas.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocblas.so.0 (0x00002b564c6dc000)
        libcap.so.2 => /lib64/libcap.so.2 (0x00002b56502cc000)
        libdw.so.1 => /lib64/libdw.so.1 (0x00002b56504d1000)
        liblzma.so.5 => /lib64/liblzma.so.5 (0x00002b5650720000)
        librocfft-device-0.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocfft-device-0.so.0 (0x00002b5650946000)
        librocfft-device-1.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocfft-device-1.so.0 (0x00002b565fc9e000)
        librocfft-device-2.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocfft-device-2.so.0 (0x00002b5670d33000)
        librocfft-device-3.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocfft-device-3.so.0 (0x00002b568190c000)
        libomp.so => /public/software/compiler/rocm/dtk-23.10/llvm/lib/libomp.so (0x00002b568fc8b000)
        libattr.so.1 => /lib64/libattr.so.1 (0x00002b568ff7e000)
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00002b5690183000)
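
Since the suspicion here is a mismatched DTK environment, one quick sanity check is to confirm that every ROCm/DTK library in the `ldd` output resolves to the same DTK version. The Python sketch below does this on a few lines copied from the listing above; the helper name `dtk_versions` and the `dtk-<version>` path pattern are assumptions based on the paths shown in this issue.

```python
import re


def dtk_versions(ldd_output):
    """Map each DTK version found in resolved library paths to its libraries."""
    found = {}
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue  # e.g. the dynamic loader line has no "=>"
        name, _, rest = line.partition("=>")
        parts = rest.split()
        if not parts:
            continue
        path = parts[0]  # resolved path (or the load address for vdso)
        m = re.search(r"dtk-(\d+(?:\.\d+)*)", path)
        if m:
            found.setdefault(m.group(1), []).append(name.strip())
    return found


# A few lines copied from the ldd listing above.
sample = """
        linux-vdso.so.1 =>  (0x00002b562e8cc000)
        libm.so.6 => /lib64/libm.so.6 (0x00002b562fb01000)
        libhipfft.so => /public/software/compiler/rocm/dtk-23.10/lib/libhipfft.so (0x00002b5639b8d000)
        librocblas.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocblas.so.0 (0x00002b564c6dc000)
"""

versions = dtk_versions(sample)
print(versions)
if len(versions) > 1:
    print("Warning: libraries from more than one DTK version are loaded")
```

More than one key in the result would indicate that the executable picks up libraries from different DTK installations at run time.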

pxlxingliang commented 4 months ago

Hi @denghuilu, I tried to compile ABACUS with compiler/rocm/dtk-23.10, but I got the errors below:

/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(ct-hc2c-direct.o): relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(ct-hc2c.o): relocation R_X86_64_32S against `.text' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(direct-r2c.o): relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(direct-r2r.o): relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(direct2.o): relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(hc2hc-direct.o): relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: final link failed: Nonrepresentable section on output
clang-15: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [CMakeFiles/abacus_pw.dir/build.make:710: abacus_pw] Error 1
make[1]: *** [CMakeFiles/Makefile2:791: CMakeFiles/abacus_pw.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

denghuilu commented 4 months ago

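The error message shows that the static libfftw3.a was not compiled with -fPIC, which suggests that FFTW's shared libraries are required to compile ABACUS in this environment. Please consider recompiling FFTW with shared libraries enabled (FFTW's configure script accepts --enable-shared).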

pxlxingliang commented 4 months ago

Hi @denghuilu, I used the previous build setup and re-ran 075_NCe with the latest code. This time the DCU results are almost the same as the CPU results:

                              ABACUS v3.6.2

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: 7f84a09 (Fri Apr 26 11:07:47 2024 +0000)

 Sun Apr 28 10:08:54 2024
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : GPU / Device 66a1

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Warning: the number of valence electrons in pseudopotential > 4 for Ce: [Xe] 4f1 5d1 6s2
 Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
 If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 UNIFORM GRID DIM        : 45 * 45 * 45
 UNIFORM GRID DIM(BIG)   : 45 * 45 * 45
 DONE(0.37021    SEC) : SETUP UNITCELL
 DONE(0.548461   SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  
 1       1688            4           
 ---------------------------------------------------------
 Use plane wave basis
 ---------------------------------------------------------
 ELEMENT NATOM       XC          
 Ce      1           
 N       1           
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(0.66216    SEC) : INIT PLANEWAVE
 MEMORY FOR PSI (MB)  : 452.065
 DONE(0.68265    SEC) : LOCAL POTENTIAL
 DONE(0.912183   SEC) : NON-LOCAL POTENTIAL
 DONE(1.53563    SEC) : INIT BASIS
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic
 DONE(1.6929     SEC) : INIT SCF
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
 DA1    -1.551602e+03  0.000000e+00   1.181e+00  3.450e+01  
 DA2    -1.525382e+03  2.622072e+01   2.602e+01  2.274e+01  
 DA3    -1.561370e+03  -3.598852e+01  3.821e-01  2.117e+01  
 DA4    -1.560817e+03  5.530934e-01   1.890e-01  1.389e+01  
 DA5    -1.561091e+03  -2.739730e-01  7.426e-03  1.682e+01  
 DA6    -1.561077e+03  1.384720e-02   4.199e-03  1.685e+01  
 DA7    -1.561088e+03  -1.071650e-02  1.216e-04  1.694e+01  
 DA8    -1.561087e+03  1.108854e-03   3.370e-04  2.127e+01  
 DA9    -1.561088e+03  -7.190399e-04  2.464e-05  1.790e+01  
 DA10   -1.561088e+03  -4.449137e-06  5.798e-06  1.404e+01  
 DA11   -1.561088e+03  -1.286651e-05  1.279e-07  1.580e+01  
 DA12   -1.561088e+03  -6.877970e-07  1.056e-07  2.166e+01  
 DA13   -1.561088e+03  -5.007883e-08  2.526e-08  1.450e+01  
 DA14   -1.561088e+03  -1.981842e-08  1.054e-09  1.412e+01  
----------------------------------------------------------------
TOTAL-STRESS (KBAR)                                           
----------------------------------------------------------------
       -3.6008292097         0.0002824665         0.0000842847
        0.0002824665        -3.6008395989        -0.0003965847
        0.0000842847        -0.0003965847        -3.6009644554
----------------------------------------------------------------

pxlxingliang commented 4 months ago

I used the previous compiler environment and re-ran 075 and 084 with commit 7f84a09 (Fri Apr 26 11:07:47 2024 +0000); the results are consistent with the CPU. I also re-ran the test on the previous commit, db23a2b (Tue Apr 16 21:37:59 2024 +0800), and this time those results are consistent with the CPU as well.

It is strange that the same commit, db23a2b, gives different results on different dates.

pxlxingliang commented 4 months ago

@denghuilu Could you retest with the previous build environment? I cannot reproduce the error now.

denghuilu commented 4 months ago

I also cannot reproduce the problem.

WHUweiqingzhou commented 4 months ago

Since this issue can no longer be reproduced, we are closing it. It can be reopened if the bug occurs again.

denghuilu commented 4 months ago

Strangely, the same ABACUS executable file produced different results when run yesterday compared to last week.

pxlxingliang commented 4 months ago

I have retested these two examples with commit 9c5eb85 (Wed May 8 14:00:38 2024 +0800), and also with the Bohrium image "registry.dp.tech/dptech/abacus:v3.6.0" using "machine_type": "4 * DCU_16g". Both are consistent with the CPU results.

example               energy                  device
cpu/075_NCe           -1561.087565            cpu
dcu/075_NCe(9c5eb85)  -1561.0875651271257993  gpu
dcu/075_NCe(bohrium)  -1561.0875651263179407  gpu
cpu/084_PLa           -1061.115456            cpu
dcu/084_PLa(9c5eb85)  -1061.1154559782546585  gpu
dcu/084_PLa(bohrium)  -1061.1154559782228262  gpu

WHUweiqingzhou commented 3 months ago

This issue was caused by the machine and is not related to ABACUS.