deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

relax results on DCU device differ from CPU/GPU ones #2417

Closed flshbc closed 1 year ago

flshbc commented 1 year ago

Describe the bug

I was testing the relax examples on a DCU device and got some confusing results.

Test version: ABACUS commit f1e8856a64b35f561fcb99b3baf6c2dcb67c939a (2023/05/09)
Test example: abacus-develop/examples/relax/pw_al

The results below were reproduced more than once and appear to be relatively stable, apart from a few special cases. Full logs are provided at the end of this issue.

CPU/GPU results (usually the same):
GPU configure/compilation: cmake -B build -DUSE_CUDA=1
Run command: OMP_NUM_THREADS=1 mpirun -np 2 abacus
Cycles: 14

total pressure: 1.494e-02 KBAR
total stress:
 1.164e-01      -5.944e-05     2.553e-05
 -5.944e-05     1.164e-01      -4.223e-07
 2.553e-05      -4.223e-07     -1.880e-01

CPU special results (occurred once):
Run command: OMP_NUM_THREADS=1 mpirun -np 2 abacus
Cycles: 59

total pressure: 6.366e-03 KBAR
total stress:
 1.083e-01      1.709e-04      3.255e-04
 1.709e-04      1.076e-01      2.031e-05
 3.255e-04      2.031e-05      -1.969e-01

DCU results:
Cycles: 43
DCU configure/compilation:

CC=clang CXX=clang++ cmake -B build -DUSE_OPENMP=OFF -DENABLE_LCAO=OFF \
-DFFTW3_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/fftw-3.3.10/build \
-DLAPACK_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/OpenBLAS-0.3.21/build/lib \
-DSCALAPACK_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/scalapack-2.2.0 \
-DUSE_ROCM=ON

run command: OMP_NUM_THREADS=1 mpirun -np 4 abacus

total pressure: 1.542e-02 KBAR
total stress:
 1.171e-01      -5.991e-05     2.778e-04
 -5.991e-05     1.165e-01      1.679e-05
 2.778e-04      1.679e-05      -1.874e-01

Expected behavior

The expected behavior is to reproduce results identical or extremely close to abacus-develop/examples/relax/pw_al/log.ref, which was generated with version 3.2.0 as a reference. However, neither the CPU/GPU results nor the DCU results match it precisely. Reference values:
Cycles: 14

total pressure: 9.097e-03 KBAR
total stress: 
1.106e-01      -6.820e-05     1.811e-05      
 -6.820e-05     1.106e-01      -1.879e-07     
 1.811e-05      -1.879e-07     -1.939e-01 
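
To put numbers on how far each run is from log.ref, the tensors quoted above can be compared directly. The following is only a minimal sketch: the helper names `pressure` and `max_abs_diff` are made up for illustration, and taking the pressure as one third of the trace is an assumption that happens to agree with the printed values to within rounding.

```python
# Stress tensors (KBAR) copied from the outputs quoted in this issue.
STRESS = {
    "cpu_gpu": [[1.164e-01, -5.944e-05, 2.553e-05],
                [-5.944e-05, 1.164e-01, -4.223e-07],
                [2.553e-05, -4.223e-07, -1.880e-01]],
    "dcu": [[1.171e-01, -5.991e-05, 2.778e-04],
            [-5.991e-05, 1.165e-01, 1.679e-05],
            [2.778e-04, 1.679e-05, -1.874e-01]],
    "log_ref": [[1.106e-01, -6.820e-05, 1.811e-05],
                [-6.820e-05, 1.106e-01, -1.879e-07],
                [1.811e-05, -1.879e-07, -1.939e-01]],
}


def pressure(tensor):
    """Assumed definition: one third of the trace of the stress tensor."""
    return (tensor[0][0] + tensor[1][1] + tensor[2][2]) / 3.0


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 3x3 tensors."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


if __name__ == "__main__":
    ref = STRESS["log_ref"]
    for name, tensor in STRESS.items():
        print(f"{name:8s} pressure = {pressure(tensor):.3e} KBAR, "
              f"max |diff vs log_ref| = {max_abs_diff(tensor, ref):.3e} KBAR")
```

A comparison like this makes it easier to state a tolerance explicitly instead of judging "same or extremely similar" by eye.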

To Reproduce

git clone https://github.com/deepmodeling/abacus-develop
cd abacus-develop && git checkout f1e8856a64b35f561fcb99b3baf6c2dcb67c939a

[CPU] cmake -B build 
[GPU] cmake -B build -DUSE_CUDA=1
[DCU] CC=clang CXX=clang++ cmake -B build -DUSE_OPENMP=OFF -DENABLE_LCAO=OFF \
-DFFTW3_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/fftw-3.3.10/build \
-DLAPACK_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/OpenBLAS-0.3.21/build/lib \
-DSCALAPACK_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/scalapack-2.2.0 \
-DUSE_ROCM=ON

cd build
make

cd ../examples/relax/pw_al/
OMP_NUM_THREADS=1 mpirun -np 2 ../../../build/abacus

Environment

CPU/GPU:
OS: Ubuntu 20.04.2 LTS (GNU/Linux 5.15.0-69-generic x86_64)
Compiler: gcc 9.4.0
Dependencies: FFTW3/OpenBLAS/ScaLAPACK/ELPA/CUDA 11.7

DCU:
OS: centos-build-7.6
Compiler: clang/clang++ 14.0.0
Dependencies: FFTW3/OpenBLAS/ScaLAPACK/HIP

Additional Context

output log: log_ref.txt cpu_relax_log.txt cpu_relax_log_special.txt gpu_relax_log.txt dcu_relax_log.txt

LiuXiaohui123321 commented 1 year ago

I could reproduce both the inconsistency between the DCU results and the CPU results and the inconsistency between my own CPU results and the reference CPU results. I think there are three problems in this issue:

@dyzheng I need to know which ABACUS version was used for the reference calculation.

LiuXiaohui123321 commented 1 year ago

@Flashbac09 Regarding "CPU/GPU results(usually same)": in the CPU case, how many CPU cores did you use? Is the number of CPU cores the same as in "CPU special results:(occurred once)"?

@Flashbac09 In the DCU case, how many DCUs did you use?

flshbc commented 1 year ago

  1. I used 2 CPU cores for both the CPU version and the GPU version.
  2. In the DCU case, I used 4 DCUs. I usually set 4 and haven't tested other counts.

Thanks for your test. The summary of the problems is quite precise, but it's hard for me to find a clue. I guess the inconsistency could come from both the relax algorithm, which is still changing from commit to commit, and the libraries used specifically by the ROCm/HIP code.

LiuXiaohui123321 commented 1 year ago

@Flashbac09 Sorry for the late reply. On the problem that "the DCU results are different from the CPU result", I have some new test results. I found that the difference between DCU and CPU appears after the first ionic step, and that it lies in the positions of some atoms, as shown in the following results.

[Images: intel-nvidia, amd-dcu]

For the y position of the first atom, three platforms (Intel CPU, Intel CPU + Nvidia GPU, and AMD CPU) gave the same result, 0.9999999... , whereas the AMD CPU + DCU platform gave a different one, 3.07402677216e-08.

@dyzheng @mohanchen Further analysis showed that this happens when atoms at the boundary of the lattice vector are processed: the DCU platform and the CPU platform pick up different numerical errors there, but the final results they obtain are basically the same.
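
One way to see why the two y values above can describe nearly the same position is to compare them modulo one lattice period. The snippet below is a rough sketch rather than ABACUS code: `periodic_diff` is a hypothetical helper, the coordinates are assumed to be fractional (direct) coordinates, and the CPU value uses only the digits shown above because the remaining digits are not given.

```python
def periodic_diff(a, b):
    """Smallest signed difference a - b between two fractional coordinates
    under periodic boundary conditions; the result lies in [-0.5, 0.5)."""
    d = (a - b) % 1.0
    return d - 1.0 if d >= 0.5 else d


# y coordinate of the first atom as quoted in this thread.
y_cpu = 0.9999999           # truncated in the comment above ("0.9999999...")
y_dcu = 3.07402677216e-08

if __name__ == "__main__":
    print("naive difference:   ", y_dcu - y_cpu)
    print("periodic difference:", periodic_diff(y_dcu, y_cpu))
```

With these inputs the naive difference is close to -1, while the periodic difference is tiny, which matches the observation that only the side of the cell boundary differs.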

I have some more detailed analysis of this problem, and I will continue to update them later.

LiuXiaohui123321 commented 1 year ago

I think I've found the cause of the DCU problem. Some of the atomic forces calculated on the DCU differ from those calculated on the CPU, especially in direction.

[Images: force-intel-nvidia, force-amd-dcu]
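
A difference in force direction like the one described above can be quantified as the angle between the two force vectors on the same atom. This is a minimal sketch only: `angle_deg` is a hypothetical helper, and the two vectors are made-up placeholders, not the values from the screenshots.

```python
import math


def angle_deg(f1, f2):
    """Angle in degrees between two 3-component force vectors.
    Returns NaN if either vector has zero length."""
    dot = sum(a * b for a, b in zip(f1, f2))
    n1 = math.sqrt(sum(a * a for a in f1))
    n2 = math.sqrt(sum(b * b for b in f2))
    if n1 == 0.0 or n2 == 0.0:
        return float("nan")
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cos = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos))


# Hypothetical example: two very small forces of equal size that point in
# opposite directions. These numbers are NOT taken from the issue.
f_cpu = (1.0e-06, -2.0e-06, 5.0e-07)
f_dcu = (-1.0e-06, 2.0e-06, -5.0e-07)

if __name__ == "__main__":
    print("angle between forces (deg):", angle_deg(f_cpu, f_dcu))
```

When the forces themselves are close to zero, a small absolute error can flip the direction completely, so it helps to report the magnitude alongside the angle.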

As a result, the atomic movements also differ (especially in direction). For "move", see: https://github.com/deepmodeling/abacus-develop/blob/8fface85c627589b3b1fe02dfffb439b878a91f1/source/module_relax/relax_old/ions_move_bfgs.cpp#L306

[Images: move-intel-cpu, move-intel-nvidia, move-amd-cpu, move-amd-dcu]

Finally, the atomic positions differ (taking periodic boundary conditions into account). For "pos", see: https://github.com/deepmodeling/abacus-develop/blob/8fface85c627589b3b1fe02dfffb439b878a91f1/source/module_relax/relax_old/ions_move_bfgs.cpp#L112

[Images: position-intel-cpu, position-intel-nvidia, position-amd-cpu, position-amd-dcu]

flshbc commented 1 year ago

@LiuXiaohui123321 Thanks a lot. That is an awesome and professional trace-back of the problem. In May, I thought I could never dig this out. The test results show that the numerical error on the DCU platform appears to be larger, which inevitably leads to partially different atomic movements and positions, and thus to different final TOTAL-STRESS and TOTAL-PRESSURE values.

From my inexperienced point of view, it seems that in a relaxation calculation the device-dependent numerical error is large enough to affect the validity of the result. This is unlike a single SCF calculation, where the accumulated numerical error does not make a severe difference and the results are, in a sense, more reliable.

hongriTianqi commented 1 year ago
hongriTianqi commented 11 months ago

Reproducing and resolving the issue in this much detail is good practice.