I could reproduce the inconsistency between the DCU results and the CPU results, and also the inconsistency between my own CPU results and the reference CPU results. I think there are three problems involved in this issue:
@dyzheng I need to know which ABACUS version was used in the reference calculation.
@Flashbac09 "CPU/GPU results(usually same)": in case of CPU, how many CPU cores did you use? Is the number of CPU cores same with that in "CPU special results:(occurred once)"?
CPU/GPU results (usually the same):
GPU configure/compilation:
cmake -B build -DUSE_CUDA=1
run command:
OMP_NUM_THREADS=1 mpirun -np 2 abacus
cycles: 14
total pressure: 1.494e-02 KBAR
total stress:
 1.164e-01  -5.944e-05   2.553e-05
-5.944e-05   1.164e-01  -4.223e-07
 2.553e-05  -4.223e-07  -1.880e-01
@Flashbac09 In the DCU case, how many DCUs did you use?
DCU results:
cycles: 43
DCU configure/compilation:
CC=clang CXX=clang++ cmake -B build -DUSE_OPENMP=OFF -DENABLE_LCAO=OFF \
  -DFFTW3_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/fftw-3.3.10/build \
  -DLAPACK_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/OpenBLAS-0.3.21/build/lib \
  -DSCALAPACK_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/scalapack-2.2.0 \
  -DUSE_ROCM=ON
run command:
OMP_NUM_THREADS=1 mpirun -np 4 abacus
total pressure: 1.542e-02 KBAR
total stress:
 1.171e-01  -5.991e-05   2.778e-04
-5.991e-05   1.165e-01   1.679e-05
 2.778e-04   1.679e-05  -1.874e-01
Thanks for your test. Your summary of the problems is quite precise, but it's hard for me to find the cause. I guess the inconsistency can come both from the relaxation algorithm, which is still being developed commit by commit, and from the libraries used specifically by the ROCm/HIP code.
- I used 2 CPU cores for both the CPU version and the GPU version.
- For DCU, I used 4 DCUs. I have always set 4 and haven't tested other counts.
@Flashbac09 Sorry for the late reply. For the problem of "the DCU results are different from the CPU results", there are some new test results. I found that the difference between DCU and CPU appeared after the first ionic step, and that the difference was in the positions of some atoms, as shown in the following results.
For the y position of the first atom, the three platforms Intel-CPU, Intel-CPU + Nvidia-GPU, and AMD-CPU gave the same result, 0.9999999... . But the AMD-CPU + DCU platform gave a different result, 3.07402677216e-08.
@dyzheng @mohanchen Further analysis found that this is because, when processing atoms at the boundary of the lattice vector, the DCU platform and the CPU platform accumulate slightly different numerical errors, but the final results they obtain are basically the same: 0.9999999... and 3.07e-08 describe the same position under periodic boundary conditions.
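To make the boundary effect concrete, here is a standalone toy snippet (not ABACUS code; the values are invented to mirror the numbers above). A y coordinate that one platform computes as slightly below the cell boundary and the other computes as slightly beyond it gets wrapped back into [0, 1) as either 0.9999999... or ~3e-08, i.e. the same physical position:

```cpp
#include <cmath>
#include <cstdio>

// Wrap a fractional (direct) coordinate back into [0, 1), as any
// periodic-boundary handling must do in some form.
double wrap01(double x) { return x - std::floor(x); }

int main() {
    // Hypothetical values: the two platforms land on opposite sides of the
    // cell boundary because of tiny rounding differences.
    const double y_cpu = 1.0 - 3.0e-8;            // just below the boundary
    const double y_dcu = 1.0 + 3.07402677216e-8;  // just past the boundary

    std::printf("CPU wrapped y: %.11e\n", wrap01(y_cpu));  // ~9.99999970e-01
    std::printf("DCU wrapped y: %.11e\n", wrap01(y_dcu));  // ~3.07402677e-08
    return 0;
}
```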
I have some more detailed analysis of this problem, and I will continue to update it here later.
I think I've found the cause of the DCU problem. Some of the atomic forces calculated on DCU differ from those calculated on CPU, especially in the direction of the force.
Consequently, the atomic movements also differ (especially in direction). About "move": https://github.com/deepmodeling/abacus-develop/blob/8fface85c627589b3b1fe02dfffb439b878a91f1/source/module_relax/relax_old/ions_move_bfgs.cpp#L306
Finally, the atomic positions differ (taking periodic boundary conditions into account). About "pos": https://github.com/deepmodeling/abacus-develop/blob/8fface85c627589b3b1fe02dfffb439b878a91f1/source/module_relax/relax_old/ions_move_bfgs.cpp#L112
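To sketch how this chain ends in differently reported positions, here is a minimal toy illustration. It assumes a plain gradient-style step rather than the actual BFGS update in ions_move_bfgs.cpp, and the step length and force values are invented:

```cpp
#include <cmath>
#include <cstdio>

// Wrap a direct coordinate back into [0, 1), mimicking periodic boundaries.
double wrap01(double x) { return x - std::floor(x); }

int main() {
    const double pos0 = 0.99999990;  // same starting coordinate on both devices
    const double step = 0.5;         // hypothetical step length

    // Hypothetical forces that differ only at the 1e-7 level; near zero this
    // can even flip the sign, i.e. the direction of the move.
    const double force_cpu = -1.0e-7;
    const double force_dcu = +3.0e-7;

    // "move" ~ step * force, then "pos" is wrapped back into the cell.
    const double pos_cpu = wrap01(pos0 + step * force_cpu);
    const double pos_dcu = wrap01(pos0 + step * force_dcu);

    std::printf("CPU pos: %.11e\n", pos_cpu);  // stays just below 1.0
    std::printf("DCU pos: %.11e\n", pos_dcu);  // wraps around to ~5e-08
    return 0;
}
```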
@LiuXiaohui123321 Thanks a lot, that is an impressive and professional trace-back of the problem. Back in May I thought I would never dig this out. The test results show that the numerical error on the DCU platform appears to be larger, which inevitably leads to partially different atomic movements and positions, and thus to different final TOTAL-STRESS and TOTAL-PRESSURE values.
From my (perhaps naive) point of view, it seems that in a relaxation calculation the device-dependent numerical error accumulates across ionic steps and affects the validity of the result. This is unlike a single SCF calculation, where the accumulated numerical error does not make a severe difference and the results are, in that sense, more trustworthy.
Reproducing the issue and then solving it in this level of detail is good practice.
Describe the bug
I was trying to run the relax examples on a DCU device and got some confusing results.
test version: ABACUS commit f1e8856a64b35f561fcb99b3baf6c2dcb67c939a (2023/05/09)
test example: abacus-develop/examples/relax/pw_al
The results below were obtained more than once and appear to be relatively stable, although there are a few special cases. Full logs are provided at the end of this issue.
CPU/GPU results (usually the same):
GPU configure/compilation:
cmake -B build -DUSE_CUDA=1
run command:
OMP_NUM_THREADS=1 mpirun -np 2 abacus
cycles: 14

CPU special results (occurred once):
run command:
OMP_NUM_THREADS=1 mpirun -np 2 abacus
cycles: 59
DCU results:
cycles: 43
DCU configure/compilation:
CC=clang CXX=clang++ cmake -B build -DUSE_OPENMP=OFF -DENABLE_LCAO=OFF \
  -DFFTW3_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/fftw-3.3.10/build \
  -DLAPACK_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/OpenBLAS-0.3.21/build/lib \
  -DSCALAPACK_DIR=/public/home/aisi/users/denghui/abacus/soft/jzw_link/scalapack-2.2.0 \
  -DUSE_ROCM=ON
run command:
OMP_NUM_THREADS=1 mpirun -np 4 abacus
Expected behavior
The expected behavior is to reproduce results the same as, or extremely similar to,
abacus-develop/examples/relax/pw_al/log.ref
which was generated with version 3.2.0 for reference (cycles: 14). But neither the CPU/GPU results nor the DCU results precisely match it.
To Reproduce
Environment
CPU/GPU:
OS: Ubuntu 20.04.2 LTS (GNU/Linux 5.15.0-69-generic x86_64)
compiler: gcc 9.4.0
dependencies: FFTW3/OpenBLAS/scaLAPACK/ELPA/CUDA 11.7
DCU:
OS: centos-build-7.6
compiler: clang/clang++ 14.0.0
dependencies: FFTW3/OpenBLAS/scaLAPACK/HIP
Additional Context
output logs: log_ref.txt, cpu_relax_log.txt, cpu_relax_log_special.txt, gpu_relax_log.txt, dcu_relax_log.txt