deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

ABACUS HSE is much slower than QE #4995

Open iduygnay opened 2 months ago

iduygnay commented 2 months ago

Details

With the same structure and the same accuracy settings in the input, this is the time QE took: [image]

This is ABACUS with exx_separate_loop = 0; if I comment out this line, the calculation is interrupted because it runs out of memory (OOM): [image]

The ABACUS version is 3.7.0. This is the run script: [image]

QuantumMisaka commented 2 months ago

@iduygnay Can you try exx_separate_loop=1?

@PeizeLin Any comments?

iduygnay commented 2 months ago

With exx_separate_loop=1 the run goes OOM, and it is also slower than QE.

QuantumMisaka commented 2 months ago

@iduygnay I notice that you're using ABACUS 3.7.0. Can you update to the newest ABACUS with LibRI 0.2.0?

PeizeLin commented 1 month ago

@iduygnay LibRI-v0.2.0 is many times more efficient than LibRI-v0.1.1, and ABACUS-v3.7.4 is the first version to officially support LibRI-v0.2.0, so updating both codes is strongly recommended.

iduygnay commented 1 month ago

LibRI-0.2.0 is indeed more efficient than 0.1.1, but ABACUS is still slower than QE (the same calculation completed in 7 h with QE). [image]

QuantumMisaka commented 1 month ago

@iduygnay What are the dependencies of your ABACUS and QE builds? And what parallel settings (number of MPI processes and OpenMP threads) do you use for HSE in ABACUS and QE?

Also, what's your ABACUS INPUT? (I've no idea about QE orz)

iduygnay commented 1 month ago

ABACUS run script: [image]
ABACUS INPUT (I also tried exx_separate_loop=1, but it did not converge and had almost the same runtime as the case without that line): [image]
QE input: [image]
QE run script: [image]

QuantumMisaka commented 1 month ago

@iduygnay

@PeizeLin Any other suggestions?

PeizeLin commented 1 month ago

Multi-threading has a significant impact on the memory use and speed of EXX. Try a single MPI process with many OpenMP threads:

export OMP_NUM_THREADS=64
mpirun -np 1 abacus

Slightly reduce the accuracy:

exx_dm_threshold  1E-3
exx_ccp_rmesh_times  1

Using real (double) instead of complex numbers in EXX will significantly improve the speed, but I'm not sure whether it will cause symmetry errors for your STRU. You should compare the results after the calculation.

exx_real_number  1
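
The three suggestions above can be collected into one small helper, which makes it easier to try them one at a time and compare results. This is only a sketch under stated assumptions: the function names `build_launch` and `apply_overrides` are hypothetical, the INPUT body is assumed to be plain `keyword value` lines, and nothing here is part of ABACUS itself. The keywords and values are the ones quoted in the comment.

```python
import os

def build_launch(omp_threads, mpi_ranks, exe="abacus"):
    """Return (env, cmd) for a hybrid MPI/OpenMP run; nothing is executed here."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(omp_threads)
    cmd = ["mpirun", "-np", str(mpi_ranks), exe]
    return env, cmd

# Keywords and values exactly as suggested in the comment above.
EXX_OVERRIDES = {
    "exx_dm_threshold": "1E-3",
    "exx_ccp_rmesh_times": "1",
    "exx_real_number": "1",
}

def apply_overrides(input_text, overrides):
    """Replace existing keyword lines, or append missing ones, in an INPUT body.

    Assumption: each setting is one 'keyword value' line.
    """
    remaining = dict(overrides)
    out = []
    for line in input_text.splitlines():
        fields = line.split()
        key = fields[0] if fields else ""
        if key in remaining:
            out.append(f"{key}  {remaining.pop(key)}")
        else:
            out.append(line)
    # Keywords that were not present in the file are appended at the end.
    out.extend(f"{k}  {v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"

if __name__ == "__main__":
    env, cmd = build_launch(omp_threads=64, mpi_ranks=1)
    print("OMP_NUM_THREADS=" + env["OMP_NUM_THREADS"], " ".join(cmd))
    print(apply_overrides("INPUT_PARAMETERS\nexx_real_number  0\n", EXX_OVERRIDES), end="")
```

Since `exx_real_number 1` may affect the result, the total energies of runs with and without that override should be compared before trusting the faster setting.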

iduygnay commented 1 week ago

I tried export OMP_NUM_THREADS=64 with mpirun -np 1 abacus. This speeds things up: the runtime is now about 24 hours, versus 6.5 hours for QE. [image] I will try the other parameters soon.

iduygnay commented 1 week ago

I tried exx_ccp_rmesh_times 1 and exx_real_number 1. Now it takes 20 hours.
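
To put the reported wall times side by side, the remaining gap can be expressed as a ratio to the QE time. The snippet below is a minimal sketch that uses only the numbers quoted in this thread (24 h and 20 h for ABACUS, 6.5 h for QE); the function name `slowdown` is hypothetical.

```python
def slowdown(abacus_hours, qe_hours):
    """Ratio of ABACUS wall time to QE wall time for the same calculation."""
    return abacus_hours / qe_hours

QE_HOURS = 6.5  # QE wall time as reported in the thread

# ABACUS wall times as reported in the thread, keyed by the settings tried.
runs = {
    "OMP_NUM_THREADS=64, mpirun -np 1": 24.0,
    "exx_ccp_rmesh_times 1, exx_real_number 1": 20.0,
}

for label, hours in runs.items():
    print(f"{label}: {hours:.0f} h, {slowdown(hours, QE_HOURS):.1f}x the QE time")
```

With these figures the gap narrows from roughly 3.7 times to roughly 3.1 times the QE runtime.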