I have conducted a comparative analysis of the computational efficiency of solving generalized eigenvalue problems with ELPA, ScaLAPACK, and LAPACK for matrices of varying dimensions. The results indicate that for small matrices, specifically those with a dimension below about 100, LAPACK is the most efficient. As the matrix size increases, however, ELPA and ScaLAPACK become increasingly advantageous.
Furthermore, the block size is a significant factor affecting efficiency. For ELPA, optimal performance is achieved with block sizes of either 16 or 32.
The tables below show the speedup of ELPA and ScaLAPACK relative to LAPACK on matrices of varying dimensions, for different numbers of cores and block sizes. Each row corresponds to the number of parallel cores and each column to the block size; the two values in each cell are the speedups of ELPA/ScaLAPACK relative to LAPACK. For each case, 10 random H/S matrices are generated and each is solved 10 times.
The test code: https://github.com/deepmodeling/abacus-develop/pull/5549/files#diff-4cfdb3bd4f00aee2894decd88dc0691e059bb6810b1888b32a8bd3c6e48b78f2R326
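For reference, here is a minimal sketch of how such a benchmark can be set up (an illustrative reconstruction, not the PR's actual test code; the helpers `random_hs` and `time_lapack_solve` are hypothetical, and LAPACKE's `dsygvd` stands in for the single-core LAPACK baseline):

```cpp
#include <chrono>
#include <random>
#include <vector>
#include <lapacke.h>

// Build a random symmetric H and a random SPD S of dimension n.
void random_hs(int n, std::vector<double>& H, std::vector<double>& S, std::mt19937& rng)
{
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<double> A(static_cast<size_t>(n) * n);
    for (auto& x : A) x = dist(rng);
    H.assign(A.size(), 0.0);
    S.assign(A.size(), 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            H[i * n + j] = 0.5 * (A[i * n + j] + A[j * n + i]); // symmetrize
            double s = 0.0;                                     // S = A^T A + n*I is SPD
            for (int k = 0; k < n; ++k) s += A[k * n + i] * A[k * n + j];
            S[i * n + j] = s + (i == j ? n : 0.0);
        }
}

// Average LAPACK generalized-eigensolve time (ms): nmat random matrices, nrep solves each.
double time_lapack_solve(int n, int nmat = 10, int nrep = 10)
{
    std::mt19937 rng(42);
    std::vector<double> H, S, w(n);
    double total_ms = 0.0;
    for (int m = 0; m < nmat; ++m) {
        random_hs(n, H, S, rng);
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < nrep; ++r) {
            std::vector<double> a = H, b = S;  // dsygvd overwrites its inputs
            LAPACKE_dsygvd(LAPACK_ROW_MAJOR, 1, 'V', 'U', n,
                           a.data(), n, b.data(), n, w.data());
        }
        auto t1 = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return total_ms / (nmat * nrep);
}
```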
**ndim=64, nband=50**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.56/0.17 | 0.91/0.39 | 1.00/0.58 | 0.95/0.60 | 1.00/0.71 | -- | -- |
| 8 | 0.53/0.15 | 0.82/0.35 | 0.94/0.55 | -- | -- | -- | -- |
| 16 | 0.54/0.15 | 0.77/0.33 | 0.85/0.51 | -- | -- | -- | -- |
**ndim=100, nband=50**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.52/0.14 | 0.89/0.37 | 1.00/0.55 | 0.97/0.57 | 0.92/0.60 | 0.91/0.71 | -- |
| 8 | 0.52/0.13 | 0.83/0.34 | 0.94/0.54 | 0.93/0.58 | -- | -- | -- |
| 16 | 0.53/0.12 | 0.82/0.33 | 0.89/0.48 | 0.87/0.53 | -- | -- | -- |
**ndim=100, nband=80**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.71/0.19 | 1.21/0.49 | 1.36/0.72 | 1.35/0.78 | 1.27/0.80 | 1.34/0.94 | -- |
| 8 | 0.73/0.18 | 1.17/0.47 | 1.29/0.74 | 1.31/0.79 | 1.32/0.81 | -- | -- |
| 16 | 0.74/0.17 | 1.12/0.45 | 1.18/0.67 | 1.22/0.74 | 1.23/0.77 | -- | -- |
**ndim=200, nband=50**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.43/0.09 | 0.92/0.35 | 1.02/0.58 | 1.08/0.60 | 1.00/0.64 | 0.95/0.68 | 0.81/0.68 |
| 8 | 0.54/0.10 | 1.01/0.34 | 1.08/0.56 | 1.10/0.61 | 1.12/0.65 | 0.95/0.69 | -- |
| 16 | 0.62/0.10 | 1.03/0.33 | 1.15/0.55 | 1.13/0.55 | 1.08/0.57 | 0.99/0.68 | -- |
**ndim=200, nband=100**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.66/0.13 | 1.31/0.46 | 1.46/0.75 | 1.48/0.77 | 1.42/0.80 | 1.41/0.86 | 1.14/0.85 |
| 8 | 0.82/0.11 | 1.41/0.36 | 1.62/0.60 | 1.54/0.65 | 1.54/0.66 | 1.34/0.69 | -- |
| 16 | 0.89/0.14 | 1.50/0.47 | 1.67/0.73 | 1.48/0.73 | 1.52/0.79 | 1.34/0.88 | -- |
**ndim=200, nband=160**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.92/0.13 | 1.77/0.43 | 1.99/0.68 | 2.08/0.71 | 1.93/0.73 | 1.85/0.80 | 1.54/0.76 |
| 8 | 1.17/0.23 | 2.06/0.71 | 2.34/1.16 | 2.26/1.23 | 2.17/1.26 | 2.01/1.34 | -- |
| 16 | 1.27/0.22 | 2.08/0.70 | 2.27/1.06 | 2.20/1.12 | 2.02/1.13 | 1.94/1.28 | -- |
**ndim=300, nband=240**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 1.16/0.25 | 2.22/0.88 | 2.47/1.39 | 2.37/1.43 | 2.46/1.47 | 2.24/1.51 | 2.12/1.55 |
| 8 | 1.60/0.29 | 2.82/0.97 | 3.20/1.57 | 3.15/1.65 | 3.17/1.73 | 2.64/1.76 | 2.43/1.78 |
| 16 | 1.84/0.16 | 3.14/0.56 | 3.46/0.87 | 3.31/0.89 | 3.28/0.94 | 2.60/0.97 | 2.55/1.05 |
**ndim=400, nband=320**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 1.39/0.29 | 2.63/1.08 | 2.94/1.73 | 2.88/1.76 | 2.93/1.81 | 2.71/1.81 | 2.40/1.78 |
| 8 | 2.12/0.25 | 3.55/0.82 | 4.00/1.33 | 3.97/1.38 | 3.96/1.46 | 3.60/1.46 | 3.01/1.42 |
| 16 | 2.52/0.18 | 4.51/0.68 | 4.97/1.11 | 4.84/1.14 | 4.67/1.16 | 4.12/1.22 | 3.28/1.21 |
**ndim=500, nband=400**

| cores \ block size | 16 | 20 | 32 | 50 | 64 | 128 |
|---|---|---|---|---|---|---|
| 4 | 3.36/1.96 | 3.28/2.02 | 3.39/2.07 | 3.14/2.06 | 2.96/2.03 | 2.26/1.95 |
| 8 | 4.89/1.32 | 4.71/1.34 | 4.80/1.36 | 4.34/1.38 | 4.15/1.39 | 2.72/1.22 |
| 16 | 6.18/1.39 | 5.87/1.40 | 6.07/1.45 | 5.09/1.48 | 4.66/1.55 | 2.89/1.54 |
Below are the absolute times of ELPA/ScaLAPACK on large matrices, for different core counts and block sizes. Each cell gives the ELPA time / ScaLAPACK time in ms (rows: cores, columns: block size).
**ndim=600, nband=500**

| cores \ block size | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| 4 | 357.33/595.67 | 348.67/570.67 | 397.00/579.33 | 482.00/580.33 |
| 8 | 251.67/854.00 | 239.67/805.33 | 274.00/804.33 | 394.33/868.33 |
| 16 | 193.00/818.33 | 181.67/775.33 | 233.67/738.67 | 367.33/641.67 |
**ndim=800, nband=600**

| cores \ block size | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| 4 | 667.00/1188.67 | 651.33/1125.67 | 731.00/1148.00 | 962.33/1216.00 |
| 8 | 447.33/1556.67 | 436.33/1481.33 | 516.00/1516.00 | 714.33/1639.67 |
| 16 | 337.67/1461.33 | 325.33/1394.00 | 394.33/1374.00 | 642.00/1404.00 |
**ndim=1000, nband=800**

| cores \ block size | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| 4 | 1150.00/2295.33 | 1163.67/2240.00 | 1286.67/2278.33 | 1607.33/2336.33 |
| 8 | 770.33/2767.00 | 763.67/2686.00 | 857.67/2779.67 | 1098.00/2957.33 |
| 16 | 544.67/2559.00 | 542.33/2474.00 | 612.00/2428.00 | 928.33/2387.33 |
**ndim=1200, nband=1000**

| cores \ block size | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| 4 | 1878.33/3905.33 | 1853.00/3731.33 | 2052.33/3772.33 | 2542.33/3962.00 |
| 8 | 1203.00/4494.00 | 1171.33/4352.33 | 1296.00/4452.33 | 1625.67/4744.67 |
| 16 | 831.67/4086.67 | 818.67/3938.67 | 923.67/3925.00 | 923.67/3925.00 |
Background
Currently, the subspace diagonalization in the Davidson (dav) solver is performed by LAPACK on a single core. For large systems the dimension of this subspace can reach several hundred, so it can be effectively accelerated by parallelization.
QE provides the same functionality, enabled by setting the value of `-nd` on the command line: https://www.quantum-espresso.org/Doc/user_guide/node20.html
Describe the solution you'd like
I will implement a function that distributes the H and S matrices into 2D blocks and then calls ELPA or ScaLAPACK to perform the diagonalization in parallel; a rough sketch of the idea is given below.
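A minimal sketch of what such a function could look like with ScaLAPACK's `pdsygvx` (an illustrative outline under my own assumptions, not the planned ABACUS implementation; `pdiag_hs` and its arguments are hypothetical, and the scattering of H/S into block-cyclic layout is elided):

```cpp
#include <algorithm>
#include <vector>

extern "C" {
    // BLACS / ScaLAPACK entry points (Fortran symbols).
    void Cblacs_get(int ctxt, int what, int* val);
    void Cblacs_gridinit(int* ctxt, const char* layout, int nprow, int npcol);
    void Cblacs_gridinfo(int ctxt, int* nprow, int* npcol, int* myrow, int* mycol);
    void Cblacs_gridexit(int ctxt);
    int  numroc_(const int* n, const int* nb, const int* iproc,
                 const int* isrcproc, const int* nprocs);
    void descinit_(int* desc, const int* m, const int* n, const int* mb, const int* nb,
                   const int* irsrc, const int* icsrc, const int* ictxt,
                   const int* lld, int* info);
    void pdsygvx_(const int* ibtype, const char* jobz, const char* range, const char* uplo,
                  const int* n, double* a, const int* ia, const int* ja, const int* desca,
                  double* b, const int* ib, const int* jb, const int* descb,
                  const double* vl, const double* vu, const int* il, const int* iu,
                  const double* abstol, int* m, int* nz, double* w, const double* orfac,
                  double* z, const int* iz, const int* jz, const int* descz,
                  double* work, const int* lwork, int* iwork, const int* liwork,
                  int* ifail, int* iclustr, double* gap, int* info);
}

// Hypothetical: solve H c = e S c for the lowest nband eigenpairs on an
// nprow x npcol process grid with block size nb (e.g. 16 or 32).
void pdiag_hs(int ndim, int nband, int nb, int nprow, int npcol)
{
    int ctxt, myrow, mycol;
    Cblacs_get(0, 0, &ctxt);                 // default system context
    Cblacs_gridinit(&ctxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);
    if (myrow < 0) return;                   // this rank is outside the grid

    // Local dimensions of the block-cyclic distributed ndim x ndim matrices.
    const int izero = 0, ione = 1;
    int mloc = numroc_(&ndim, &nb, &myrow, &izero, &nprow);
    int nloc = numroc_(&ndim, &nb, &mycol, &izero, &npcol);
    int desc[9], info, lld = std::max(1, mloc);
    descinit_(desc, &ndim, &ndim, &nb, &nb, &izero, &izero, &ctxt, &lld, &info);

    std::vector<double> locH(static_cast<size_t>(mloc) * nloc), locS(locH.size());
    // ... scatter the global H and S into locH/locS here ...

    std::vector<double> w(ndim), z(locH.size());
    std::vector<int> ifail(ndim), iclustr(2 * nprow * npcol);
    std::vector<double> gap(static_cast<size_t>(nprow) * npcol);
    const int ibtype = 1, il = 1, iu = nband;          // eigenpairs 1..nband
    const double vl = 0.0, vu = 0.0, abstol = 0.0, orfac = -1.0;
    int m, nz, lwork = -1, liwork = -1, iwk;
    double wk;

    // Workspace query first (lwork = liwork = -1), then the actual solve.
    pdsygvx_(&ibtype, "V", "I", "U", &ndim, locH.data(), &ione, &ione, desc,
             locS.data(), &ione, &ione, desc, &vl, &vu, &il, &iu, &abstol,
             &m, &nz, w.data(), &orfac, z.data(), &ione, &ione, desc,
             &wk, &lwork, &iwk, &liwork,
             ifail.data(), iclustr.data(), gap.data(), &info);
    lwork = static_cast<int>(wk);
    liwork = iwk;
    std::vector<double> work(lwork);
    std::vector<int> iwork(liwork);
    pdsygvx_(&ibtype, "V", "I", "U", &ndim, locH.data(), &ione, &ione, desc,
             locS.data(), &ione, &ione, desc, &vl, &vu, &il, &iu, &abstol,
             &m, &nz, w.data(), &orfac, z.data(), &ione, &ione, desc,
             work.data(), &lwork, iwork.data(), &liwork,
             ifail.data(), iclustr.data(), gap.data(), &info);
    // Eigenvalues are now in w[0..m-1]; eigenvectors are distributed in z.

    Cblacs_gridexit(ctxt);
}
```

Since ELPA accepts matrices in the same ScaLAPACK block-cyclic layout, a single 2D-distribution function should be able to serve both back ends.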