I have conducted a comparative analysis of the computational efficiency of solving generalized eigenvalue problems with ELPA, ScaLAPACK, and LAPACK for matrices of varying dimensions. The results indicate that for small matrices, specifically those with a dimension below about 100, LAPACK is the most efficient. As the matrix size increases, however, ELPA and ScaLAPACK become increasingly advantageous.
Furthermore, the block size is a significant factor affecting efficiency. For ELPA, optimal performance is achieved with block sizes of either 16 or 32.
The tables below show the speedup of ELPA and ScaLAPACK relative to LAPACK on matrices of varying dimensions, for different numbers of cores and block sizes. Each row corresponds to the number of parallel cores and each column to the block size; the two values in each cell are the speedups of ELPA/ScaLAPACK relative to LAPACK. For each case, 10 random H/S matrices are generated and each is solved 10 times.
The test code: https://github.com/deepmodeling/abacus-develop/pull/5549/files#diff-4cfdb3bd4f00aee2894decd88dc0691e059bb6810b1888b32a8bd3c6e48b78f2R326
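For reference, here is a minimal sketch of how such a benchmark can be set up (an illustrative reconstruction, not the PR's actual test code; the helpers `random_hs` and `time_lapack_solve` are hypothetical, and LAPACKE's `dsygvd` stands in for the single-core LAPACK baseline):

```cpp
#include <chrono>
#include <random>
#include <vector>
#include <lapacke.h>

// Build a random symmetric H and a random SPD S of dimension n.
void random_hs(int n, std::vector<double>& H, std::vector<double>& S, std::mt19937& rng)
{
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<double> A(static_cast<size_t>(n) * n);
    for (auto& x : A) x = dist(rng);
    H.assign(A.size(), 0.0);
    S.assign(A.size(), 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            H[i * n + j] = 0.5 * (A[i * n + j] + A[j * n + i]); // symmetrize
            double s = 0.0;                                     // S = A^T A + n*I is SPD
            for (int k = 0; k < n; ++k) s += A[k * n + i] * A[k * n + j];
            S[i * n + j] = s + (i == j ? n : 0.0);
        }
}

// Average LAPACK generalized-eigensolve time (ms): nmat random matrices, nrep solves each.
double time_lapack_solve(int n, int nmat = 10, int nrep = 10)
{
    std::mt19937 rng(42);
    std::vector<double> H, S, w(n);
    double total_ms = 0.0;
    for (int m = 0; m < nmat; ++m) {
        random_hs(n, H, S, rng);
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < nrep; ++r) {
            std::vector<double> a = H, b = S;  // dsygvd overwrites its inputs
            LAPACKE_dsygvd(LAPACK_ROW_MAJOR, 1, 'V', 'U', n,
                           a.data(), n, b.data(), n, w.data());
        }
        auto t1 = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return total_ms / (nmat * nrep);
}
```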
**ndim=64, nband=50**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.56/0.17 | 0.91/0.39 | 1.00/0.58 | 0.95/0.60 | 1.00/0.71 | -- | -- |
| 8 | 0.53/0.15 | 0.82/0.35 | 0.94/0.55 | -- | -- | -- | -- |
| 16 | 0.54/0.15 | 0.77/0.33 | 0.85/0.51 | -- | -- | -- | -- |
**ndim=100, nband=50**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.52/0.14 | 0.89/0.37 | 1.00/0.55 | 0.97/0.57 | 0.92/0.60 | 0.91/0.71 | -- |
| 8 | 0.52/0.13 | 0.83/0.34 | 0.94/0.54 | 0.93/0.58 | -- | -- | -- |
| 16 | 0.53/0.12 | 0.82/0.33 | 0.89/0.48 | 0.87/0.53 | -- | -- | -- |
**ndim=100, nband=80**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.71/0.19 | 1.21/0.49 | 1.36/0.72 | 1.35/0.78 | 1.27/0.80 | 1.34/0.94 | -- |
| 8 | 0.73/0.18 | 1.17/0.47 | 1.29/0.74 | 1.31/0.79 | 1.32/0.81 | -- | -- |
| 16 | 0.74/0.17 | 1.12/0.45 | 1.18/0.67 | 1.22/0.74 | 1.23/0.77 | -- | -- |
**ndim=200, nband=50**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.43/0.09 | 0.92/0.35 | 1.02/0.58 | 1.08/0.60 | 1.00/0.64 | 0.95/0.68 | 0.81/0.68 |
| 8 | 0.54/0.10 | 1.01/0.34 | 1.08/0.56 | 1.10/0.61 | 1.12/0.65 | 0.95/0.69 | -- |
| 16 | 0.62/0.10 | 1.03/0.33 | 1.15/0.55 | 1.13/0.55 | 1.08/0.57 | 0.99/0.68 | -- |
**ndim=200, nband=100**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.66/0.13 | 1.31/0.46 | 1.46/0.75 | 1.48/0.77 | 1.42/0.80 | 1.41/0.86 | 1.14/0.85 |
| 8 | 0.82/0.11 | 1.41/0.36 | 1.62/0.60 | 1.54/0.65 | 1.54/0.66 | 1.34/0.69 | -- |
| 16 | 0.89/0.14 | 1.50/0.47 | 1.67/0.73 | 1.48/0.73 | 1.52/0.79 | 1.34/0.88 | -- |
**ndim=200, nband=160**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 0.92/0.13 | 1.77/0.43 | 1.99/0.68 | 2.08/0.71 | 1.93/0.73 | 1.85/0.80 | 1.54/0.76 |
| 8 | 1.17/0.23 | 2.06/0.71 | 2.34/1.16 | 2.26/1.23 | 2.17/1.26 | 2.01/1.34 | -- |
| 16 | 1.27/0.22 | 2.08/0.70 | 2.27/1.06 | 2.20/1.12 | 2.02/1.13 | 1.94/1.28 | -- |
**ndim=300, nband=240**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 1.16/0.25 | 2.22/0.88 | 2.47/1.39 | 2.37/1.43 | 2.46/1.47 | 2.24/1.51 | 2.12/1.55 |
| 8 | 1.60/0.29 | 2.82/0.97 | 3.20/1.57 | 3.15/1.65 | 3.17/1.73 | 2.64/1.76 | 2.43/1.78 |
| 16 | 1.84/0.16 | 3.14/0.56 | 3.46/0.87 | 3.31/0.89 | 3.28/0.94 | 2.60/0.97 | 2.55/1.05 |
**ndim=400, nband=320**

| cores \ block size | 1 | 4 | 16 | 20 | 32 | 50 | 64 |
|---|---|---|---|---|---|---|---|
| 4 | 1.39/0.29 | 2.63/1.08 | 2.94/1.73 | 2.88/1.76 | 2.93/1.81 | 2.71/1.81 | 2.40/1.78 |
| 8 | 2.12/0.25 | 3.55/0.82 | 4.00/1.33 | 3.97/1.38 | 3.96/1.46 | 3.60/1.46 | 3.01/1.42 |
| 16 | 2.52/0.18 | 4.51/0.68 | 4.97/1.11 | 4.84/1.14 | 4.67/1.16 | 4.12/1.22 | 3.28/1.21 |
**ndim=500, nband=400**

| cores \ block size | 16 | 20 | 32 | 50 | 64 | 128 |
|---|---|---|---|---|---|---|
| 4 | 3.36/1.96 | 3.28/2.02 | 3.39/2.07 | 3.14/2.06 | 2.96/2.03 | 2.26/1.95 |
| 8 | 4.89/1.32 | 4.71/1.34 | 4.80/1.36 | 4.34/1.38 | 4.15/1.39 | 2.72/1.22 |
| 16 | 6.18/1.39 | 5.87/1.40 | 6.07/1.45 | 5.09/1.48 | 4.66/1.55 | 2.89/1.54 |
Below are the absolute times of ELPA/ScaLAPACK on large matrices, for different core counts and block sizes. Each cell gives the ELPA time / ScaLAPACK time in ms (rows: cores, columns: block size).
**ndim=600, nband=500**

| cores \ block size | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| 4 | 357.33/595.67 | 348.67/570.67 | 397.00/579.33 | 482.00/580.33 |
| 8 | 251.67/854.00 | 239.67/805.33 | 274.00/804.33 | 394.33/868.33 |
| 16 | 193.00/818.33 | 181.67/775.33 | 233.67/738.67 | 367.33/641.67 |
**ndim=800, nband=600**

| cores \ block size | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| 4 | 667.00/1188.67 | 651.33/1125.67 | 731.00/1148.00 | 962.33/1216.00 |
| 8 | 447.33/1556.67 | 436.33/1481.33 | 516.00/1516.00 | 714.33/1639.67 |
| 16 | 337.67/1461.33 | 325.33/1394.00 | 394.33/1374.00 | 642.00/1404.00 |
**ndim=1000, nband=800**

| cores \ block size | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| 4 | 1150.00/2295.33 | 1163.67/2240.00 | 1286.67/2278.33 | 1607.33/2336.33 |
| 8 | 770.33/2767.00 | 763.67/2686.00 | 857.67/2779.67 | 1098.00/2957.33 |
| 16 | 544.67/2559.00 | 542.33/2474.00 | 612.00/2428.00 | 928.33/2387.33 |
**ndim=1200, nband=1000**

| cores \ block size | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| 4 | 1878.33/3905.33 | 1853.00/3731.33 | 2052.33/3772.33 | 2542.33/3962.00 |
| 8 | 1203.00/4494.00 | 1171.33/4352.33 | 1296.00/4452.33 | 1625.67/4744.67 |
| 16 | 831.67/4086.67 | 818.67/3938.67 | 923.67/3925.00 | 923.67/3925.00 |
Background
Currently, the subspace diagonalization in the Davidson (dav) solver is performed by LAPACK on a single core. For large systems the dimension of this subspace can reach several hundred, so it can be effectively accelerated by parallelization.
QE provides the same functionality, enabled by setting the value of `-nd` on the command line: https://www.quantum-espresso.org/Doc/user_guide/node20.html
Describe the solution you'd like
I will implement a function that distributes the H and S matrices into 2D blocks and then calls ELPA or ScaLAPACK to perform the diagonalization in parallel; a rough sketch of the idea is given below.
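A minimal sketch of what such a function could look like with ScaLAPACK's `pdsygvx` (an illustrative outline under my own assumptions, not the planned ABACUS implementation; `pdiag_hs` and its arguments are hypothetical, and the scattering of H/S into block-cyclic layout is elided):

```cpp
#include <algorithm>
#include <vector>

extern "C" {
    // BLACS / ScaLAPACK entry points (Fortran symbols).
    void Cblacs_get(int ctxt, int what, int* val);
    void Cblacs_gridinit(int* ctxt, const char* layout, int nprow, int npcol);
    void Cblacs_gridinfo(int ctxt, int* nprow, int* npcol, int* myrow, int* mycol);
    void Cblacs_gridexit(int ctxt);
    int  numroc_(const int* n, const int* nb, const int* iproc,
                 const int* isrcproc, const int* nprocs);
    void descinit_(int* desc, const int* m, const int* n, const int* mb, const int* nb,
                   const int* irsrc, const int* icsrc, const int* ictxt,
                   const int* lld, int* info);
    void pdsygvx_(const int* ibtype, const char* jobz, const char* range, const char* uplo,
                  const int* n, double* a, const int* ia, const int* ja, const int* desca,
                  double* b, const int* ib, const int* jb, const int* descb,
                  const double* vl, const double* vu, const int* il, const int* iu,
                  const double* abstol, int* m, int* nz, double* w, const double* orfac,
                  double* z, const int* iz, const int* jz, const int* descz,
                  double* work, const int* lwork, int* iwork, const int* liwork,
                  int* ifail, int* iclustr, double* gap, int* info);
}

// Hypothetical: solve H c = e S c for the lowest nband eigenpairs on an
// nprow x npcol process grid with block size nb (e.g. 16 or 32).
void pdiag_hs(int ndim, int nband, int nb, int nprow, int npcol)
{
    int ctxt, myrow, mycol;
    Cblacs_get(0, 0, &ctxt);                 // default system context
    Cblacs_gridinit(&ctxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);
    if (myrow < 0) return;                   // this rank is outside the grid

    // Local dimensions of the block-cyclic distributed ndim x ndim matrices.
    const int izero = 0, ione = 1;
    int mloc = numroc_(&ndim, &nb, &myrow, &izero, &nprow);
    int nloc = numroc_(&ndim, &nb, &mycol, &izero, &npcol);
    int desc[9], info, lld = std::max(1, mloc);
    descinit_(desc, &ndim, &ndim, &nb, &nb, &izero, &izero, &ctxt, &lld, &info);

    std::vector<double> locH(static_cast<size_t>(mloc) * nloc), locS(locH.size());
    // ... scatter the global H and S into locH/locS here ...

    std::vector<double> w(ndim), z(locH.size());
    std::vector<int> ifail(ndim), iclustr(2 * nprow * npcol);
    std::vector<double> gap(static_cast<size_t>(nprow) * npcol);
    const int ibtype = 1, il = 1, iu = nband;          // eigenpairs 1..nband
    const double vl = 0.0, vu = 0.0, abstol = 0.0, orfac = -1.0;
    int m, nz, lwork = -1, liwork = -1, iwk;
    double wk;

    // Workspace query first (lwork = liwork = -1), then the actual solve.
    pdsygvx_(&ibtype, "V", "I", "U", &ndim, locH.data(), &ione, &ione, desc,
             locS.data(), &ione, &ione, desc, &vl, &vu, &il, &iu, &abstol,
             &m, &nz, w.data(), &orfac, z.data(), &ione, &ione, desc,
             &wk, &lwork, &iwk, &liwork,
             ifail.data(), iclustr.data(), gap.data(), &info);
    lwork = static_cast<int>(wk);
    liwork = iwk;
    std::vector<double> work(lwork);
    std::vector<int> iwork(liwork);
    pdsygvx_(&ibtype, "V", "I", "U", &ndim, locH.data(), &ione, &ione, desc,
             locS.data(), &ione, &ione, desc, &vl, &vu, &il, &iu, &abstol,
             &m, &nz, w.data(), &orfac, z.data(), &ione, &ione, desc,
             work.data(), &lwork, iwork.data(), &liwork,
             ifail.data(), iclustr.data(), gap.data(), &info);
    // Eigenvalues are now in w[0..m-1]; eigenvectors are distributed in z.

    Cblacs_gridexit(ctxt);
}
```

Since ELPA accepts matrices in the same ScaLAPACK block-cyclic layout, a single 2D-distribution function should be able to serve both back ends.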