Enable fully distributed testing

This PR adds fully distributed testing. The distributed result matrix C is split into the lower-triangular part L and the upper-triangular part U. From the permutation vector, the distributed permutation matrix P is constructed. The initial matrix A is multiplied by P (equivalent to permuting the rows of A), and the Frobenius norm ||L*U - P*A|| is computed using COSTA and the provided ScaLAPACK. The result is considered correct if this norm is small enough.
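The check described above can be illustrated on a single process. The following is a minimal pure-Python sketch, not the distributed implementation from this PR: the helper names (split_lu, permutation_matrix, residual_norm) are hypothetical, and it assumes that L has a unit diagonal and that perm[i] = j means row i of P*A is row j of A, which is consistent with the L and P matrices in the sample output further down.

```python
import math

def split_lu(c):
    """Split the combined result C into unit-lower L and upper U (assumed layout)."""
    n = len(c)
    lower = [[c[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    upper = [[c[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return lower, upper

def permutation_matrix(perm):
    """Build P with P[i][perm[i]] = 1, so row i of P*A is row perm[i] of A."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(x, y):
    """Plain dense matrix product, standing in for pdgemm in this sketch."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def residual_norm(c, perm, a):
    """Frobenius norm of L*U - P*A."""
    lower, upper = split_lu(c)
    lu = matmul(lower, upper)
    pa = matmul(permutation_matrix(perm), a)
    return math.sqrt(sum((lu[i][j] - pa[i][j]) ** 2
                         for i in range(len(a)) for j in range(len(a[0]))))

# Hand-computed 2x2 example: pivoting swaps the two rows of A.
a = [[1.0, 2.0], [3.0, 4.0]]
perm = [1, 0]
c = [[3.0, 4.0], [1.0 / 3.0, 2.0 / 3.0]]  # U on and above the diagonal, L below it
norm = residual_norm(c, perm, a)
print(f"Total Frobenius norm = {norm:.4f}")
```

In the PR itself the same steps run on distributed matrices, so no rank needs to hold all of A, L, U or P.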

To enable the testing, it is necessary to build conflux as follows:

 cmake -DCONFLUX_BLAS=MKL -DCONFLUX_SCALAPACK=MKL -DCONFLUX_WITH_VALIDATION=ON ..

Observe that this requires scalapack, as pdgemm is used for matrix multiplication.

When running the miniapp, it is possible to specify a print limit, e.g. with -l 20, in which case matrices with dimension less than 20 are fully gathered to rank 0 and printed. This is useful for debugging.

For example, the full output looks as follows:

❯ mpirun -n 4 ./examples/conflux_miniapp -M 10 -N 10 -b 2 --p_grid=2,2,1 -l 30 -r 2
Rank: 0, M: 12, N: 12, P:4, v:2, Px:2, Py: 2, Pz: 1, Nt: 6, tA11x: 3, tA11y: 3
Runtime: 0.154 seconds
Runtime 0: 0.000704 seconds
Runtime 1: 0 seconds
Runtime 2: 0.143 seconds
Runtime 3: 0.00628 seconds
Runtime 4: 0.000993 seconds
Runtime 5: 0.00182 seconds
Runtime 6: 0.000289 seconds
Runtime 7: 0 seconds
Runtime: 0.0134 seconds
Runtime 0: 0.000202 seconds
Runtime 1: 0 seconds
Runtime 2: 0.0065 seconds
Runtime 3: 0.00515 seconds
Runtime 4: 0.000133 seconds
Runtime 5: 0.000745 seconds
Runtime 6: 4.9e-05 seconds
Runtime 7: 0 seconds
Rank [0, 0], local final result:
[ 0:]    5.869   5.028   5.825   5.786   5.870   5.095
[ 1:]    0.859   1.554   0.994   0.399   0.452   1.411
[ 2:]    0.993   0.614  -1.167  -0.460  -0.245  -0.691
[ 3:]    0.963   0.245   0.322  -0.607  -0.076   0.582
[ 4:]    0.952   0.681   0.731  -0.256  -1.093   1.068
[ 5:]    0.969   0.493  -0.006   0.397  -0.110  -0.596
Rank [0, 1], local final result:
[ 0:]    5.762   5.939   5.164   5.280   5.478   5.363
[ 1:]    0.351   0.695   0.751   1.286   0.763   1.204
[ 2:]    0.598   0.035   0.103  -0.528  -0.492  -0.766
[ 3:]   -0.343   0.466   0.193   0.352   0.325   0.297
[ 4:]   -0.316   0.519  -0.708  -0.812   0.009   0.146
[ 5:]   -0.051   0.618  -0.016  -0.276  -0.457  -0.610
Rank [1, 0], local final result:
[ 0:]    0.981   0.456  -0.262  -0.761  -0.689  -0.249
[ 1:]    0.922   0.463  -0.648  -0.248  -0.646   0.460
[ 2:]    0.942   0.456  -0.099   0.127  -0.294   0.223
[ 3:]    0.980   0.134   0.155  -0.517  -0.383   1.078
[ 4:]    0.876   0.442  -0.306   0.641  -0.617  -0.515
[ 5:]    0.939   0.741   0.263   0.557  -0.024   0.130
Rank [1, 1], local final result:
[ 0:]   -0.782  -0.842   0.111  -0.124  -0.581   0.078
[ 1:]   -0.356  -1.040   0.144   0.219   0.176   0.091
[ 2:]    0.241   0.384   0.638  -0.542   0.047   0.210
[ 3:]    0.848  -0.209  -0.099   0.851   0.505   0.045
[ 4:]   -0.262   0.305   0.834   0.083  -0.967  -0.521
[ 5:]    0.301   0.150   0.329  -0.803  -0.378  -0.566
full-A-matrix on rank 0
[ 0:]    5.755   5.639   5.028   5.298   5.903   5.094   5.517   5.640   5.274   5.390   5.140   5.885
[ 1:]    5.752   5.136   5.032   5.417   5.575   5.373   5.079   5.837   5.012   5.524   5.199   5.216
[ 2:]    5.653   5.224   5.904   5.695   5.267   5.061   5.413   5.728   5.545   5.913   5.912   5.526
[ 3:]    5.039   5.871   5.297   5.794   5.995   5.366   5.185   5.820   5.492   5.786   5.467   5.808
[ 4:]    5.685   5.637   5.794   5.496   5.753   5.449   5.522   5.808   5.748   5.149   5.356   5.321
[ 5:]    5.827   5.946   5.468   5.782   5.047   5.065   5.761   5.437   5.425   5.100   5.074   5.347
[ 6:]    5.590   5.847   5.973   5.856   5.119   5.714   5.044   5.275   5.048   5.512   5.135   5.276
[ 7:]    5.513   5.874   5.437   5.684   5.724   5.005   5.806   5.103   5.674   5.073   5.531   5.511
[ 8:]    5.144   5.094   5.409   5.416   5.773   5.123   5.497   5.302   5.751   5.804   5.049   5.547
[ 9:]    5.530   5.446   5.401   5.311   5.745   5.324   5.942   5.171   5.044   5.926   5.575   5.979
[10:]    5.869   5.028   5.762   5.939   5.825   5.786   5.164   5.280   5.870   5.095   5.478   5.363
[11:]    5.410   5.354   5.752   5.056   5.275   5.541   5.212   5.726   5.219   5.899   5.786   5.563
======================
full-C-matrix on rank 0
[ 0:]    5.869   5.028   5.762   5.939   5.825   5.786   5.164   5.280   5.870   5.095   5.478   5.363
[ 1:]    0.859   1.554   0.351   0.695   0.994   0.399   0.751   1.286   0.452   1.411   0.763   1.204
[ 2:]    0.981   0.456  -0.782  -0.842  -0.262  -0.761   0.111  -0.124  -0.689  -0.249  -0.581   0.078
[ 3:]    0.922   0.463  -0.356  -1.040  -0.648  -0.248   0.144   0.219  -0.646   0.460   0.176   0.091
[ 4:]    0.993   0.614   0.598   0.035  -1.167  -0.460   0.103  -0.528  -0.245  -0.691  -0.492  -0.766
[ 5:]    0.963   0.245  -0.343   0.466   0.322  -0.607   0.193   0.352  -0.076   0.582   0.325   0.297
[ 6:]    0.942   0.456   0.241   0.384  -0.099   0.127   0.638  -0.542  -0.294   0.223   0.047   0.210
[ 7:]    0.980   0.134   0.848  -0.209   0.155  -0.517  -0.099   0.851  -0.383   1.078   0.505   0.045
[ 8:]    0.952   0.681  -0.316   0.519   0.731  -0.256  -0.708  -0.812  -1.093   1.068   0.009   0.146
[ 9:]    0.969   0.493  -0.051   0.618  -0.006   0.397  -0.016  -0.276  -0.110  -0.596  -0.457  -0.610
[10:]    0.876   0.442  -0.262   0.305  -0.306   0.641   0.834   0.083  -0.617  -0.515  -0.967  -0.521
[11:]    0.939   0.741   0.301   0.150   0.263   0.557   0.329  -0.803  -0.024   0.130  -0.378  -0.566
======================
full-L-matrix on rank 0
[ 0:]    1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 1:]    0.859   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 2:]    0.981   0.456   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 3:]    0.922   0.463  -0.356   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 4:]    0.993   0.614   0.598   0.035   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 5:]    0.963   0.245  -0.343   0.466   0.322   1.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 6:]    0.942   0.456   0.241   0.384  -0.099   0.127   1.000   0.000   0.000   0.000   0.000   0.000
[ 7:]    0.980   0.134   0.848  -0.209   0.155  -0.517  -0.099   1.000   0.000   0.000   0.000   0.000
[ 8:]    0.952   0.681  -0.316   0.519   0.731  -0.256  -0.708  -0.812   1.000   0.000   0.000   0.000
[ 9:]    0.969   0.493  -0.051   0.618  -0.006   0.397  -0.016  -0.276  -0.110   1.000   0.000   0.000
[10:]    0.876   0.442  -0.262   0.305  -0.306   0.641   0.834   0.083  -0.617  -0.515   1.000   0.000
[11:]    0.939   0.741   0.301   0.150   0.263   0.557   0.329  -0.803  -0.024   0.130  -0.378   1.000
======================
full-U-matrix on rank 0
[ 0:]    5.869   5.028   5.762   5.939   5.825   5.786   5.164   5.280   5.870   5.095   5.478   5.363
[ 1:]    0.000   1.554   0.351   0.695   0.994   0.399   0.751   1.286   0.452   1.411   0.763   1.204
[ 2:]    0.000   0.000  -0.782  -0.842  -0.262  -0.761   0.111  -0.124  -0.689  -0.249  -0.581   0.078
[ 3:]    0.000   0.000   0.000  -1.040  -0.648  -0.248   0.144   0.219  -0.646   0.460   0.176   0.091
[ 4:]    0.000   0.000   0.000   0.000  -1.167  -0.460   0.103  -0.528  -0.245  -0.691  -0.492  -0.766
[ 5:]    0.000   0.000   0.000   0.000   0.000  -0.607   0.193   0.352  -0.076   0.582   0.325   0.297
[ 6:]    0.000   0.000   0.000   0.000   0.000   0.000   0.638  -0.542  -0.294   0.223   0.047   0.210
[ 7:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.851  -0.383   1.078   0.505   0.045
[ 8:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000  -1.093   1.068   0.009   0.146
[ 9:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000  -0.596  -0.457  -0.610
[10:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000  -0.967  -0.521
[11:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000  -0.566
======================
full-P-matrix on rank 0
[ 0:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   1.000   0.000
[ 1:]    0.000   0.000   0.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 2:]    1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 3:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   1.000
[ 4:]    0.000   0.000   0.000   0.000   0.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 5:]    0.000   0.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 6:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   1.000   0.000   0.000
[ 7:]    0.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 8:]    0.000   0.000   0.000   0.000   0.000   0.000   1.000   0.000   0.000   0.000   0.000   0.000
[ 9:]    0.000   0.000   0.000   0.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[10:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   1.000   0.000   0.000   0.000
[11:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   1.000   0.000   0.000   0.000   0.000
======================
full-Remainder-matrix on rank 0
[ 0:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 1:]    0.000   0.000   0.000   0.000   0.000  -0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 2:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 3:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 4:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 5:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000  -0.000   0.000
[ 6:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
[ 7:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000  -0.000   0.000   0.000
[ 8:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000  -0.000   0.000   0.000   0.000   0.000
[ 9:]    0.000   0.000   0.000   0.000   0.000   0.000   0.000  -0.000   0.000   0.000   0.000   0.000
[10:]    0.000   0.000   0.000  -0.000   0.000   0.000   0.000   0.000   0.000  -0.000  -0.000   0.000
[11:]    0.000   0.000   0.000   0.000  -0.000   0.000   0.000  -0.000   0.000   0.000   0.000   0.000
Total Frobenius norm = 0.0000

In this case, the print limit was large enough to allow gathering the full matrices to rank 0 and printing them. With a smaller print limit, only the timings and the norm are printed:

❯ mpirun -n 4 ./examples/conflux_miniapp -M 10 -N 10 -b 2 --p_grid=2,2,1 -l 8 -r 2
Rank: 0, M: 12, N: 12, P:4, v:2, Px:2, Py: 2, Pz: 1, Nt: 6, tA11x: 3, tA11y: 3
Runtime: 0.0696 seconds
Runtime 0: 0.000456 seconds
Runtime 1: 0 seconds
Runtime 2: 0.0572 seconds
Runtime 3: 0.00571 seconds
Runtime 4: 0.000448 seconds
Runtime 5: 0.00125 seconds
Runtime 6: 0.000166 seconds
Runtime 7: 0 seconds
Runtime: 0.0167 seconds
Runtime 0: 0.000229 seconds
Runtime 1: 0 seconds
Runtime 2: 0.0103 seconds
Runtime 3: 0.005 seconds
Runtime 4: 0.000165 seconds
Runtime 5: 0.000431 seconds
Runtime 6: 4.1e-05 seconds
Runtime 7: 0 seconds
Total Frobenius norm = 0.0000

In this case, everything is computed in a distributed fashion and no rank holds the full matrix.

In both cases, the total Frobenius norm of L*U - P*A is 0 to the printed precision, indicating that the result is correct.
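Since the norm is printed with four decimals, a value of 0.0000 only shows that the residual is below the printed precision. One way to make "small enough" explicit is a tolerance relative to the size of A. The sketch below is an assumption for illustration only: the function name is_factorization_valid and the default rel_tol are hypothetical and are not taken from conflux.

```python
def is_factorization_valid(residual_norm, a_norm, rel_tol=1e-10):
    """Accept the result if ||L*U - P*A||_F is tiny relative to ||A||_F.

    rel_tol is a hypothetical threshold chosen for this sketch.
    """
    return residual_norm <= rel_tol * a_norm

# A residual at the level of rounding error still prints as 0.0000.
tiny_residual = 3e-15
printed = f"Total Frobenius norm = {tiny_residual:.4f}"
print(printed)

accepted = is_factorization_valid(tiny_residual, a_norm=66.0)
rejected = is_factorization_valid(0.5, a_norm=66.0)
print(accepted, rejected)
```

Scaling the threshold by the norm of A keeps the check meaningful when the matrix entries are much larger or smaller than 1.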

eth-cscs / conflux

Enable fully distributed testing #20