eth-cscs / DLA-Future

DLA-Future
https://eth-cscs.github.io/DLA-Future/master/
BSD 3-Clause "New" or "Revised" License

Some unit tests are too slow #568

Open teonnik opened 2 years ago

teonnik commented 2 years ago

Anything over 5-10s is IMO too much. The waiting time may discourage developers from running unit tests frequently enough. On my laptop with an Intel i7-8550U (8) @ 4.000GHz, the following tests take too long to execute:

43/51 Test #43: test_reduction_to_band ...........   Passed   78.29 sec
44/51 Test #44: test_bt_reduction_to_band ........   Passed   55.41 sec
45/51 Test #45: test_gen_to_std ..................   Passed   17.93 sec
46/51 Test #46: test_cholesky ....................   Passed   15.64 sec
47/51 Test #47: test_compute_t_factor ............   Passed   28.92 sec
49/51 Test #49: test_multiplication_triangular ...   Passed  110.95 sec
51/51 Test #51: test_triangular ..................   Passed  349.46 sec

msimberg commented 2 years ago

I agree with the tests taking very long to finish. There may be multiple reasons for it, but I wonder if one of them is simply oversubscription. I think most of those tests run with 6 ranks, and probably an unconstrained number of threads. Just for comparison, could you try running e.g. test_triangular with --pika:threads=2 --pika:bind=none? There would still be oversubscription, but not as much so I would maybe expect the test to finish faster.
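To put rough numbers on the oversubscription idea, here is a back-of-the-envelope sketch in Python. It assumes that an unconstrained run starts one worker thread per hardware thread in every rank, and that the "(8)" in the CPU description means 8 hardware threads; neither is confirmed in this thread. The function name is made up for illustration.

```python
def oversubscription(ranks: int, threads_per_rank: int, hw_threads: int) -> float:
    """Return how many software threads compete for each hardware thread."""
    return ranks * threads_per_rank / hw_threads


HW_THREADS = 8  # assumption: the laptop exposes 8 hardware threads
RANKS = 6       # most of the slow tests are said to run with 6 ranks

# Assumption: without --pika:threads, each rank uses all hardware threads.
unconstrained = oversubscription(RANKS, HW_THREADS, HW_THREADS)
# With --pika:threads=2, each rank uses only 2 worker threads.
limited = oversubscription(RANKS, 2, HW_THREADS)

print(f"unconstrained: {unconstrained:.1f}x, --pika:threads=2: {limited:.1f}x")
```

Under these assumptions the machine goes from roughly 6 threads per hardware thread to 1.5, which is still oversubscribed but far less so.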

teonnik commented 2 years ago

Yes, indeed, that helped. The unit tests sped up by more than a factor of two. For example:

test_compute_t_factor ~ 10s
test_triangular ~ 150s

But even so, some tests still take a while to finish.
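For reference, the speedup can be worked out from the timings quoted in this thread. This is only a quick sketch; the "after" values are the approximate figures given above, so the ratios are approximate too.

```python
# Timings in seconds, taken from the ctest output and the follow-up comment.
before = {"test_compute_t_factor": 28.92, "test_triangular": 349.46}
after = {"test_compute_t_factor": 10.0, "test_triangular": 150.0}  # approximate

speedup = {name: before[name] / after[name] for name in before}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x")
```

Even at this speedup, test_triangular at about 150 s is still well above the 5-10 s target from the first comment.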

msimberg commented 2 years ago

Do we have a way to restrict a unit test to only use e.g. 4 ranks (for your case of 8 cores), or similar?

rasolca commented 2 years ago

The problem with the triangular solver and multiplication is the 24 different cases that have to be tested (left/right, upper/lower, no-trans/trans/conj-trans, non-unit/unit diagonal), each on different grids. (Note: distributed triangular multiplication doesn't support transposed and conjugate-transposed yet, so it is faster.)
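The count of 24 follows from combining the four options. A minimal sketch in Python, where the option labels follow the usual BLAS naming (side, uplo, op, diag) and are an interpretation of the shorthand above, not names taken from the DLA-Future test code:

```python
from itertools import product

# Assumed labels, following BLAS conventions.
sides = ("left", "right")
uplos = ("upper", "lower")
ops = ("no-trans", "trans", "conj-trans")
diags = ("non-unit", "unit")

# Every combination is one test case: 2 * 2 * 3 * 2.
cases = list(product(sides, uplos, ops, diags))

print(len(cases))
```

Each of these combinations is then repeated for every grid configuration, which is why the total run time grows so quickly.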

Implementing a CMake flag to reduce the 6-rank tests to 4 ranks should be easy, but it would remove the most important test: a non-square grid with non-trivial communicators in both dimensions.
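Why 4 ranks cannot cover that case can be checked by listing the possible 2D grids. A small sketch in Python, with hypothetical helper names, treating a communicator as "trivial" when its dimension has size 1:

```python
def grids(ranks: int) -> list[tuple[int, int]]:
    """All (rows, cols) process grids that use exactly `ranks` ranks."""
    return [(r, ranks // r) for r in range(1, ranks + 1) if ranks % r == 0]


def non_square_non_trivial(ranks: int) -> list[tuple[int, int]]:
    """Grids that are non-square and have more than one rank in both dimensions."""
    return [(r, c) for r, c in grids(ranks) if r != c and r > 1 and c > 1]


print(non_square_non_trivial(6))  # [(2, 3), (3, 2)]
print(non_square_non_trivial(4))  # []
```

With 4 ranks the only options are 1x4, 4x1 (one trivial dimension) and 2x2 (square), so 6 is the smallest rank count that gives a non-square grid with non-trivial communicators in both dimensions.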

This issue can be linked with #557. My idea is to split some of the tests (BLAS/LAPACK/DLAF algorithms) into two parts:

Other possible TODOs: