easybuilders / easybuild-easyconfigs

A collection of easyconfig files that describe which software to build using which build options with EasyBuild.
https://easybuild.io
GNU General Public License v2.0

Performance discrepancy in ELSI tests #14172

Open ocaisa opened 3 years ago

ocaisa commented 3 years ago

In https://github.com/easybuilders/easybuild-easyconfigs/pull/14133, I saw a large discrepancy between the test times for foss and intel (~2 minutes compared to ~30 minutes).

Tagging @micaeljtoliveira

migueldiascosta commented 3 years ago

@ocaisa, fwiw, I get (on an AMD Naples node) 1 min 18 secs for foss and 1 min 39 secs for intel, so the problem seems to be system dependent?

ocaisa commented 3 years ago

That's good, maybe it was just other processes on the login node of Generoso where I ran the test.

branfosj commented 3 years ago

I'm also seeing this. I was running inside a cgroup of 8 cores out of a 40 core cascadelake box:

And looking at the logs, each of the tests is about 25 times slower.

Edit: similar results with access to the full node as well. I stopped the foss run after 10 minutes, compared with 57.05 sec for intel.

akesandgren commented 3 years ago

Then my guess would be OMP_NUM_THREADS and OpenBLAS
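One way to check this guess is to rerun the test suite with the thread counts pinned and compare timings. The following is a minimal sketch, assuming the tests are launched with `ctest` from the build directory. `OMP_NUM_THREADS` and `OPENBLAS_NUM_THREADS` are the standard OpenMP and OpenBLAS environment variables; the helper names `thread_capped_env` and `run_ctest` are made up for illustration.

```python
import os
import subprocess


def thread_capped_env(num_threads, base_env=None):
    """Return a copy of the environment with OpenMP/OpenBLAS threads capped."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)
    env["OPENBLAS_NUM_THREADS"] = str(num_threads)
    return env


def run_ctest(build_dir, num_threads):
    """Run ctest in build_dir with capped threads and return its exit code.

    Assumes a ctest executable is on PATH and build_dir holds the CTest files.
    """
    result = subprocess.run(
        ["ctest"], cwd=build_dir, env=thread_capped_env(num_threads)
    )
    return result.returncode
```

Running the suite once with `num_threads=1` and once with a larger value should show whether the slowdown tracks the thread count.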

ocaisa commented 3 years ago

I did notice that the MPI tests were using 200% CPU (even though the MPI call had -np 4 and Generoso has 8 cores)
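For reference, the arithmetic behind a sensible thread count per rank can be sketched as follows. The 4 ranks and 8 cores come from the comment above; the case where every rank starts one thread per core is only an assumption about what an unconstrained threading runtime might do, not something measured here. The function names are hypothetical.

```python
def thread_budget(cores, mpi_ranks):
    """Threads each MPI rank can use without oversubscribing the cores."""
    if cores <= 0 or mpi_ranks <= 0:
        raise ValueError("cores and mpi_ranks must be positive")
    return max(1, cores // mpi_ranks)


def oversubscription(cores, mpi_ranks, threads_per_rank):
    """Ratio of runnable threads to cores; above 1.0 means oversubscribed."""
    return (mpi_ranks * threads_per_rank) / cores


# 4 ranks on an 8-core machine leave room for 2 threads per rank.
print(thread_budget(8, 4))        # 2
# Assumed worst case: each of the 4 ranks starts 8 threads.
print(oversubscription(8, 4, 8))  # 4.0
```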

ocaisa commented 3 years ago

For comparison, the build logs on Generoso for the foss and intel tests:

Test project /home/ocaisa/.local/easybuild/build/ELSI/2.6.4/foss-2020b-PEXSI/easybuild_obj
      Start  1: test_fortran_01_elpa
 1/56 Test  #1: test_fortran_01_elpa .............   Passed   28.51 sec
      Start  2: test_fortran_02_elpa
 2/56 Test  #2: test_fortran_02_elpa .............   Passed   22.86 sec
      Start  3: test_fortran_03_elpa
 3/56 Test  #3: test_fortran_03_elpa .............   Passed   29.36 sec
      Start  4: test_fortran_04_elpa
 4/56 Test  #4: test_fortran_04_elpa .............   Passed   21.66 sec
      Start  5: test_fortran_05_elpa
 5/56 Test  #5: test_fortran_05_elpa .............   Passed   28.47 sec
      Start  6: test_fortran_06_elpa
 6/56 Test  #6: test_fortran_06_elpa .............   Passed   24.98 sec
      Start  7: test_fortran_07_elpa
 7/56 Test  #7: test_fortran_07_elpa .............   Passed   28.61 sec
      Start  8: test_fortran_08_elpa
 8/56 Test  #8: test_fortran_08_elpa .............   Passed   22.14 sec
      Start  9: test_fortran_09_elpa
 9/56 Test  #9: test_fortran_09_elpa .............   Passed   25.28 sec
      Start 10: test_fortran_10_elpa
10/56 Test #10: test_fortran_10_elpa .............   Passed   14.28 sec
      Start 11: test_fortran_11_elpa
11/56 Test #11: test_fortran_11_elpa .............   Passed   25.44 sec
      Start 12: test_fortran_12_elpa
12/56 Test #12: test_fortran_12_elpa .............   Passed   14.31 sec
      Start 13: test_fortran_13_elpa
13/56 Test #13: test_fortran_13_elpa .............   Passed   25.59 sec
      Start 14: test_fortran_14_elpa
14/56 Test #14: test_fortran_14_elpa .............   Passed   14.25 sec
      Start 15: test_fortran_15_elpa
15/56 Test #15: test_fortran_15_elpa .............   Passed   26.60 sec
      Start 16: test_fortran_16_elpa
16/56 Test #16: test_fortran_16_elpa .............   Passed   14.56 sec
      Start 17: test_fortran_01_omm
17/56 Test #17: test_fortran_01_omm ..............   Passed   12.23 sec
      Start 18: test_fortran_02_omm
18/56 Test #18: test_fortran_02_omm ..............   Passed   13.99 sec
      Start 19: test_fortran_03_omm
19/56 Test #19: test_fortran_03_omm ..............   Passed   11.97 sec
      Start 20: test_fortran_04_omm
20/56 Test #20: test_fortran_04_omm ..............   Passed    9.63 sec
      Start 21: test_fortran_05_omm
21/56 Test #21: test_fortran_05_omm ..............   Passed   11.79 sec
      Start 22: test_fortran_06_omm
22/56 Test #22: test_fortran_06_omm ..............   Passed    9.63 sec
      Start 23: test_fortran_07_omm
23/56 Test #23: test_fortran_07_omm ..............   Passed   11.58 sec
      Start 24: test_fortran_08_omm
24/56 Test #24: test_fortran_08_omm ..............   Passed    9.32 sec
      Start 25: test_fortran_01_pexsi
25/56 Test #25: test_fortran_01_pexsi ............   Passed  151.37 sec
      Start 26: test_fortran_02_pexsi
26/56 Test #26: test_fortran_02_pexsi ............   Passed  166.81 sec
      Start 27: test_fortran_03_pexsi
27/56 Test #27: test_fortran_03_pexsi ............   Passed  151.99 sec
      Start 28: test_fortran_04_pexsi
28/56 Test #28: test_fortran_04_pexsi ............   Passed  163.70 sec
      Start 29: test_fortran_05_pexsi
29/56 Test #29: test_fortran_05_pexsi ............   Passed  152.93 sec
      Start 30: test_fortran_06_pexsi
30/56 Test #30: test_fortran_06_pexsi ............   Passed  167.22 sec
      Start 31: test_fortran_07_pexsi
31/56 Test #31: test_fortran_07_pexsi ............   Passed  153.46 sec
      Start 32: test_fortran_08_pexsi
32/56 Test #32: test_fortran_08_pexsi ............   Passed  162.83 sec
      Start 33: test_fortran_01_ntpoly
33/56 Test #33: test_fortran_01_ntpoly ...........   Passed   28.29 sec
      Start 34: test_fortran_02_ntpoly
34/56 Test #34: test_fortran_02_ntpoly ...........   Passed   25.97 sec
      Start 35: test_fortran_03_ntpoly
35/56 Test #35: test_fortran_03_ntpoly ...........   Passed   28.32 sec
      Start 36: test_fortran_04_ntpoly
36/56 Test #36: test_fortran_04_ntpoly ...........   Passed   29.14 sec
      Start 37: test_fortran_05_ntpoly
37/56 Test #37: test_fortran_05_ntpoly ...........   Passed   28.67 sec
      Start 38: test_fortran_06_ntpoly
38/56 Test #38: test_fortran_06_ntpoly ...........   Passed   26.51 sec
      Start 39: test_fortran_07_ntpoly
39/56 Test #39: test_fortran_07_ntpoly ...........   Passed   28.05 sec
      Start 40: test_fortran_08_ntpoly
40/56 Test #40: test_fortran_08_ntpoly ...........   Passed   27.69 sec
      Start 41: test_serial_01_lapack
41/56 Test #41: test_serial_01_lapack ............   Passed    0.37 sec
      Start 42: test_serial_02_lapack
42/56 Test #42: test_serial_02_lapack ............   Passed    0.52 sec
      Start 43: test_matio_01
43/56 Test #43: test_matio_01 ....................   Passed    0.25 sec
      Start 44: test_matio_02
44/56 Test #44: test_matio_02 ....................   Passed    0.28 sec
      Start 45: test_matio_03
45/56 Test #45: test_matio_03 ....................   Passed    0.37 sec
      Start 46: test_matio_04
46/56 Test #46: test_matio_04 ....................   Passed    0.38 sec
      Start 47: test_c_01_elpa
47/56 Test #47: test_c_01_elpa ...................   Passed   12.03 sec
      Start 48: test_c_02_elpa
48/56 Test #48: test_c_02_elpa ...................   Passed    5.39 sec
      Start 49: test_c_03_elpa
49/56 Test #49: test_c_03_elpa ...................   Passed    6.33 sec
      Start 50: test_c_04_elpa
50/56 Test #50: test_c_04_elpa ...................   Passed    4.08 sec
      Start 51: test_c_01_omm
51/56 Test #51: test_c_01_omm ....................   Passed    6.44 sec
      Start 52: test_c_02_omm
52/56 Test #52: test_c_02_omm ....................   Passed    4.09 sec
      Start 53: test_c_01_pexsi
53/56 Test #53: test_c_01_pexsi ..................   Passed   43.61 sec
      Start 54: test_c_02_pexsi
54/56 Test #54: test_c_02_pexsi ..................   Passed   87.91 sec
      Start 55: test_c_01_ntpoly
55/56 Test #55: test_c_01_ntpoly .................   Passed    8.71 sec
      Start 56: test_c_02_ntpoly
56/56 Test #56: test_c_02_ntpoly .................   Passed    8.01 sec

100% tests passed, 0 tests failed out of 56

Total Test time (real) = 2139.32 sec

and

Test project /home/ocaisa/.local/easybuild/build/ELSI/2.6.4/intel-2020b-PEXSI/easybuild_obj
      Start  1: test_fortran_01_elpa
 1/56 Test  #1: test_fortran_01_elpa .............   Passed    1.55 sec
      Start  2: test_fortran_02_elpa
 2/56 Test  #2: test_fortran_02_elpa .............   Passed    0.75 sec
      Start  3: test_fortran_03_elpa
 3/56 Test  #3: test_fortran_03_elpa .............   Passed    0.72 sec
      Start  4: test_fortran_04_elpa
 4/56 Test  #4: test_fortran_04_elpa .............   Passed    0.80 sec
      Start  5: test_fortran_05_elpa
 5/56 Test  #5: test_fortran_05_elpa .............   Passed    0.70 sec
      Start  6: test_fortran_06_elpa
 6/56 Test  #6: test_fortran_06_elpa .............   Passed    0.75 sec
      Start  7: test_fortran_07_elpa
 7/56 Test  #7: test_fortran_07_elpa .............   Passed    0.73 sec
      Start  8: test_fortran_08_elpa
 8/56 Test  #8: test_fortran_08_elpa .............   Passed    0.72 sec
      Start  9: test_fortran_09_elpa
 9/56 Test  #9: test_fortran_09_elpa .............   Passed    0.68 sec
      Start 10: test_fortran_10_elpa
10/56 Test #10: test_fortran_10_elpa .............   Passed    0.79 sec
      Start 11: test_fortran_11_elpa
11/56 Test #11: test_fortran_11_elpa .............   Passed    0.68 sec
      Start 12: test_fortran_12_elpa
12/56 Test #12: test_fortran_12_elpa .............   Passed    0.77 sec
      Start 13: test_fortran_13_elpa
13/56 Test #13: test_fortran_13_elpa .............   Passed    0.72 sec
      Start 14: test_fortran_14_elpa
14/56 Test #14: test_fortran_14_elpa .............   Passed    0.79 sec
      Start 15: test_fortran_15_elpa
15/56 Test #15: test_fortran_15_elpa .............   Passed    0.75 sec
      Start 16: test_fortran_16_elpa
16/56 Test #16: test_fortran_16_elpa .............   Passed    0.81 sec
      Start 17: test_fortran_01_omm
17/56 Test #17: test_fortran_01_omm ..............   Passed    0.67 sec
      Start 18: test_fortran_02_omm
18/56 Test #18: test_fortran_02_omm ..............   Passed    0.82 sec
      Start 19: test_fortran_03_omm
19/56 Test #19: test_fortran_03_omm ..............   Passed    0.78 sec
      Start 20: test_fortran_04_omm
20/56 Test #20: test_fortran_04_omm ..............   Passed    0.85 sec
      Start 21: test_fortran_05_omm
21/56 Test #21: test_fortran_05_omm ..............   Passed    0.76 sec
      Start 22: test_fortran_06_omm
22/56 Test #22: test_fortran_06_omm ..............   Passed    0.89 sec
      Start 23: test_fortran_07_omm
23/56 Test #23: test_fortran_07_omm ..............   Passed    0.77 sec
      Start 24: test_fortran_08_omm
24/56 Test #24: test_fortran_08_omm ..............   Passed    0.86 sec
      Start 25: test_fortran_01_pexsi
25/56 Test #25: test_fortran_01_pexsi ............   Passed    3.91 sec
      Start 26: test_fortran_02_pexsi
26/56 Test #26: test_fortran_02_pexsi ............   Passed    5.31 sec
      Start 27: test_fortran_03_pexsi
27/56 Test #27: test_fortran_03_pexsi ............   Passed    3.88 sec
      Start 28: test_fortran_04_pexsi
28/56 Test #28: test_fortran_04_pexsi ............   Passed    5.35 sec
      Start 29: test_fortran_05_pexsi
29/56 Test #29: test_fortran_05_pexsi ............   Passed    3.99 sec
      Start 30: test_fortran_06_pexsi
30/56 Test #30: test_fortran_06_pexsi ............   Passed    5.23 sec
      Start 31: test_fortran_07_pexsi
31/56 Test #31: test_fortran_07_pexsi ............   Passed    4.05 sec
      Start 32: test_fortran_08_pexsi
32/56 Test #32: test_fortran_08_pexsi ............   Passed    5.31 sec
      Start 33: test_fortran_01_ntpoly
33/56 Test #33: test_fortran_01_ntpoly ...........   Passed    3.28 sec
      Start 34: test_fortran_02_ntpoly
34/56 Test #34: test_fortran_02_ntpoly ...........   Passed    5.45 sec
      Start 35: test_fortran_03_ntpoly
35/56 Test #35: test_fortran_03_ntpoly ...........   Passed    2.78 sec
      Start 36: test_fortran_04_ntpoly
36/56 Test #36: test_fortran_04_ntpoly ...........   Passed   25.67 sec
      Start 37: test_fortran_05_ntpoly
37/56 Test #37: test_fortran_05_ntpoly ...........   Passed    2.98 sec
      Start 38: test_fortran_06_ntpoly
38/56 Test #38: test_fortran_06_ntpoly ...........   Passed    5.06 sec
      Start 39: test_fortran_07_ntpoly
39/56 Test #39: test_fortran_07_ntpoly ...........   Passed    3.02 sec
      Start 40: test_fortran_08_ntpoly
40/56 Test #40: test_fortran_08_ntpoly ...........   Passed    5.55 sec
      Start 41: test_serial_01_lapack
41/56 Test #41: test_serial_01_lapack ............   Passed    0.50 sec
      Start 42: test_serial_02_lapack
42/56 Test #42: test_serial_02_lapack ............   Passed    0.60 sec
      Start 43: test_matio_01
43/56 Test #43: test_matio_01 ....................   Passed    0.44 sec
      Start 44: test_matio_02
44/56 Test #44: test_matio_02 ....................   Passed    0.45 sec
      Start 45: test_matio_03
45/56 Test #45: test_matio_03 ....................   Passed    0.73 sec
      Start 46: test_matio_04
46/56 Test #46: test_matio_04 ....................   Passed    0.70 sec
      Start 47: test_c_01_elpa
47/56 Test #47: test_c_01_elpa ...................   Passed    0.68 sec
      Start 48: test_c_02_elpa
48/56 Test #48: test_c_02_elpa ...................   Passed    0.69 sec
      Start 49: test_c_03_elpa
49/56 Test #49: test_c_03_elpa ...................   Passed    0.65 sec
      Start 50: test_c_04_elpa
50/56 Test #50: test_c_04_elpa ...................   Passed    0.69 sec
      Start 51: test_c_01_omm
51/56 Test #51: test_c_01_omm ....................   Passed    0.72 sec
      Start 52: test_c_02_omm
52/56 Test #52: test_c_02_omm ....................   Passed    0.69 sec
      Start 53: test_c_01_pexsi
53/56 Test #53: test_c_01_pexsi ..................   Passed    1.39 sec
      Start 54: test_c_02_pexsi
54/56 Test #54: test_c_02_pexsi ..................   Passed    1.85 sec
      Start 55: test_c_01_ntpoly
55/56 Test #55: test_c_01_ntpoly .................   Passed    1.38 sec
      Start 56: test_c_02_ntpoly
56/56 Test #56: test_c_02_ntpoly .................   Passed    2.17 sec

100% tests passed, 0 tests failed out of 56

Total Test time (real) = 124.60 sec
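
The totals above differ by roughly a factor of 17 (2139.32 sec against 124.60 sec). To see how the slowdown is distributed over individual tests, the two logs can be compared line by line. This is a small sketch that assumes the ctest output format shown in the logs above; the sample lines are copied from those logs, and the function names are illustrative.

```python
import re

# Matches ctest result lines such as:
#  " 1/56 Test  #1: test_fortran_01_elpa .............   Passed   28.51 sec"
RESULT_LINE = re.compile(r"Test\s+#\d+:\s+(\S+)\s+\.+\s+Passed\s+([\d.]+)\s+sec")


def parse_times(log_text):
    """Map test name -> elapsed seconds for every passed test in a ctest log."""
    return {name: float(sec) for name, sec in RESULT_LINE.findall(log_text)}


def slowdown(slow_log, fast_log):
    """Per-test ratio slow/fast for tests that appear in both logs."""
    slow, fast = parse_times(slow_log), parse_times(fast_log)
    return {
        name: slow[name] / fast[name]
        for name in slow
        if name in fast and fast[name] > 0
    }


foss = """
 1/56 Test  #1: test_fortran_01_elpa .............   Passed   28.51 sec
25/56 Test #25: test_fortran_01_pexsi ............   Passed  151.37 sec
"""
intel = """
 1/56 Test  #1: test_fortran_01_elpa .............   Passed    1.55 sec
25/56 Test #25: test_fortran_01_pexsi ............   Passed    3.91 sec
"""

for name, ratio in slowdown(foss, intel).items():
    print(f"{name}: {ratio:.1f}x slower with foss")
```

Feeding in the full logs instead of the two-line samples gives one ratio per test, which makes it easy to see whether any back-end stands out.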

migueldiascosta commented 3 years ago

Weird. I just tested again: I'm not restricting the number of threads, and I do see both the foss and intel tests using many threads per process, which is probably suboptimal, but both are still quite fast for me, and foss is actually a bit faster...

Can anyone else test on AMD, to see whether this has anything to do with Intel (e.g. AVX-512)?

micaeljtoliveira commented 3 years ago

I see exactly the same discrepancy on an Intel CPU.

When running the foss tests, I can see 300-400% CPU usage with 4 MPI processes, which seems correct (the node has 16 cores). Restricting the number of OpenMP threads to 1 makes the tests considerably slower.

I think it's also clear that the problem comes from ELSI itself, as the tests seem to run slower on foss for all the back-ends.

Weird.

migueldiascosta commented 3 years ago

indeed, on Haswell it also takes too long for me. So it's not just the threads, it also depends on the cpu (?)

ocaisa commented 3 years ago

In https://github.com/easybuilders/easybuild-easyconfigs/pull/14180, @akesandgren includes a later ELSI with foss/2021a. The tests are faster with that (12 minutes), but there is still a lot of thread activity and I suspect it is still slower than intel. (UPDATE: the intel version was included in https://github.com/easybuilders/easybuild-easyconfigs/pull/14183 and the tests take under 2 minutes.)

migueldiascosta commented 3 years ago

One difference I see is that on AMD all the processes and threads are using the same socket. On Intel, the processes span both sockets, which isn't a problem by itself, but threads with the same parent process also span different sockets, which could explain the performance difference.

I'm not using any explicit affinity setting in these tests, so I suppose it's either OpenMPI or OpenBLAS that's setting different threading affinities on Intel and AMD?

On the other hand, if that were the case, surely this would have been noticed before (?)

curiouser and curiouser...
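
A rough way to confirm whether the threads of one process are allowed to run on more than one socket is to compare each thread's CPU affinity with the socket of each CPU. The sketch below is Linux-only and assumes the usual sysfs and procfs layout (`/sys/devices/system/cpu/cpuN/topology/physical_package_id` and `/proc/<pid>/task`). The function names are hypothetical. Note that an affinity mask shows where a thread may run, not where it is currently running.

```python
import os


def cpu_socket(cpu):
    """Socket id of a logical CPU, read from Linux sysfs (Linux-only)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/physical_package_id"
    with open(path) as handle:
        return int(handle.read().strip())


def thread_affinities(pid):
    """Map thread id -> set of allowed CPUs for every thread of pid (Linux-only)."""
    tids = [int(tid) for tid in os.listdir(f"/proc/{pid}/task")]
    return {tid: os.sched_getaffinity(tid) for tid in tids}


def spans_sockets(thread_cpus, socket_of=cpu_socket):
    """True if the threads of one process may run on more than one socket."""
    sockets = set()
    for cpus in thread_cpus.values():
        sockets |= {socket_of(cpu) for cpu in cpus}
    return len(sockets) > 1


# Made-up topology for illustration: CPUs 0-3 on socket 0, CPUs 4-7 on socket 1.
def fake_socket(cpu):
    return 0 if cpu < 4 else 1


print(spans_sockets({101: {0, 1}, 102: {2, 3}}, fake_socket))  # False
print(spans_sockets({101: {0, 1}, 102: {4, 5}}, fake_socket))  # True
```

On a real node, `spans_sockets(thread_affinities(pid))` would be called with the PID of one MPI rank while a test is running. If it returns True, pinning via the standard OpenMP variables `OMP_PROC_BIND` and `OMP_PLACES`, or via Open MPI's `--bind-to` option, may be worth trying; whether that removes the slowdown seen here is untested.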

ocaisa commented 3 years ago

@vyu16 Maybe you have some hints here?

volkerblum commented 3 years ago

What is the environment under which this runs? If this happened all the time, we would definitely know. However, in other contexts, I have seen similar issues caused by a misconfigured Slurm (i.e., internal defaults in some Slurm versions that can be catastrophic). Is Slurm involved here?

migueldiascosta commented 3 years ago

@volkerblum I ran all my tests, including the slow ones on Intel Haswell, in an ssh session to a dedicated node that I had taken offline from the batch system