Closed: rohany closed this issue 2 years ago
@rohany Some fluctuations are normal for CTF scaling performance, as CTF chooses a processor grid at run time based on performance-model estimates over a large space of configurations. Sometimes redistributions may be required given an unsuitable initial layout, or CTF may decide to use a 3D algorithm rather than a 2D one. It's possible to tune the model parameters for a particular architecture and for particular kernels, but some fluctuation may be unavoidable: the performance models do not account for network topology or the number of processes per node, among other variables that affect performance. Also, when the number of nodes is an even power of two (2^{2k}), as opposed to an odd one (2^{2k+1}), a square processor grid is possible and more communication-efficient.
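To illustrate the last point, here is a minimal sketch (not CTF's actual grid-selection code) of why node counts that are an even power of two admit a square 2D processor grid, while odd powers force a 2:1 rectangular grid with more communication along one dimension:

```python
import math

def best_2d_grid(p):
    """Return the (rows, cols) factorization of p with rows <= cols
    and the smallest aspect ratio cols / rows."""
    best = (1, p)
    for r in range(1, int(math.isqrt(p)) + 1):
        if p % r == 0:
            best = (r, p // r)  # r increases, so the last hit is most square
    return best

for p in [4, 8, 16, 32, 64]:
    r, c = best_2d_grid(p)
    # In a 2D SUMMA-style algorithm, per-process communication volume grows
    # roughly like n^2/r + n^2/c, which a square grid (r == c) minimizes.
    print(p, (r, c), "aspect ratio:", c / r)
```

Running this shows aspect ratio 1.0 at 4, 16, and 64 nodes but 2.0 at 8 and 32 nodes, which lines up with performance alternating between consecutive powers of two.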
For plain matrix multiplication, I believe recent libraries have been developed (https://dl.acm.org/doi/abs/10.1145/3295500.3356181) that outperform CTF in the distributed setting.
Closing this, as my question was resolved at that time.
Hi! I'm doing some experiments using CTF on the Lassen supercomputer. I've compiled CTF without any extensions and am using OpenBLAS as the BLAS library. I'm not seeing the performance I expect when weak scaling the examples/matmul program. I'm seeing this performance at the following node counts (using all 20 cores available on each node):
The invocation used at one node is:
Problem sizes were weak scaled from this size.
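For context, a minimal sketch of one common way to weak-scale a dense matmul: hold memory per node constant, so for square n x n matrices the dimension grows as sqrt(nodes). The base dimension n0 = 8192 here is a hypothetical value for illustration, not the size actually used in these runs:

```python
import math

def weak_scaled_dim(n0, nodes):
    # Per-node storage is about 3 * n^2 / nodes words, which stays
    # constant when n = n0 * sqrt(nodes).
    return int(round(n0 * math.sqrt(nodes)))

for nodes in [1, 2, 4, 8, 16]:
    print(nodes, weak_scaled_dim(8192, nodes))
```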
I didn't expect fluctuations of this sort, or such poor weak scaling. Is this expected behavior, or are there things to try to improve the performance here?