Closed: rohany closed this issue 2 years ago
@rohany Some fluctuations are normal for CTF scaling performance, as CTF chooses a processor grid at run time based on performance-model estimates over a large space of configurations. Sometimes redistributions may be required given an unsuitable initial layout, or CTF may decide to use a 3D algorithm rather than a 2D one. It's possible to tune the model parameters for a particular architecture and for particular kernels, but some fluctuation may be unavoidable: the performance models do not account for network topology or the number of processes per node, among other variables that affect performance. Also, when the number of nodes is an even power of two (2^{2k}), as opposed to an odd one (2^{2k+1}), a square processor grid is possible and more communication-efficient.
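To illustrate the last point, here is a minimal sketch (not CTF's actual grid-selection code) of why node counts that are an even power of two admit a square 2D processor grid, while odd powers force a 2:1 rectangular grid with more communication along one dimension:

```python
import math

def best_2d_grid(p):
    """Return the (rows, cols) factorization of p with rows <= cols
    and the smallest aspect ratio cols / rows."""
    best = (1, p)
    for r in range(1, int(math.isqrt(p)) + 1):
        if p % r == 0:
            best = (r, p // r)  # r increases, so the last hit is most square
    return best

for p in [4, 8, 16, 32, 64]:
    r, c = best_2d_grid(p)
    # In a 2D SUMMA-style algorithm, per-process communication volume grows
    # roughly like n^2/r + n^2/c, which a square grid (r == c) minimizes.
    print(p, (r, c), "aspect ratio:", c / r)
```

Running this shows aspect ratio 1.0 at 4, 16, and 64 nodes but 2.0 at 8 and 32 nodes, which lines up with performance alternating between consecutive powers of two.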
For plain matrix multiplication, I believe recent libraries have been developed (https://dl.acm.org/doi/abs/10.1145/3295500.3356181) that outperform CTF in the distributed setting.
Closing this, as my question was resolved at that time.
Hi! I'm doing some experiments using CTF on the Lassen supercomputer. I've compiled CTF without any extensions and am using OpenBLAS as the BLAS library. I'm not seeing the performance I expect when weak scaling the examples/matmul program. I'm seeing this performance at the following node counts (using all 20 cores available on each node):
The invocation used at one node is:
Problem sizes were weak scaled from this size.
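For context, a minimal sketch of one common way to weak-scale a dense matmul: hold memory per node constant, so for square n x n matrices the dimension grows as sqrt(nodes). The base dimension n0 = 8192 here is a hypothetical value for illustration, not the size actually used in these runs:

```python
import math

def weak_scaled_dim(n0, nodes):
    # Per-node storage is about 3 * n^2 / nodes words, which stays
    # constant when n = n0 * sqrt(nodes).
    return int(round(n0 * math.sqrt(nodes)))

for nodes in [1, 2, 4, 8, 16]:
    print(nodes, weak_scaled_dim(8192, nodes))
```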
I didn't expect fluctuations of this sort, or such poor weak scaling. Is this expected behavior, or are there things to try to improve the performance here?