carstenbauer opened this issue 2 years ago
@chriselrod Are you able to reproduce this on some of our machines?
Not really on deapsea2, which I believe is 32 cores x 2 sockets:
julia> using LinearAlgebra
julia> M = K = N = 16_000;
julia> A = Matrix{Float64}(undef, M, K); B = Matrix{Float64}(undef, K, N); C = Matrix{Float64}(undef, M, N);
julia> let A=A; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;
julia> let A=B; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
996.8121355318401
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1052.3059335030646
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1158.2279430700944
julia> BLAS.set_num_threads(64);
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1424.0752368319172
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1597.7102493574532
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1599.5668391744443
julia> using MKL
julia> BLAS.get_num_threads()
64
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1566.9680236220754
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1688.7183359779663
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1694.5279138377305
julia> BLAS.set_num_threads(32);
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1168.313201204457
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1151.4661029505437
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1112.8236207523348
MKL and OpenBLAS both show about the same speedup when going from 32 to 64 threads. I used Threads.@threads to initialize the arrays because of first touch.
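For context on the numbers above: a dense M×K by K×N multiply performs roughly 2*M*K*N floating-point operations, so 2e-9*M*K*N divided by the elapsed time gives GFLOPS. A minimal sketch of the same setup with the timing wrapped in a helper (the threaded_randn! and gflops names are just for illustration):

```julia
using LinearAlgebra

# First-touch initialization: each Julia thread writes part of the array,
# so its pages land on the NUMA node of the core that touched them first.
function threaded_randn!(A)
    Threads.@threads for i in eachindex(A)
        @inbounds A[i] = randn()
    end
    return A
end

# dgemm does roughly 2*M*K*N flops; dividing by the elapsed time gives GFLOPS.
gflops(C, A, B) = 2e-9 * size(A, 1) * size(A, 2) * size(B, 2) / @elapsed(mul!(C, A, B))

M = K = N = 16_000
A = threaded_randn!(Matrix{Float64}(undef, M, K))
B = threaded_randn!(Matrix{Float64}(undef, K, N))
C = Matrix{Float64}(undef, M, N)

BLAS.set_num_threads(64)
@show gflops(C, A, B)
```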
On deapsea2, Octavian thinks we only have 32 cores, and thus refuses to use more than 32 threads. Switching back to Hwloc instead of CpuId would fix this:
julia> using Hwloc
julia> Hwloc.num_numa_nodes()
8
julia> Hwloc.num_physical_cores()
64
julia> Hwloc.num_packages()
2
If I recall correctly, we no longer need to support WINE, so I think switching back would be okay.
CpuId seems to figure out that we have 32 cores per socket:
julia> using CpuId, Octavian
julia> CpuId.cpucores()
32
julia> Octavian.num_cores()
static(32)
but it doesn't realize that we have two of them:
julia> CpuId.cpunodes()
1
julia> CpuId.cpucores_total()
32
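To make the mismatch concrete, here is a small sketch (nothing Octavian-specific; it just recombines the queries shown above) comparing the machine-wide physical core count from Hwloc's topology with what CpuId reports:

```julia
using Hwloc, CpuId

# Hwloc walks the whole topology, so packages (sockets) and cores
# can be combined into a machine-wide physical core count.
hwloc_cores = Hwloc.num_physical_cores()   # 64 on deapsea2
sockets     = Hwloc.num_packages()         # 2 on deapsea2

# CpuId (per the output above) only reports the per-socket core count here.
cpuid_cores = CpuId.cpucores()             # 32 on deapsea2

# A thread cap derived from CpuId alone would therefore be low by
# a factor of `sockets` on this machine.
@show hwloc_cores sockets cpuid_cores
@show sockets * cpuid_cores
```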
BTW, @carstenbauer also gave me access to the cluster he's running on, but I've been procrastinating on actually signing on and poking around.
julia> using LinearAlgebra
julia> BLAS.get_num_threads()
64
julia> M = K = N = 16_000;
julia> A = Matrix{Float64}(undef, M, K); B = Matrix{Float64}(undef, K, N); C = Matrix{Float64}(undef, M, N);
julia> let A=A; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;
julia> let A=B; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
982.1094539637073
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1016.5287429664286
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
993.1180919251972
julia> BLAS.set_num_threads(128);
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1253.0140262394689
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1224.9438668463697
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1216.3002770861356
julia> using MKL
julia> BLAS.get_num_threads()
64
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1299.188240131717
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1383.8168443837112
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1309.8513547840373
julia> BLAS.set_num_threads(128);
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1340.979151645549
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1308.8692178797796
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1325.0394458234089
julia> BLAS.get_num_threads()
64
This time, CpuId gets it right, and Hwloc gets it wrong???
julia> using Hwloc
julia> Hwloc.num_physical_cores()
64
julia> Hwloc.num_packages()
1
julia> using CpuId
julia> CpuId.cpucores()
64
julia> CpuId.cpucores_total()
128
FWIW, I ran (some of) the same benchmarks on a qualitatively different system (Noctua 2, AMD Milan Zen 3, 2x 64 cores) and also found that the default OpenBLAS doesn't scale. See https://github.com/carstenbauer/julia-dgemm-noctua#noctua-2. (Chris has access to both Noctua 1 and Noctua 2.)
(Might be able to add results from clusters at other HPC centers soon)
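For reference, a thread-count scaling sweep like the one behind that link could be scripted along these lines (a sketch only; scaling_sweep and nthreads_list are made-up names, and it assumes C, A, B are set up as in the sessions above):

```julia
using LinearAlgebra

# Sweep BLAS thread counts and record dgemm throughput in GFLOPS
# to see how well the currently loaded backend scales.
function scaling_sweep(C, A, B; nthreads_list = (1, 8, 16, 32, 64, 128))
    M, K, N = size(A, 1), size(A, 2), size(B, 2)
    results = Dict{Int,Float64}()
    for n in nthreads_list
        BLAS.set_num_threads(n)
        mul!(C, A, B)                         # warm-up at this thread count
        results[n] = 2e-9 * M * K * N / @elapsed(mul!(C, A, B))
    end
    return results
end

scaling_sweep(C, A, B)
```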
@chriselrod Which system was this run on? If it's a Noctua 2 login node, be aware that the login nodes on Noctua 2 are special in that they only have 1x64 physical cores but HT enabled (i.e. 128 virtual cores). OTOH, the compute nodes have 2x64 physical cores and HT disabled (i.e. 128 physical == virtual cores).
Also, for completeness, which Julia version and how many Julia threads are you using?
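As an aside, the login/compute distinction can be checked from within Julia using only the topology queries already shown plus Sys.CPU_THREADS; a quick sketch (the smt_enabled name is made up):

```julia
using Hwloc

physical = Hwloc.num_physical_cores()   # 64 on a Noctua 2 login node
threads  = Sys.CPU_THREADS              # 128 there, since HT is enabled

# More hardware threads than physical cores means SMT/HT is on (login node);
# on the compute nodes both numbers should be 128.
smt_enabled = threads > physical
@show physical threads smt_enabled Hwloc.num_packages()
```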
From https://github.com/carstenbauer/julia-dgemm-noctua/issues/2 (cc @ViralBShah):

I benchmarked a simple dgemm call (i.e. mul!(C,A,B)) on Noctua 1 (single and dual-socket Intel Xeon Gold "Skylake" 6148 20-Core CPUs) for multiple BLAS libraries (called from Julia using LBT). The custom OpenBLAS has been compiled with [...]

Primary observations/conclusions: [...]

[...] using OpenBLAS_jll and BLAS.lbt_forward(...; clear=true)? (It's still worse than a custom build of OpenBLAS though.) I hope we can improve the default OpenBLAS performance and scaling.
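For anyone reproducing the OpenBLAS_jll comparison above, the LBT forwarding step might look roughly like this (a sketch; it assumes Julia ≥ 1.7, where LinearAlgebra routes BLAS calls through libblastrampoline, and uses the JLL-provided OpenBLAS_jll.libopenblas_path):

```julia
using LinearAlgebra
using OpenBLAS_jll

# Point libblastrampoline (LBT) at the OpenBLAS binary shipped with Julia,
# clearing whatever backend (e.g. MKL) was forwarded before.
BLAS.lbt_forward(OpenBLAS_jll.libopenblas_path; clear=true)

BLAS.get_config()          # confirm which backend LBT now forwards to
BLAS.set_num_threads(64)   # then re-run the mul!(C, A, B) benchmark
```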