carstenbauer opened this issue 2 years ago
@chriselrod Are you able to reproduce this on some of our machines?
Not really on deapsea2, which I believe is 32 cores x 2 sockets:
julia> using LinearAlgebra
julia> M = K = N = 16_000;
julia> A = Matrix{Float64}(undef, M, K); B = Matrix{Float64}(undef, K, N); C = Matrix{Float64}(undef, M, N);
julia> let A=A; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;
julia> let A=B; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
996.8121355318401
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1052.3059335030646
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1158.2279430700944
julia> BLAS.set_num_threads(64);
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1424.0752368319172
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1597.7102493574532
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1599.5668391744443
julia> using MKL
julia> BLAS.get_num_threads()
64
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1566.9680236220754
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1688.7183359779663
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1694.5279138377305
julia> BLAS.set_num_threads(32);
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1168.313201204457
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1151.4661029505437
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1112.8236207523348
MKL and OpenBLAS both show about the same speedup when going from 32 to 64 threads. I used Threads.@threads to initialize the arrays because of first touch.
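For context on the numbers above: a dense M×K by K×N multiply performs roughly 2*M*K*N floating-point operations, so 2e-9*M*K*N divided by the elapsed time gives GFLOPS. A minimal sketch of the same setup with the timing wrapped in a helper (the threaded_randn! and gflops names are just for illustration):

```julia
using LinearAlgebra

# First-touch initialization: each Julia thread writes part of the array,
# so its pages land on the NUMA node of the core that touched them first.
function threaded_randn!(A)
    Threads.@threads for i in eachindex(A)
        @inbounds A[i] = randn()
    end
    return A
end

# dgemm does roughly 2*M*K*N flops; dividing by the elapsed time gives GFLOPS.
gflops(C, A, B) = 2e-9 * size(A, 1) * size(A, 2) * size(B, 2) / @elapsed(mul!(C, A, B))

M = K = N = 16_000
A = threaded_randn!(Matrix{Float64}(undef, M, K))
B = threaded_randn!(Matrix{Float64}(undef, K, N))
C = Matrix{Float64}(undef, M, N)

BLAS.set_num_threads(64)
@show gflops(C, A, B)
```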
On deapsea2, Octavian thinks we only have 32 cores, and thus refuses to use more than 32 threads. Switching back to Hwloc instead of CpuId would fix this:
julia> using Hwloc
julia> Hwloc.num_numa_nodes()
8
julia> Hwloc.num_physical_cores()
64
julia> Hwloc.num_packages()
2
If I recall correctly, we no longer need to support WINE, so I think switching back would be okay.
CpuId seems to figure out that we have 32 cores per socket:
julia> using CpuId, Octavian
julia> CpuId.cpucores()
32
julia> Octavian.num_cores()
static(32)
but it doesn't realize that we have two of them:
julia> CpuId.cpunodes()
1
julia> CpuId.cpucores_total()
32
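To make the mismatch concrete, here is a small sketch (nothing Octavian-specific; it just recombines the queries shown above) comparing the machine-wide physical core count from Hwloc's topology with what CpuId reports:

```julia
using Hwloc, CpuId

# Hwloc walks the whole topology, so packages (sockets) and cores
# can be combined into a machine-wide physical core count.
hwloc_cores = Hwloc.num_physical_cores()   # 64 on deapsea2
sockets     = Hwloc.num_packages()         # 2 on deapsea2

# CpuId (per the output above) only reports the per-socket core count here.
cpuid_cores = CpuId.cpucores()             # 32 on deapsea2

# A thread cap derived from CpuId alone would therefore be low by
# a factor of `sockets` on this machine.
@show hwloc_cores sockets cpuid_cores
@show sockets * cpuid_cores
```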
BTW, @carstenbauer also gave me access to the cluster he's running on, but I've been procrastinating on actually signing on and poking around.
julia> using LinearAlgebra
julia> BLAS.get_num_threads()
64
julia> M = K = N = 16_000;
julia> A = Matrix{Float64}(undef, M, K); B = Matrix{Float64}(undef, K, N); C = Matrix{Float64}(undef, M, N);
julia> let A=A; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;
julia> let A=B; Threads.@threads for i = eachindex(A); @inbounds A[i] = randn(); end; end;
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
982.1094539637073
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1016.5287429664286
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
993.1180919251972
julia> BLAS.set_num_threads(128);
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1253.0140262394689
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1224.9438668463697
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1216.3002770861356
julia> using MKL
julia> BLAS.get_num_threads()
64
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1299.188240131717
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1383.8168443837112
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1309.8513547840373
julia> BLAS.set_num_threads(128);
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1340.979151645549
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1308.8692178797796
julia> 2e-9M*K*N/@elapsed(mul!(C,A,B))
1325.0394458234089
julia> BLAS.get_num_threads()
64
This time, CpuId gets it right, and Hwloc gets it wrong???
julia> using Hwloc
julia> Hwloc.num_physical_cores()
64
julia> Hwloc.num_packages()
1
julia> using CpuId
julia> CpuId.cpucores()
64
julia> CpuId.cpucores_total()
128
FWIW, I ran (some of) the same benchmarks on a qualitatively different system (Noctua 2, AMD Milan Zen 3, 2x 64 cores) and also found that the default OpenBLAS doesn't scale. See https://github.com/carstenbauer/julia-dgemm-noctua#noctua-2. (Chris has access to both Noctua 1 and Noctua 2.)
(Might be able to add results from clusters at other HPC centers soon)
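For reference, a thread-count scaling sweep like the one behind that link could be scripted along these lines (a sketch only; scaling_sweep and nthreads_list are made-up names, and it assumes C, A, B are set up as in the sessions above):

```julia
using LinearAlgebra

# Sweep BLAS thread counts and record dgemm throughput in GFLOPS
# to see how well the currently loaded backend scales.
function scaling_sweep(C, A, B; nthreads_list = (1, 8, 16, 32, 64, 128))
    M, K, N = size(A, 1), size(A, 2), size(B, 2)
    results = Dict{Int,Float64}()
    for n in nthreads_list
        BLAS.set_num_threads(n)
        mul!(C, A, B)                         # warm-up at this thread count
        results[n] = 2e-9 * M * K * N / @elapsed(mul!(C, A, B))
    end
    return results
end

scaling_sweep(C, A, B)
```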
@chriselrod Which system was this run on? If it's a Noctua 2 login node, be aware that the login nodes on Noctua 2 are special in that they only have 1x64 physical cores but HT enabled (i.e. 128 virtual cores). OTOH, the compute nodes have 2x64 physical cores and HT disabled (i.e. 128 physical == virtual cores).
Also, for completeness, which Julia version and how many Julia threads are you using?
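As an aside, the login/compute distinction can be checked from within Julia using only the topology queries already shown plus Sys.CPU_THREADS; a quick sketch (the smt_enabled name is made up):

```julia
using Hwloc

physical = Hwloc.num_physical_cores()   # 64 on a Noctua 2 login node
threads  = Sys.CPU_THREADS              # 128 there, since HT is enabled

# More hardware threads than physical cores means SMT/HT is on (login node);
# on the compute nodes both numbers should be 128.
smt_enabled = threads > physical
@show physical threads smt_enabled Hwloc.num_packages()
```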
From https://github.com/carstenbauer/julia-dgemm-noctua/issues/2 (cc @ViralBShah):

I benchmarked a simple dgemm call (i.e. mul!(C,A,B)) on Noctua 1 (single and dual-socket Intel Xeon Gold "Skylake" 6148 20-Core CPUs) for multiple BLAS libraries (called from Julia using LBT). The custom OpenBLAS has been compiled with [...]

Primary observations/conclusions: [...]

[...] using OpenBLAS_jll and BLAS.lbt_forward(...; clear=true)? (It's still worse than a custom build of OpenBLAS though.) I hope we can improve the default OpenBLAS performance and scaling.
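For anyone reproducing the OpenBLAS_jll comparison above, the LBT forwarding step might look roughly like this (a sketch; it assumes Julia ≥ 1.7, where LinearAlgebra routes BLAS calls through libblastrampoline, and uses the JLL-provided OpenBLAS_jll.libopenblas_path):

```julia
using LinearAlgebra
using OpenBLAS_jll

# Point libblastrampoline (LBT) at the OpenBLAS binary shipped with Julia,
# clearing whatever backend (e.g. MKL) was forwarded before.
BLAS.lbt_forward(OpenBLAS_jll.libopenblas_path; clear=true)

BLAS.get_config()          # confirm which backend LBT now forwards to
BLAS.set_num_threads(64)   # then re-run the mul!(C, A, B) benchmark
```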