DilumAluthge opened 3 years ago
If you want, you could also add Tullio.jl to the benchmark comparisons.
So something like:
And do dense matrix-matrix multiplication for a wide range of matrix sizes, and plot the GFLOPS.
I'm also working on multithreading in PaddedMatrices, so I would be inclined to add that and Tullio.jl as well.
The machines I have access to, in order of speed, are:
I'm also working on multithreading in PaddedMatrices, so I would be inclined to add that and Tullio.jl as well.
That sounds great.
It would be nice to have not only regular plots (matrix size versus GFLOPS), but also log-log plots, so we can see what happens at really really big matrices. Gaius.jl has some log-log plots in the README (although they plot runtime instead of GFLOPS).
Also, it would be good to add the code (for running the benchmarks and generating the plots) to this repository (maybe in a folder named `bench`). That way other people can easily run the benchmarks on their own computers.
So all I have without AVX512 is the Haswell laptop.
I was experimenting with threading in PaddedMatrices. Unfortunately, the threading overhead -- at least of my implementation -- is extremely high.
Maybe there's a faster approach. A couple things I tried:
Example test:
using Base.Threads, BenchmarkTools
nthreads() > 1 || exit()
const XR = Ref(0)
const CHANNEL = Channel{Int}(1)
function f()
while true
XR[] += take!(CHANNEL)
end
end
t = Task(f)
t.sticky = true
ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, 1)
push!(@inbounds(Base.Workqueues[2]), t);
ccall(:jl_wakeup_thread, Cvoid, (Int16,), 1 % Int16)
@benchmark put!(CHANNEL, 1)
XR
The idea here is that we would create a pool of tasks, pinning them to specific threads.
For each one, we'd have a channel, and a function running a `while true` loop like `f` above, waiting on `take!` from the channel.
Then, as needed, we'd `put!` a tuple of pointers, strides, and array dims into a channel. Once the function succeeds in `take!`ing it, it'd run matmul on that block.
Alternatively, we can just create new tasks, which is what I've currently implemented in PaddedMatrices.jl:
function runfunc(t::Task, tid)
t.sticky = true
ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, tid)
push!(@inbounds(Base.Workqueues[tid+1]), t)
ccall(:jl_wakeup_thread, Cvoid, (Int16,), tid % Int16)
t
end
runfunc(func, tid) = runfunc(Task(func), tid)
g() = (XR[] += 1)
XR[] = 0;
@benchmark runfunc(g, 1)
XR
Note that these two examples aren't symmetric in the arguments. `g` has to be a closure capturing the pointers, strides, and dims. The argument it gets passed is which thread you want to run it on.
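As a concrete sketch of that calling convention: the helper below is hypothetical (`run_on` stands in for `runfunc` above, and `kernel!` for the real microkernel; neither name is from the thread), but it shows how the closure would capture the pointers, strides, and dims while the thread id is passed separately:

```julia
# Hypothetical sketch: capture everything the kernel needs in a closure, then
# hand that closure to a runner like `runfunc` above. `run_on(f, tid)` and
# `kernel!` are stand-ins, not real APIs from this thread.
function schedule_block!(run_on, kernel!, C, A, B, tid)
    pC, pA, pB = pointer(C), pointer(A), pointer(B)
    M, K, N = size(A, 1), size(A, 2), size(B, 2)
    # The closure captures the arrays and dims; `tid` only picks the thread.
    g = () -> GC.@preserve C A B kernel!(pC, pA, pB, M, K, N)
    run_on(g, tid)
end
```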
Benchmark results on the 10-core CPU:
julia> using Base.Threads, BenchmarkTools
julia> nthreads() > 1 || exit()
true
julia> const XR = Ref(0)
Base.RefValue{Int64}(0)
julia> const CHANNEL = Channel{Int}(1)
Channel{Int64}(1) (empty)
julia> function f()
while true
XR[] += take!(CHANNEL)
end
end
f (generic function with 1 method)
julia> t = Task(f)
Task (runnable) @0x00007f49c6e5d3c0
julia> t.sticky = true
true
julia> ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, 1)
julia> push!(@inbounds(Base.Workqueues[2]), t);
julia> ccall(:jl_wakeup_thread, Cvoid, (Int16,), 1 % Int16)
julia> @benchmark put!(CHANNEL, 1)
BenchmarkTools.Trial:
memory estimate: 56 bytes
allocs estimate: 1
--------------
minimum time: 2.029 μs (0.00% GC)
median time: 2.921 μs (0.00% GC)
mean time: 3.019 μs (0.00% GC)
maximum time: 6.396 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> XR
Base.RefValue{Int64}(590502)
julia> function runfunc(t::Task, tid)
t.sticky = true
ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, tid)
push!(@inbounds(Base.Workqueues[tid+1]), t)
ccall(:jl_wakeup_thread, Cvoid, (Int16,), tid % Int16)
t
end
runfunc (generic function with 1 method)
julia> runfunc(func, tid) = runfunc(Task(func), tid)
runfunc (generic function with 2 methods)
julia> g() = (XR[] += 1)
g (generic function with 1 method)
julia> XR[] = 0;
julia> @benchmark runfunc(g, 1)
BenchmarkTools.Trial:
memory estimate: 417 bytes
allocs estimate: 4
--------------
minimum time: 140.067 ns (0.00% GC)
median time: 428.747 ns (0.00% GC)
mean time: 742.016 ns (34.62% GC)
maximum time: 117.666 μs (99.45% GC)
--------------
samples: 7089
evals/sample: 950
julia> XR
Base.RefValue{Int64}(7197704)
Seems to be a clear win for the task approach. But overhead seems incredibly high in practice.
Any threading experts able to chime in on a way to make this faster?
I have code here in PaddedMatrices. It's descended from something @MasonProtter wrote; the `(StructVector ∘ map)(sizes) do sz` at least is from his original script.
But an example of what I'm seeing:
julia> include("/home/chriselrod/.julia/dev/PaddedMatrices/benchmark/blasbench.jl")
plot (generic function with 3 methods)
julia> mkl_set_num_threads(18)
julia> openblas_set_num_threads(18)
julia> M = K = N = 80;
julia> A = rand(M,K); B = rand(K,N); C2 = A * B; C1 = similar(C2);
julia> @time(jmul!(C1, A, B)) ≈ C2 # `jmul!` is single threaded; time to first matmul is pretty bad
20.126093 seconds (17.65 M allocations: 1.094 GiB, 2.31% gc time, 100.00% compilation time)
true
julia> @time(jmult!(C1, A, B)) ≈ C2 # `jmult!` is multithreaded; uses mostly the same code as `jmul!` so not much more needed
0.278228 seconds (289.96 k allocations: 16.887 MiB, 43.93% gc time, 99.96% compilation time)
true
julia> @benchmark jmul!($C1, $A, $B) # single thread PaddedMatrices
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 8.625 μs (0.00% GC)
median time: 8.660 μs (0.00% GC)
mean time: 8.673 μs (0.00% GC)
maximum time: 29.816 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 3
julia> 2e-9M*K*N / 8.625e-6
118.72463768115942
julia> @benchmark jmult!($C1, $A, $B) # multithreaded PaddedMatrices
BenchmarkTools.Trial:
memory estimate: 1.58 KiB
allocs estimate: 19
--------------
minimum time: 8.793 μs (0.00% GC)
median time: 11.280 μs (0.00% GC)
mean time: 11.680 μs (0.00% GC)
maximum time: 42.975 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> 2e-9M*K*N / 8.793e-6
116.45627203457295
julia> @benchmark gemmopenblas!($C1, $A, $B) # OpenBLAS is using too many threads here
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 15.108 μs (0.00% GC)
median time: 32.217 μs (0.00% GC)
mean time: 31.904 μs (0.00% GC)
maximum time: 70.254 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> 2e-9M*K*N / 15.108e-6
67.77866031241726
julia> @benchmark gemmmkl!($C1, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.479 μs (0.00% GC)
median time: 4.859 μs (0.00% GC)
mean time: 4.796 μs (0.00% GC)
maximum time: 13.709 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> 2e-9M*K*N / 3.479e-6
294.3374532911756
Just nowhere close to MKL in multithreaded performance.
We're in the same ballpark as single threaded MKL:
julia> mkl_set_num_threads(1)
julia> @benchmark gemmmkl!($C1, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 9.178 μs (0.00% GC)
median time: 9.225 μs (0.00% GC)
mean time: 9.242 μs (0.00% GC)
maximum time: 46.682 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> 2e-9M*K*N / 9.178e-6
111.57114839834387
But MKL gets a speedup from multithreading even at very small sizes, while PaddedMatrices.jl and OpenBLAS suffer from heavy threading overhead.
Also note that threading requires PaddedMatrices master. I haven't released it yet, as I want to fiddle with the threading thresholds a little more.
But to be fair, performance isn't great at larger sizes either, so I can't blame everything on overhead. lol
And trying 10_000x10_000:
julia> M = K = N = 10_000;
julia> A = rand(M,K); B = rand(K,N); C2 = @time(A*B); C1 = similar(C2);
1.322118 seconds (2 allocations: 762.940 MiB, 2.40% gc time)
julia> fill!(C1,NaN); @time(jmult!(C1, A, B)) ≈ C2
1.144560 seconds (109 allocations: 9.875 KiB)
true
julia> 2e-9M*K*N / 1.14456
1747.3963793947019
julia> fill!(C1,NaN); @time(gemmopenblas!(C1, A, B)); C1 ≈ C2
1.120673 seconds
true
julia> 2e-9M*K*N / 1.120673
1784.6419071397277
julia> fill!(C1,NaN); @time(gemmmkl!(C1, A, B)); C1 ≈ C2
0.958816 seconds
true
julia> 2e-9M*K*N / 0.958816
2085.905950672496
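All the GFLOPS figures in these transcripts come from the same formula used inline above (2·M·K·N floating-point operations per multiply, divided by runtime); as a small helper (the function name is mine):

```julia
# GFLOPS for an M×K times K×N matrix multiply completing in `seconds`:
gflops(M, K, N, seconds) = 2e-9 * M * K * N / seconds

gflops(80, 80, 80, 8.625e-6)  # single-threaded PaddedMatrices above, ≈ 118.7
gflops(80, 80, 80, 3.479e-6)  # MKL above, ≈ 294.3
```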
Could you add Gaius, Tullio, and Octavian to that plot?
I'm not anticipating that Octavian's performance is very good at this point; I'm just curious where we are.
I ran those benchmarks last night. I'm creating a library now to better organize BLAS benchmarks.
Is this single-threaded or multi-threaded?
PaddedMatrices is holding up great, even at large matrix sizes.
The Tullio performance here seems a bit worse than in previous benchmarks (e.g. https://github.com/mcabbott/Tullio.jl/blob/master/benchmarks/02/matmul-0.2.5-Float64-1.5.0.png).
Octavian is slow, as expected, although surprising that it catches up to Gaius near the end.
Multi-threaded. The plot you shared only goes to 10^3 = 1000. Tullio only starts lagging beyond that point on my computer.
Impressive how close Tullio was in that benchmark. Note that PaddedMatrices was single threaded in it.
The performance ceiling on my cpu is also much higher. At 1000x1000, MKL was at around 1500 GFLOPS vs roughly 300 GFLOPS in that chart. I guess having three times the cores (18 vs 6) makes it more challenging to use them.
Also, another experiment: Replicating the Task API with channels:
using Base.Threads
nthreads() > 1 || exit()
const FCHANNEL = [Channel{Ptr{Cvoid}}(1) for _ ∈ 2:nthreads()];
function crun(chn)
f = take!(chn)
ccall(f, Cvoid, ())
end
struct Runner
tid::Int
end
function (r::Runner)()
chn = FCHANNEL[r.tid]
while true
crun(chn)
end
end
function crunfunc(f, tid)
put!(FCHANNEL[tid], Base.unsafe_convert(Ptr{Cvoid}, @cfunction($f, Cvoid, ())))
end
for i ∈ 1:nthreads()-1
t = Task(Runner(i))
t.sticky = true
ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, i % Cint)
push!(@inbounds(Base.Workqueues[i+1]), t);
ccall(:jl_wakeup_thread, Cvoid, (Int16,), i % Int16)
end
const XR = Ref{Int}(0);
g() = (XR[] += 1; nothing)
Gets the same performance as the other channel example, but would make it easier to try and switch between them.
Aside from the worse benchmarks -- which might just be because the `put!` is blocking / not representative of the matmul workloads -- I also didn't initially try the channel approach because it seemed trickier to use for a variety of input types. Obviously some adjustments still need to be made, like either using a `Channel{CFunction}` or making sure I can `GC.@preserve` the closure for its lifetime.
A smaller range of benchmarks on my laptop (4 cores), and the results look a lot more like Tullio's.
Except that my CPU is too new for OpenBLAS, so it uses antique kernels.
Octavian is slow, as expected, although surprising that it catches up to Gaius near the end.
Following the suggestions I outlined should help. The reason it eventually caught up with Gaius is that packing the arrays like Octavian does becomes increasingly important as the matrices get larger. MKL, OpenBLAS, and PaddedMatrices also pack.
Can you generate the same plots for Int32 and Int64?
I remember that Gaius does really well on integer matrices, so I'm curious to see how PaddedMatrices does.
I'll do so when I finish tuning.
BLAS does not support integer matrices, so anything based on LoopVectorization will be very fast compared to alternatives.
`LinearAlgebra.mul!` will call a generic fallback.
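To see the fallback in action (a generic illustration, not from the thread): `mul!` on integer matrices still works, it just can't dispatch to BLAS, so it hits LinearAlgebra's generic matmul.

```julia
using LinearAlgebra

A = rand(1:10, 64, 64)
B = rand(1:10, 64, 64)
C = similar(A)
mul!(C, A, B)  # Int eltype, so this uses the generic fallback, not BLAS
C == A * B     # results agree exactly for integers
```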
BLAS does not support integer matrices
Just out of curiosity, is this only the case for the OpenBLAS that ships with Julia?
Or is it the case that, in general, OpenBLAS does not support integer matrices?
It's true in general. MKL does not either. FWIW, I also use the OpenBLAS that ships with Julia (OpenBLAS_jll) for the benchmarks.
Returning to Floats:
According to the bigger plot that goes up to 10k, it really does seem like PaddedMatrices is able to hold its own against OpenBLAS and MKL even for really big matrices.
In your experiments, have you found a point yet when PaddedMatrices starts to lag?
Would be interesting to maybe have a log log plot with insanely big matrices (100k, maybe even 1 million??) to see if PaddedMatrices ever lags.
Because at least up to 10k, the difference between PaddedMatrices and OpenBLAS is comparable to the difference between OpenBLAS and MKL. In other words, all three are in the same "league" of performance... at least up to 10k.
I uploaded much better looking multithreading benchmarks here a few days ago, but I've still been experimenting with further improvements.
Also, RAM requirements for dense matrices of various sizes:
julia> (10_000^2 * 8 * 3) / (1 << 30)
2.2351741790771484
julia> (100_000^2 * 8 * 3) / (1 << 30)
223.51741790771484
julia> (1_000_000^2 * 8 * 3) / (1 << 30)
22351.741790771484
My computer has 128 GiB (or maybe it's GB?) of RAM. If 120 GiB is used for three matrices, those matrices could be up to around 73k x 73k:
julia> sqrt(120 * (1<<30) / 24)
73271.47548671311
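The back-of-envelope arithmetic above generalizes to a couple of one-liners (the names are mine):

```julia
# GiB needed for three dense N×N Float64 matrices (A, B, and C):
ram_gib(N) = N^2 * 8 * 3 / (1 << 30)

# Largest square size whose three Float64 matrices fit in a given GiB budget:
max_square(gib) = floor(Int, sqrt(gib * (1 << 30) / 24))

ram_gib(10_000)  # ≈ 2.24 GiB, as computed above
max_square(120)  # 73271, matching the sqrt computation above
```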
You've only gone and bloody done it.
PaddedMatrices really does seem to be on par with OpenBLAS.
But it varies by computer! I'll try again on the Haswell laptop before I issue a release, but performance has always been bad there.
those matrices could be up to around 73k x 73k:
Might as well try it ;)
Sure, I'll try. Note that even with 2000 GFLOPS, it'd take almost six and a half minutes:
julia> 2e-9 * 73_000^3 / 2000
389.017
My computer started overheating, so I took apart the water block and am cleaning it. Hopefully that solves the problem. There was a small amount of gunk in the fins, maybe that significantly blocked water from cooling the fins and thus the CPU. I want to install quick release fasteners, and have a couple arriving tomorrow, so I'll put it back together then. I should be able to resume benchmarking (and tuning) by Friday. If the problem isn't solved, I'll have to switch computers (from the 10980XE to the 7980XE, both have 18 cores, but the latter is slower and has never been able to hit the magic [round] 2000 GFLOPS in actual benchmarks).
EDIT: Parts did not actually arrive on 01/07:
In Transit, Arriving Late Your package will arrive later than expected, but is still on its way. It is currently in transit to the next facility.
If they're not here by Friday night, I'll just put it back together anyway and resume benchmarking. The parts (quick release fasteners) would only make taking it apart and putting back together easier, so I don't need them to actually run it.
EDIT: I did put it back together without the fasteners. It ran cool at first, but within a few hours of benchmarking it began to overheat again. =/ I'll take it back apart and this time flush everything in hopes of it taking a while before clogging.
This was on my 7980XE, which is clocked lower than my 10980XE was (which was featured in earlier benchmarks), in case anyone notices the GFLOPS are generally a bit lower for MKL and OpenBLAS. Still having hardware trouble with the 10980XE.
The 7980XE was running at 3.4 GHz, which puts the theoretical peak GFLOPS at
julia> 3.4 * (8+8)*2 * 18 # 3.4 GHz (G cycles/core) * (8 adds/fma + 8 multiplications/fma) * 2 fma/cycle * 18 cores
1958.3999999999999
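The same peak calculation as a reusable sketch (the function name is mine; the per-FMA flop count and FMAs/cycle are the figures from the comment above):

```julia
# Theoretical peak GFLOPS:
#   GHz × (2 flops per FMA lane × SIMD lanes) × FMAs/cycle × cores
peak_gflops(ghz, fma_width, fma_per_cycle, cores) =
    ghz * 2fma_width * fma_per_cycle * cores

peak_gflops(3.4, 8, 2, 18)  # ≈ 1958.4, the 7980XE figure above
```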
I can run the benchmarks on my Ryzen 4900HS if there's an updated benchmark script you can point me towards.
https://github.com/chriselrod/BLASBenchmarksCPU.jl
The docs also have a section on disabling turbo on Linux. I don't know other OSes, but it's definitely recommended for the sake of consistent multithreaded benchmarks.
Another option is just going into the BIOS, which I did for the desktop computers I benchmark on. Not as nice for laptops, since you'd want to be able to re-enable boosting without rebooting on a laptop (I just have it off permanently on the desktops).
There's also optionally a `sleep_time` keyword argument to `runbenchmark` for how long to sleep between benchmarks if you can't disable turbo. Probably slower and less reliable.
Ryzen 4000 series processors can apparently be boosted/de-boosted from within Windows 10's power options menu. Benchmarks are running now -- Octavian is looking superb at small-medium sizes so far.
My machine decided to restart in the middle of the night for a Windows update, so I'm re-running them, but the benchmarks are so good, it's hard to believe my eyes.
edit: added a run from N=1000-20000, where Octavian is no longer the clear winner (but within a stone's throw from the state of the art!)
edit 2: one more abridged benchmark, looking at the effect of setting `julia -t 8` (corresponding to the number of physical cores) instead of `-t 16`. The main takeaway seems to be that Tullio.jl is happier with nthreads=ncores, while Octavian can wring a bit more out of hyperthreading.
Here are some preliminary results on my Ryzen 5 2600. I must say I’m absolutely floored! Kudos @chriselrod and @DilumAluthge, this is amazing work!
The main takeaway seems to be that Tullio.jl is happier with nthreads=ncores, while Octavian can wring a bit more out of hyperthreading.
Actually, Octavian only uses up to `ncores` threads. BLAS kernels tend to be able to max out a core's execution units by themselves, so they don't benefit from hyperthreading.
But they need a lot of cache, so they tend to do better the less they have to compete for cache space -- hence, one thread per core.
The block of `A` being multiplied aims to take up about this fraction of the L2 cache:
julia> Octavian.R₁Default
0.5900561503730485
So two of them would take up about 120%, which would cause problems ;).
Of course, if tuned to take advantage of hyperthreading, these numbers would be smaller.
However, we'd really prefer it to be big: the bigger it is, the less often we have to iterate over sub-blocks of `B`. We have to iterate over each sub-block of `B` once per block of `A`.
And this is the fundamental problem with hyperthreading in BLAS: the more cache we can use, the less memory-related overhead we have.
A probably smaller issue is that for larger matrices, the different threads also have to start synchronizing. More threads means more synchronization overhead.
Because lots of code can benefit from hyperthreading, it makes sense to just limit the number of threads Octavian uses, so you don't need to penalize any other code you're running by starting Julia with fewer threads than that code wants.
@MasonProtter If you had doubts about upgrading to Zen3, compare the FLOPS the Zen2 laptop chip gets compared to your Zen1 desktop ;).
If you have hours of CPU time to burn, you could also try running this and see if it comes up with fairly different looking numbers than what it started with. Just set the time limit to how long you're willing to let it run. (Of course, very important to disable turbo and ensure consistent timing).
If you do, I can add an "if Zen1/Zen2" check to use those numbers instead. Eventually, it'd be good to improve the model it uses to decide blocking so we hopefully no longer need different values for different CPUs, but for now that's unfortunately necessary.
Performance on the Ryzen 4900HS seems to start dipping a lot after 400x400 though, which is before that blocking behavior comes into play. That may be when it begins to pack A but not yet pack B -- it could probably use some tuning there as well.
In comparison, the Ryzen 5 2600's benchmark results are beautifully consistent, although they stopped at 500x500.
If you had doubts about upgrading to Zen3, compare the FLOPS the Zen2 laptop chip gets compared to your Zen1 desktop ;).
Lord, I know. I want it so bad.
In comparison, the Ryzen 5 2600's benchmark results are beautifully consistent, although they stopped at 500x500.
Updated to 750 x 750 now. I accidentally killed that julia session this morning without saving the dataframe to disk, so that's the most that graph will grow for now. I'll do a run from 751:1000 tonight though. Currently running the parameter optimizer.
Way ahead of you -- I already got that going last night! I'm away from the computer right now, but I'll let you know what it converged to once I'm home.
I did a 4 hour run and here's the result. I'll try doing a longer run later:
julia> opt.initial_x
4-element Vector{Float64}:
0.1
0.15989396641218157
0.4203583148344484
0.6344856142604789
julia> opt.minimizer
4-element Vector{Float64}:
0.053918949422353986
0.3013238122374886
0.6077103834481342
0.8775382433240162
Great, thanks. The file currently says (I'll edit it soon, with the goal of making it so that you can compile Octavian into a sysimage on one computer, move it to another, and still have things work correctly):
if VectorizationBase.AVX512F
const W₁Default = 0.006089395198610773
const W₂Default = 0.7979822724696168
const R₁Default = 0.5900561503730485
const R₂Default = 0.762152930709678
else
const W₁Default = 0.1 # TODO: relax bounds; this was the upper bound set for the optimizer.
const W₂Default = 0.15989396641218157
const R₁Default = 0.4203583148344484
const R₂Default = 0.6344856142604789
end
Those values are fairly different from both of the above.
When you do restart it, be sure to start with `opt.minimizer` as your new initial values.
Out of curiosity, what was the difference in the objective between `initial_x` and `opt.minimizer` (i.e., by how much did the GFLOPS improve)?
Updated to 750 x 750 now. I accidentally killed that julia session this morning without saving the dataframe to disk, so that's the most that graph will grow for now. I'll do a run from 751:1000 tonight though. Currently running the parameter optimizer.
Cool. I ran it for like 3 days on two computers, but I think the returns diminish pretty fast. It'll never converge, and will just chase noise forever. Not sure if there's a better optimization method when your objective is noisy. BlackBoxOptim doesn't let you specify initial values, so it'd take a lot longer.
Good work guys, some very impressive graphs here.
Octavian only uses up to ncores threads.
Are you getting this from Hwloc.jl? (And is this recommended over CpuId.jl?)
Good work guys, some very impressive graphs here.
The two keys to performance at small sizes are:
The latter does not have an API yet, but you may want to consider it for Tullio.jl, as it should let you match Octavian.jl until we get to matrices big enough to require packing.
julia> using Octavian
julia> M = K = N = 48;
julia> A = rand(M,K); B = rand(K,N); C2 = @time(A*B); C1 = similar(C2);
0.585104 seconds (2.36 M allocations: 127.083 MiB, 20.39% gc time, 99.94% compilation time)
julia> @time(matmul!(C1,A,B)) ≈ C2
17.015580 seconds (17.86 M allocations: 1.094 GiB, 3.00% gc time)
true
julia> @benchmark matmul!($C1,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.240 μs (0.00% GC)
median time: 1.322 μs (0.00% GC)
mean time: 1.325 μs (0.00% GC)
maximum time: 6.318 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> 2e-9M*K*N / 1.24e-6 # calculate GFLOPS
178.37419354838713
julia> M = K = N = 72;
julia> A = rand(M,K); B = rand(K,N); C2 = @time(A*B); C1 = similar(C2);
0.000462 seconds (2 allocations: 40.578 KiB)
julia> @time(matmul!(C1,A,B)) ≈ C2
0.000051 seconds
true
julia> @benchmark matmul!($C1,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.112 μs (0.00% GC)
median time: 2.242 μs (0.00% GC)
mean time: 2.257 μs (0.00% GC)
maximum time: 8.763 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> 2e-9M*K*N / 2.112e-6 # calculate GFLOPS
353.45454545454555
julia> M = K = N = 144;
julia> A = rand(M,K); B = rand(K,N); C2 = @time(A*B); C1 = similar(C2);
0.000151 seconds (2 allocations: 162.078 KiB)
julia> @time(matmul!(C1,A,B)) ≈ C2
0.000132 seconds
true
julia> @benchmark matmul!($C1,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.343 μs (0.00% GC)
median time: 5.702 μs (0.00% GC)
mean time: 5.720 μs (0.00% GC)
maximum time: 24.244 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 6
julia> 2e-9M*K*N / 5.343e-6 # calculate GFLOPS
1117.71813587872
julia> versioninfo()
Julia Version 1.7.0-DEV.367
Commit b4547666b3* (2021-01-24 14:14 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
julia> @benchmark wait(Threads.@spawn nothing) # just spawning a thread and waiting on it
BenchmarkTools.Trial:
memory estimate: 443 bytes
allocs estimate: 4
--------------
minimum time: 1.811 μs (0.00% GC)
median time: 7.741 μs (0.00% GC)
mean time: 9.088 μs (0.00% GC)
maximum time: 57.478 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
LoopVectorization handles small matrices really well and has dominated single threaded benchmarks for small matrices for quite a while. So all we need is low overhead threading to win multithreaded benchmarks. But just spawning a single task takes 9 microseconds on average, over 50% longer than it takes to multiply two 144x144 matrices which -- as you can tell from the fact it's getting over 1 teraflop -- requires more than 1 thread.
`@spawn` is just way too slow. It allocates memory, and the high variance in how long it takes hurts too.
Are you getting this from Hwloc.jl?
Yes. VectorizationBase.jl gets it from Hwloc.jl.
On the latest release, it defines a global constant `NUM_CORES`, but to get around issues of building VectorizationBase into a sysimage, it's changing to a generated function very soon:
julia> VectorizationBase.num_cores()
4
julia> @code_typed VectorizationBase.num_cores()
CodeInfo(
1 ─ return 4
) => Int64
that should still be constant, but just not compile until called (and thus hopefully avoid getting baked into a sysimage).
Oh nice, I should see if I can understand ThreadingUtilities.jl (at some point). From my much cruder tuning, I think I delayed multi-threading until size 70 or so, because of this `@spawn` overhead. But you're clearly getting a benefit long before that.
Does ThreadingUtilities.jl threading lose composability with task based threaded code ?
Does ThreadingUtilities.jl threading lose composability with task based threaded code ?
"Yes."
Note that if you call `Octavian.matmul!` from a thread, it will run single-threaded. This is probably what you want anyway, in the case of matrix multiply.
I think more broadly, in the cases where you want low-overheaded threading, you're probably happy to not thread the code if it's nested inside other threaded code.
I put yes in quotes because ThreadingUtilities.jl doesn't implement a scheduler. Currently, it gives you the ability to run code on specific threads. So you control the behavior, and can make it do what you want.
Oh nice, I should see if I can understand ThreadingUtilities.jl (at some point). From my much cruder tuning, I think I delayed multi-threading until size 70 or so, because of this `@spawn` overhead. But you're clearly getting a benefit long before that.
Octavian currently delays multithreading while `M*K*N < 13824 * W`.
If the matrices are square and their eltype is `Float64`, then that is
julia> cbrt(13824 * 8)
48.0
julia> cbrt(13824 * 4)
38.097625247236785
48x48 and 39x39 for AVX512 and AVX, respectively.
It could turn on sooner than that. There is a sizeable jump in all the charts when it turns on.
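That threshold check can be sketched as follows (the function name is mine; `W` is the SIMD vector width in elements, 8 for AVX-512 `Float64` and 4 for AVX):

```julia
# Multithread only once the problem is big enough to amortize threading overhead:
should_multithread(M, K, N, W) = M * K * N >= 13824 * W

should_multithread(48, 48, 48, 8)  # true:  48³ == 13824 * 8 (AVX-512, Float64)
should_multithread(38, 38, 38, 4)  # false: 38³ <  13824 * 4 (AVX, Float64)
should_multithread(39, 39, 39, 4)  # true:  39 is the first threaded square size
```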
I ended up with the following after 8 hours:
julia> opt.minimizer
4-element Vector{Float64}:
0.1
0.993489411720157
0.6052218809954467
0.7594052633561165
julia> opt.minimum
-233.16960038079367
...compared to ~209 gflops before tuning. Strange that it's so different from @MasonProtter's optimum on close to the same architecture.
Out of curiosity, what was the difference in the objective between initial_x and opt.minimizer (i.e., by how much did the gflops improve)?
@chriselrod Here you go:
julia> matmul_objective(opt.initial_x)
Params: [0.1, 0.15989396641218157, 0.4203583148344484, 0.6344856142604789]; 137.2842774319363
-137.2842774319363
julia> matmul_objective(opt.minimizer)
Params: [0.053918949422353986, 0.3013238122374886, 0.6077103834481342, 0.8775382433240162]; 141.2232512813757
-141.2232512813757
Squeezed out an extra 4 GFLOPS, not bad.
Strange that it's so different from MasonProtter's optimum on close to the same architecture.
Nah, my CPU is Zen+ which is basically Zen 1. Architecturally, it's very different from Zen 2 and Zen 3.
julia> opt.minimizer
4-element Vector{Float64}:
0.1
The upper bound of 0.1 isn't a hard bound. The first two parameters are just > 0, while the last two are between 0 and 1.
If you feel like running it again, you could increase the upper bound on the first parameter to 0.2, or maybe even higher.
...compared to ~209 gflops before tuning. Strange that it's so different from @MasonProtter's optimum on close to the same architecture.
That's quite the improvement, but even 233 looks worse than what we see in the graph, where it seemed to be at 240 for the larger sizes.
Strange that it's so different from MasonProtter's optimum on close to the same architecture.
The major difference between Zen1 and Zen2 is that Zen1 only had 256-bit SIMD units, and emulated 256-bit by having two 128-bit units work together. Meaning it had half the throughput for 256-bit instructions. Zen2 can do two 128 bit fma/cycle, or two 256 bit fma/cycle. Zen1 can do two 128 bit fma/cycle, or one 256 bit fma/cycle.
So in terms of tuning, that means Zen1 is bottlenecked more by execution units than by memory.
Low values of the first two numbers mean that Zen1 likes larger values of `K_c`; it wants to spend more time in each particular microkernel call. This comes at the cost of iterating over the matrices more often, but if you're bottlenecked most by execution units, perhaps that doesn't matter.
I poked through the benchmark code, and it seems to be averaging over a cache-size-dependent range of matrix sizes. Depending on how those are distributed, it's easy enough to imagine the small, relatively low-throughput matrices dragging the average down.
I restarted it from the prior optimum with the first parameter's upper bound set to 0.3. Will update tomorrow morning.
edit: optimizer seems to get hung up on local optima when using the default of 9 particles. I'll try bumping the number of particles to 4^3 so a bigger chunk of the search space gets seeded.
Last night's run arrived at a very different optimum. I typo'd the initialization, so it ran with only three particles (I think?). It's hard to be too confident in these results when there's so much noise in the objective.
julia> opt2.minimizer
4-element Vector{Float64}:
0.2559526669806126
0.01983770991788305
0.8217366097415788
0.7832307773033691
julia> opt2.minimum
-228.20806080690062
I wonder if it would be good to get some initial benchmarks done to see how we compare to OpenBLAS, MKL, and Gaius.jl.
@chriselrod I think you have access to a wider range of machines than me. Want to write up some basic benchmarks and run them on a variety of machines, and then generate plots showing the GFLOPS of Octavian versus OpenBLAS, MKL, and Gaius.jl?