JuliaLinearAlgebra / Octavian.jl

Multi-threaded BLAS-like library that provides pure Julia matrix multiplication
https://julialinearalgebra.github.io/Octavian.jl/stable/

Some initial benchmark comparisons? #24

Open DilumAluthge opened 3 years ago

DilumAluthge commented 3 years ago

I wonder if it would be good to get some initial benchmarks done to see how we compare to OpenBLAS, MKL, and Gaius.jl.

@chriselrod I think you have access to a wider range of machines than me. Want to write up some basic benchmarks and run them on a variety of machines, and then generate plots showing the GFLOPS of Octavian versus OpenBLAS, MKL, and Gaius.jl?

DilumAluthge commented 3 years ago

If you want, you could also add Tullio.jl to the benchmark comparisons.

So something like:

And do dense matrix-matrix multiplication for a wide range of matrix sizes, and plot the GFLOPS.
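
A minimal sketch of what such a sweep could look like (the package choice, sizes, and use of Octavian's matmul! as the entry point are my assumptions, not a prescribed setup):

using BenchmarkTools, Octavian

function gflops_sweep(sizes)
    gflops = Float64[]
    for n in sizes
        A = rand(n, n); B = rand(n, n); C = similar(A)
        t = @belapsed matmul!($C, $A, $B)  # seconds per multiply
        push!(gflops, 2e-9 * n^3 / t)      # 2*M*K*N flops, reported as GFLOPS
    end
    return gflops
end

gflops_sweep([10, 50, 100, 500, 1000])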

chriselrod commented 3 years ago

I'm also working on multithreading in PaddedMatrices, so I would be inclined to add that and Tullio.jl as well.

The machines I have access to, in order of speed, are:

  1. Haswell laptop: 2 cores, 1.7 GHz, AVX2.
  2. Tiger Lake laptop: 4 cores, boosts up to 4.7 GHz (but it probably won't come close to that when running threaded BLAS), AVX512 with 1 FMA unit.
  3. Skylake-AVX512: 10 cores; I didn't really mess with the BIOS settings.
  4. Skylake-AVX512: 18 cores, 3.7 GHz AVX512 all-core (or maybe it was 3.6 GHz). Theoretically should hit >2 teraflops.
  5. Cascadelake: 18 cores, 4.1 GHz AVX512 all-core. MKL does in fact hit over 2 teraflops.
DilumAluthge commented 3 years ago

I'm also working on multithreading in PaddedMatrices, so I would be inclined to add that and Tullio.jl as well.

That sounds great.

It would be nice to have not only regular plots (matrix size versus GFLOPS), but also log-log plots, so we can see what happens at really really big matrices. Gaius.jl has some log-log plots in the README (although they plot runtime instead of GFLOPS).
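
For the log-log version, something like this could work (Plots.jl and the placeholder data are assumptions; the real script would use the measured GFLOPS):

using Plots

sizes = [10, 50, 100, 500, 1000]
gflops = [5.0, 40.0, 90.0, 150.0, 160.0]  # placeholder values
plot(sizes, gflops;
     xscale = :log10, yscale = :log10,
     xlabel = "matrix size", ylabel = "GFLOPS", label = "Octavian.jl")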

DilumAluthge commented 3 years ago

Also, it would be good to add the code (for running the benchmarks and generating the plots) to this repository (maybe in a folder named bench). That way other people can easily run the benchmarks on their own computers.

chriselrod commented 3 years ago

So all I have without AVX512 is the Haswell laptop.

I was experimenting with threading in PaddedMatrices. Unfortunately, the threading overhead -- at least of my implementation -- is extremely high.

Maybe there's a faster approach. A couple of things I tried:

Thread Pool, wake up/give tasks with channels

Example test:

using Base.Threads, BenchmarkTools
nthreads() > 1 || exit()
const XR = Ref(0)
const CHANNEL = Channel{Int}(1)

function f()
    while true
        XR[] += take!(CHANNEL)
    end
end

t = Task(f)
t.sticky = true

ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, 1)
push!(@inbounds(Base.Workqueues[2]), t);
ccall(:jl_wakeup_thread, Cvoid, (Int16,), 1 % Int16)

@benchmark put!(CHANNEL, 1)
XR

The idea here is that we would create a pool of tasks, pinning them to specific threads. For each one, we'd have a channel, and a function running a while true loop like f above, waiting on take! from the channel.

Then, as needed, we'd put! a tuple of pointers, strides, and array dims into a channel. Once the function succeeds in take!ing it, it'd run matmul on that block.
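
A hypothetical sketch of that design, with a stub standing in for the real microkernel (MatMulWork, WORK_CHANNEL, and run_block! are illustrative names, not existing APIs; strides are omitted for brevity):

struct MatMulWork
    Cptr::Ptr{Float64}
    Aptr::Ptr{Float64}
    Bptr::Ptr{Float64}
    M::Int
    K::Int
    N::Int
end

const WORK_CHANNEL = Channel{MatMulWork}(1)

run_block!(w::MatMulWork) = nothing  # stub; the real version would run the packed kernel on the block

function blockworker()
    while true
        run_block!(take!(WORK_CHANNEL))
    end
end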

Alternatively, we can just create new tasks, which is what I've currently implemented in PaddedMatrices.jl:

function runfunc(t::Task, tid)
    t.sticky = true
    ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, tid)
    push!(@inbounds(Base.Workqueues[tid+1]), t)
    ccall(:jl_wakeup_thread, Cvoid, (Int16,), tid % Int16)
    t
end
runfunc(func, tid) = runfunc(Task(func), tid)
g() = (XR[] += 1)
XR[] = 0;
@benchmark runfunc(g, 1)
XR

Note that these two examples aren't symmetric in their arguments: g has to be a closure capturing the pointers, strides, and dims, and the argument passed to runfunc is the id of the thread you want it to run on.
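
For example (assuming the runfunc defined above and a session started with more than one thread; mul! here is just a stand-in for the real blocked kernel):

using LinearAlgebra

A = rand(100, 100); B = rand(100, 100); C = similar(A);
work = () -> (mul!(C, A, B); nothing)  # closure captures the arrays it operates on
runfunc(work, 1)                       # the only argument passed in is the target thread id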

Benchmark results on the 10-core CPU:

julia> using Base.Threads, BenchmarkTools

julia> nthreads() > 1 || exit()
true

julia> const XR = Ref(0)
Base.RefValue{Int64}(0)

julia> const CHANNEL = Channel{Int}(1)
Channel{Int64}(1) (empty)

julia> function f()
           while true
               XR[] += take!(CHANNEL)
           end
       end
f (generic function with 1 method)

julia> t = Task(f)
Task (runnable) @0x00007f49c6e5d3c0

julia> t.sticky = true
true

julia> ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, 1)

julia> push!(@inbounds(Base.Workqueues[2]), t);

julia> ccall(:jl_wakeup_thread, Cvoid, (Int16,), 1 % Int16)

julia> @benchmark put!(CHANNEL, 1)
BenchmarkTools.Trial:
  memory estimate:  56 bytes
  allocs estimate:  1
  --------------
  minimum time:     2.029 μs (0.00% GC)
  median time:      2.921 μs (0.00% GC)
  mean time:        3.019 μs (0.00% GC)
  maximum time:     6.396 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> XR
Base.RefValue{Int64}(590502)

julia> function runfunc(t::Task, tid)
           t.sticky = true
           ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, tid)
           push!(@inbounds(Base.Workqueues[tid+1]), t)
           ccall(:jl_wakeup_thread, Cvoid, (Int16,), tid % Int16)
           t
       end
runfunc (generic function with 1 method)

julia> runfunc(func, tid) = runfunc(Task(func), tid)
runfunc (generic function with 2 methods)

julia> g() = (XR[] += 1)
g (generic function with 1 method)

julia> XR[] = 0;

julia> @benchmark runfunc(g, 1)
BenchmarkTools.Trial:
  memory estimate:  417 bytes
  allocs estimate:  4
  --------------
  minimum time:     140.067 ns (0.00% GC)
  median time:      428.747 ns (0.00% GC)
  mean time:        742.016 ns (34.62% GC)
  maximum time:     117.666 μs (99.45% GC)
  --------------
  samples:          7089
  evals/sample:     950

julia> XR
Base.RefValue{Int64}(7197704)

Seems to be a clear win for the task approach. But the overhead still seems incredibly high in practice.

Are any threading experts able to chime in on a way to make this faster?

chriselrod commented 3 years ago

I have benchmark code here in PaddedMatrices. It's descended from something @MasonProtter wrote; the (StructVector ∘ map)(sizes) do sz part, at least, is from his original script.

chriselrod commented 3 years ago

But an example of what I'm seeing:

julia> include("/home/chriselrod/.julia/dev/PaddedMatrices/benchmark/blasbench.jl")
plot (generic function with 3 methods)

julia> mkl_set_num_threads(18)

julia> openblas_set_num_threads(18)

julia> M = K = N = 80;

julia> A = rand(M,K); B = rand(K,N); C2 = A * B; C1 = similar(C2);

julia> @time(jmul!(C1, A, B)) ≈ C2 # `jmul!` is single threaded; time to first matmul is pretty bad
 20.126093 seconds (17.65 M allocations: 1.094 GiB, 2.31% gc time, 100.00% compilation time)
true

julia> @time(jmult!(C1, A, B)) ≈ C2 # `jmult!` is multithreaded; uses mostly the same code as `jmul!` so not much more needed
  0.278228 seconds (289.96 k allocations: 16.887 MiB, 43.93% gc time, 99.96% compilation time)
true

julia> @benchmark jmul!($C1, $A, $B) # single thread PaddedMatrices
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.625 μs (0.00% GC)
  median time:      8.660 μs (0.00% GC)
  mean time:        8.673 μs (0.00% GC)
  maximum time:     29.816 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> 2e-9M*K*N / 8.625e-6
118.72463768115942

julia> @benchmark jmult!($C1, $A, $B) # multithreaded PaddedMatrices
BenchmarkTools.Trial:
  memory estimate:  1.58 KiB
  allocs estimate:  19
  --------------
  minimum time:     8.793 μs (0.00% GC)
  median time:      11.280 μs (0.00% GC)
  mean time:        11.680 μs (0.00% GC)
  maximum time:     42.975 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 8.793e-6
116.45627203457295

julia> @benchmark gemmopenblas!($C1, $A, $B) # OpenBLAS is using too many threads here
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.108 μs (0.00% GC)
  median time:      32.217 μs (0.00% GC)
  mean time:        31.904 μs (0.00% GC)
  maximum time:     70.254 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 15.108e-6
67.77866031241726

julia> @benchmark gemmmkl!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.479 μs (0.00% GC)
  median time:      4.859 μs (0.00% GC)
  mean time:        4.796 μs (0.00% GC)
  maximum time:     13.709 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> 2e-9M*K*N / 3.479e-6
294.3374532911756

Just nowhere close to MKL in multithreaded performance.

We're in the same ballpark as single threaded MKL:

julia> mkl_set_num_threads(1)

julia> @benchmark gemmmkl!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.178 μs (0.00% GC)
  median time:      9.225 μs (0.00% GC)
  mean time:        9.242 μs (0.00% GC)
  maximum time:     46.682 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 9.178e-6
111.57114839834387

But MKL gets a speedup from multithreading even at very small sizes, while PaddedMatrices.jl and OpenBLAS suffer from heavy threading overhead.

Also note that threading requires PaddedMatrices master. I haven't released it yet, as I want to fiddle with the threading thresholds a little more.

chriselrod commented 3 years ago

But to be fair, performance isn't great at larger sizes either, so I can't blame everything on overhead. lol

(plot: gemmFloat64_2_4000_cascadelake_AVX512_18thread)

And trying 10_000x10_000:

julia> M = K = N = 10_000;

julia> A = rand(M,K); B = rand(K,N); C2 = @time(A*B); C1 = similar(C2);
  1.322118 seconds (2 allocations: 762.940 MiB, 2.40% gc time)

julia> fill!(C1,NaN); @time(jmult!(C1, A, B)) ≈ C2
  1.144560 seconds (109 allocations: 9.875 KiB)
true

julia> 2e-9M*K*N / 1.14456
1747.3963793947019

julia> fill!(C1,NaN); @time(gemmopenblas!(C1, A, B)); C1 ≈ C2
  1.120673 seconds
true

julia> 2e-9M*K*N / 1.120673
1784.6419071397277

julia> fill!(C1,NaN); @time(gemmmkl!(C1, A, B)); C1 ≈ C2
  0.958816 seconds
true

julia> 2e-9M*K*N / 0.958816
2085.905950672496
DilumAluthge commented 3 years ago

Could you add Gaius, Tullio, and Octavian to that plot?

I'm not anticipating that Octavian's performance is very good at this point; I'm just curious where we are.

chriselrod commented 3 years ago

I ran those benchmarks last night. I'm creating a library now to better organize BLAS benchmarks.

chriselrod commented 3 years ago

(plot: gemm_Float64_10_10000_cascadelake_AVX512__multithreaded_logscale)

https://github.com/chriselrod/BLASBenchmarks.jl

DilumAluthge commented 3 years ago

Is this single-threaded or multi-threaded?

PaddedMatrices is holding up great, even at large matrix sizes.

The Tullio performance here seems a bit worse than in previous benchmarks (e.g. https://github.com/mcabbott/Tullio.jl/blob/master/benchmarks/02/matmul-0.2.5-Float64-1.5.0.png).

Octavian is slow, as expected, although surprising that it catches up to Gaius near the end.

chriselrod commented 3 years ago

Multi-threaded. The plot you shared only goes to 10^3 = 1000. Tullio only starts lagging beyond that point on my computer.

Impressive how close Tullio was in that benchmark. Note that PaddedMatrices was single threaded in it.

The performance ceiling on my CPU is also much higher. At 1000x1000, MKL was at around 1500 GFLOPS vs roughly 300 GFLOPS in that chart. I guess having three times the cores (18 vs 6) makes it more challenging to use them.

chriselrod commented 3 years ago

Also, another experiment: Replicating the Task API with channels:

using Base.Threads
nthreads() > 1 || exit()
const FCHANNEL = [Channel{Ptr{Cvoid}}(1) for _ ∈ 2:nthreads()];
function crun(chn)
    f = take!(chn)
    ccall(f, Cvoid, ())
end
struct Runner
    tid::Int
end
function (r::Runner)()
    chn = FCHANNEL[r.tid]
    while true
        crun(chn)
    end
end

function crunfunc(f, tid)
    put!(FCHANNEL[tid], Base.unsafe_convert(Ptr{Cvoid}, @cfunction($f, Cvoid, ())))
end

for i ∈ 1:nthreads()-1
    t = Task(Runner(i))
    t.sticky = true
    ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, i % Cint)
    push!(@inbounds(Base.Workqueues[i+1]), t);
    ccall(:jl_wakeup_thread, Cvoid, (Int16,), i % Int16)
end

const XR = Ref{Int}(0);
g() = (XR[] += 1; nothing)

This gets the same performance as the other channel example, but it would make it easier to switch between the two approaches.

Aside from the worse benchmarks -- which might just be because the put! is blocking / not representative of the matmul workloads -- I also didn't initially try the channel approach because it seemed trickier to use for a variety of input types. Obviously some adjustments still need to be made, like either using a Channel{CFunction} or making sure I can GC.@preserve the closure for its lifetime.
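
A hedged sketch of the GC.@preserve idea, reusing XR and FCHANNEL from the snippet above: the point is to keep the CFunction object itself rooted (not just the raw pointer) until the worker is done with it.

h() = (XR[] += 1; nothing)
cf = @cfunction($h, Cvoid, ())  # returns a Base.CFunction wrapping the closure
GC.@preserve cf begin
    put!(FCHANNEL[1], Base.unsafe_convert(Ptr{Cvoid}, cf))
    # ... block here until the worker thread has finished calling it ...
end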

chriselrod commented 3 years ago

(plot: gemm_Float64_10_1000_tigerlake_AVX512__multithreaded_logscale)

A smaller range of benchmarks on my laptop (4 cores); the results look a lot more like Tullio's.

Except that my CPU is too new for OpenBLAS, so it uses antique kernels.

Octavian is slow, as expected, although surprising that it catches up to Gaius near the end.

Following the suggestions I outlined should help. The reason it eventually caught up with Gaius is that packing the arrays like Octavian does becomes increasingly important as the matrices get larger. MKL, OpenBLAS, and PaddedMatrices also pack.
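
For anyone unfamiliar with the term, a minimal sketch of what packing means here (not Octavian's actual code): copy a block of A into a small contiguous buffer so the microkernel reads unit-stride, cache-resident memory instead of strided columns of a huge matrix.

function pack_block!(buf::AbstractMatrix, A::AbstractMatrix, rows, cols)
    @inbounds for (jb, j) in enumerate(cols), (ib, i) in enumerate(rows)
        buf[ib, jb] = A[i, j]
    end
    return buf
end

A = rand(4000, 4000)
Abuf = Matrix{Float64}(undef, 256, 128)
pack_block!(Abuf, A, 1:256, 1:128)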

DilumAluthge commented 3 years ago

Can you generate the same plots for Int32 and Int64?

I remember that Gaius does really well on integer matrices, so I'm curious to see how PaddedMatrices does.

chriselrod commented 3 years ago

I'll do so when I finish tuning.

BLAS does not support integer matrices, so anything based on LoopVectorization will be very fast compared to alternatives.

LinearAlgebra.mul! will call a generic fallback.
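
A quick way to see that (an illustrative check of mine, not from the thread):

using LinearAlgebra

A = rand(Int32(1):Int32(100), 200, 200); B = rand(Int32(1):Int32(100), 200, 200); C = similar(A);
mul!(C, A, B)  # dispatches to the generic Julia matmul fallback, not an optimized BLAS gemm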

DilumAluthge commented 3 years ago

BLAS does not support integer matrices

Just out of curiosity, is this only the case for the OpenBLAS that ships with Julia?

Or is it the case that, in general, OpenBLAS does not support integer matrices?

chriselrod commented 3 years ago

It's true in general; MKL does not support them either. FWIW, for the benchmarks I also use the OpenBLAS that ships via OpenBLAS_jll.

DilumAluthge commented 3 years ago

Returning to Floats:

According to the bigger plot that goes up to 10k, it really does seem like PaddedMatrices is able to hold its own against OpenBLAS and MKL even for really big matrices.

In your experiments, have you found a point yet when PaddedMatrices starts to lag?

Would be interesting to maybe have a log-log plot with insanely big matrices (100k, maybe even 1 million??) to see if PaddedMatrices ever lags.

Because at least up to 10k, the difference between PaddedMatrices and OpenBLAS is comparable to the difference between OpenBLAS and MKL. In other words, all three are in the same "league" of performance... at least up to 10k.

chriselrod commented 3 years ago

I uploaded much better looking multithreading benchmarks here a few days ago, but I've still been experimenting with further improvements.

Also, RAM requirements for dense matrices of various sizes:

julia> (10_000^2 * 8 * 3) / (1 << 30)
2.2351741790771484

julia> (100_000^2 * 8 * 3) / (1 << 30)
223.51741790771484

julia> (1_000_000^2 * 8 * 3) / (1 << 30)
22351.741790771484

My computer has 128 GiB (or maybe it's GB?) of RAM. If 120 GiB is used for three matrices, those matrices could be up to around 73k x 73k:

julia> sqrt(120 * (1<<30) / 24)
73271.47548671311
DilumAluthge commented 3 years ago

You've only gone and bloody done it.

DilumAluthge commented 3 years ago

PaddedMatrices really does seem to be on par with OpenBLAS.

chriselrod commented 3 years ago

But it varies by computer! I'll try again on the Haswell laptop before I issue a release, but performance has always been bad there.

DilumAluthge commented 3 years ago

those matrices could be up to around 73k x 73k:

Might as well try it ;)

chriselrod commented 3 years ago

Sure, I'll try. Note that even with 2000 GFLOPS, it'd take almost six and a half minutes:

julia> 2e-9 * 73_000^3 / 2000
389.017

My computer started overheating, so I took apart the water block and am cleaning it. Hopefully that solves the problem. There was a small amount of gunk in the fins, maybe that significantly blocked water from cooling the fins and thus the CPU. I want to install quick release fasteners, and have a couple arriving tomorrow, so I'll put it back together then. I should be able to resume benchmarking (and tuning) by Friday. If the problem isn't solved, I'll have to switch computers (from the 10980XE to the 7980XE, both have 18 cores, but the latter is slower and has never been able to hit the magic [round] 2000 GFLOPS in actual benchmarks).

EDIT: Parts did not actually arrive on 01/07:

In Transit, Arriving Late Your package will arrive later than expected, but is still on its way. It is currently in transit to the next facility.

If they're not here by Friday night, I'll just put it back together anyway and resume benchmarking. The parts (quick release fasteners) would only make taking it apart and putting back together easier, so I don't need them to actually run it.

EDIT: I did put it back together without the fasteners. It ran cool at first, but within a few hours of benchmarking it began to overheat again. =/ I'll take it back apart and this time flush everything in hopes of it taking a while before clogging.

chriselrod commented 3 years ago

(benchmark plot image)

chriselrod commented 3 years ago

This was on my 7980XE, which is clocked lower than my 10980XE was (which was featured in earlier benchmarks), in case anyone notices the GFLOPS are generally a bit lower for MKL and OpenBLAS. Still having hardware trouble with the 10980XE.

The 7980XE was running at 3.4 GHz, which puts the theoretical peak GFLOPS at

julia> 3.4 * (8+8)*2 * 18 # 3.4 GHz (G cycles/core) * (8 adds/fma + 8 multiplications/fma) * 2 fma/cycle * 18 cores
1958.3999999999999
stillyslalom commented 3 years ago

I can run the benchmarks on my Ryzen 4900HS if there's an updated benchmark script you can point me towards.

chriselrod commented 3 years ago

https://github.com/chriselrod/BLASBenchmarksCPU.jl

The docs also have a section on disabling turbo on Linux. I don't know about other OSes, but it's definitely recommended for the sake of consistent multithreaded benchmarks. Another option is just going into the BIOS, which I did for the desktop computers I benchmark on. That's not as nice for laptops, since you'd want to be able to re-enable boosting without rebooting (I just have it off permanently on the desktops). There's also an optional sleep_time keyword argument to runbenchmark that controls how long to sleep between benchmarks if you can't disable turbo, but that's probably slower and less reliable.

stillyslalom commented 3 years ago

Ryzen 4000 series processors can apparently be boosted/de-boosted from within Windows 10's power options menu. Benchmarks are running now -- Octavian is looking superb at small-to-medium sizes so far.

stillyslalom commented 3 years ago

My machine decided to restart in the middle of the night for a Windows update, so I'm re-running them, but the benchmarks are so good, it's hard to believe my eyes.

(benchmark plot image)

stillyslalom commented 3 years ago

(plot: gemm_Float64_2_4000_znver2_AVX2__multithreaded_logscale)

edit: added a run from N=1000-20000, where Octavian is no longer the clear winner (but within a stone's throw from the state of the art!)

(plot: gemm_Float64_1000_20000_znver2_AVX2__multithreaded_logscale)

edit 2: one more abridged benchmark, looking at the effect of setting julia -t 8 (corresponding to the number of physical cores) instead of -t 16. The main takeaway seems to be that Tullio.jl is happier with nthreads=ncores, while Octavian can wring a bit more out of hyperthreading.

(plot: gemm_Float64_100_4000_znver2_AVX2__multithreaded_logscale)

MasonProtter commented 3 years ago

Here are some preliminary results on my Ryzen 5 2600. I must say I'm absolutely floored! Kudos @chriselrod and @DilumAluthge, this is amazing work!

(plot: gemm_Float64_1_750_znver1_AVX2__multithreaded_logscale)

chriselrod commented 3 years ago

The main takeaway seems to be that Tullio.jl is happier with nthreads=ncores, while Octavian can wring a bit more out of hyperthreading.

Actually, Octavian only uses up to ncores threads. BLAS kernels tend to be able to max-out a core's execution units by themselves, so they don't benefit from hyperthreading.

But they need a lot of cache, so they tend to do better the less they have to compete for cache space -- hence, one thread per core. The block of A being multiplied aims to take up about this fraction of the L2 cache:

julia> Octavian.R₁Default
0.5900561503730485

So two of them would take up about 120%, which would cause problems ;).

Of course, if tuned to take advantage of hyperthreading, these numbers would be smaller. However, we'd really prefer the block to be big: the bigger it is, the less often we have to iterate over sub-blocks of B (we have to iterate over them once per block of A). And this is the fundamental problem with hyperthreading in BLAS: the more cache we can use, the less memory-related overhead we have.

A probably smaller issue is that for larger matrices, the different threads also have to start synchronizing. More threads means more synchronization overhead.

Because lots of code can benefit from hyperthreading, it makes sense to just limit the number of threads Octavian uses, so you don't have to penalize any other code you're running by starting Julia with fewer threads than that code wants.

@MasonProtter If you had doubts about upgrading to Zen3, compare the FLOPS the Zen2 laptop chip gets with your Zen1 desktop ;).

chriselrod commented 3 years ago

If you have hours of CPU time to burn, you could also try running this and see if it comes up with fairly different looking numbers than what it started with. Just set the time limit to how long you're willing to let it run. (Of course, very important to disable turbo and ensure consistent timing).

If you do, I can add an "if Zen1/Zen2" check to use those numbers instead. Eventually, it'd be good to improve the model it uses to decide blocking so that we hopefully no longer need different values for different CPUs, but for now that's unfortunately necessary.

Performance on the Ryzen 4900HS seems to start dipping a lot after 400x400 though, which is before that blocking behavior comes into play. That may be when it begins to pack A but not yet pack B -- it could probably use some tuning there as well.

In comparison, the Ryzen 5 2600's benchmark results are beautifully consistent, although they stopped at 500x500.

MasonProtter commented 3 years ago

If you had doubts about upgrading to Zen3, compare the FLOPS the Zen2 laptop chip gets compared to your Zen1 desktop ;).

Lord, I know. I want it so bad.

In comparison, the Ryzen 5 2600's benchmark results are beautifully consistent, although they stopped at 500x500.

Updated to 750 x 750 now. I accidentally killed that julia session this morning without saving the dataframe to disk, so that's the most that graph will grow for now. I'll do a run from 751:1000 tonight though. Currently running the parameter optimizer.

stillyslalom commented 3 years ago

Way ahead of you – I already got that going last night! I’m away from the computer right now, but I’ll let you know what it converged to once I’m home.

MasonProtter commented 3 years ago

I did a 4 hour run and here's the result. I'll try doing a longer run later:

julia> opt.initial_x
4-element Vector{Float64}:
 0.1
 0.15989396641218157
 0.4203583148344484
 0.6344856142604789

 julia> opt.minimizer
4-element Vector{Float64}:
 0.053918949422353986
 0.3013238122374886
 0.6077103834481342
 0.8775382433240162
chriselrod commented 3 years ago

Great, thanks. The file currently says (I'll edit it soon, with the goal of making it so that you can compile Octavian into a sysimage on one computer, move it to another, and still have things work correctly):


if VectorizationBase.AVX512F
    const W₁Default = 0.006089395198610773
    const W₂Default = 0.7979822724696168
    const R₁Default = 0.5900561503730485
    const R₂Default = 0.762152930709678
else
    const W₁Default = 0.1 # TODO: relax bounds; this was the upper bound set for the optimizer.
    const W₂Default = 0.15989396641218157
    const R₁Default = 0.4203583148344484
    const R₂Default = 0.6344856142604789
end

Those values are fairly different from both of the above.

When you do restart it, be sure to start with opt.minimizer as your new initial values.

Out of curiosity, what was the difference in the objective between initial_x and opt.minimizer (i.e., by how much did the gflops improve)?

Updated to 750 x 750 now. I accidentally killed that julia session this morning without saving the dataframe to disk, so that's the most that graph will grow for now. I'll do a run from 751:1000 tonight though. Currently running the parameter optimizer.

Cool. I ran it for like 3 days on two computers, but I think the returns diminish pretty fast. It'll never converge; it just chases noise forever. Not sure if there's a better optimization method when your objective is noisy. BlackBoxOptim doesn't let you specify initial values, so it'd take a lot longer.

mcabbott commented 3 years ago

Good work guys, some very impressive graphs here.

Octavian only uses up to ncores threads.

Are you getting this from Hwloc.jl? (And is this recommended over CpuId.jl?)

chriselrod commented 3 years ago

Good work guys, some very impressive graphs here.

The two keys to performance at small sizes are:

  1. LoopVectorization.jl
  2. ThreadingUtilities.jl

The latter does not have an API yet, but you may want to consider it for Tullio.jl, as it should let you match Octavian.jl until we get to matrices big enough to require packing.

julia> using Octavian

julia> M = K = N = 48;

julia> A = rand(M,K); B = rand(K,N); C2 = @time(A*B); C1 = similar(C2);
  0.585104 seconds (2.36 M allocations: 127.083 MiB, 20.39% gc time, 99.94% compilation time)

julia> @time(matmul!(C1,A,B)) ≈ C2
 17.015580 seconds (17.86 M allocations: 1.094 GiB, 3.00% gc time)
true

julia> @benchmark matmul!($C1,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.240 μs (0.00% GC)
  median time:      1.322 μs (0.00% GC)
  mean time:        1.325 μs (0.00% GC)
  maximum time:     6.318 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> 2e-9M*K*N / 1.24e-6 # calculate GFLOPS
178.37419354838713

julia> M = K = N = 72;

julia> A = rand(M,K); B = rand(K,N); C2 = @time(A*B); C1 = similar(C2);
  0.000462 seconds (2 allocations: 40.578 KiB)

julia> @time(matmul!(C1,A,B)) ≈ C2
  0.000051 seconds
true

julia> @benchmark matmul!($C1,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.112 μs (0.00% GC)
  median time:      2.242 μs (0.00% GC)
  mean time:        2.257 μs (0.00% GC)
  maximum time:     8.763 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> 2e-9M*K*N / 2.112e-6 # calculate GFLOPS
353.45454545454555

julia> M = K = N = 144;

julia> A = rand(M,K); B = rand(K,N); C2 = @time(A*B); C1 = similar(C2);
  0.000151 seconds (2 allocations: 162.078 KiB)

julia> @time(matmul!(C1,A,B)) ≈ C2
  0.000132 seconds
true

julia> @benchmark matmul!($C1,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.343 μs (0.00% GC)
  median time:      5.702 μs (0.00% GC)
  mean time:        5.720 μs (0.00% GC)
  maximum time:     24.244 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> 2e-9M*K*N / 5.343e-6 # calculate GFLOPS
1117.71813587872

julia> versioninfo()
Julia Version 1.7.0-DEV.367
Commit b4547666b3* (2021-01-24 14:14 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

julia> @benchmark wait(Threads.@spawn nothing) # just spawning a thread and waiting on it
BenchmarkTools.Trial:
  memory estimate:  443 bytes
  allocs estimate:  4
  --------------
  minimum time:     1.811 μs (0.00% GC)
  median time:      7.741 μs (0.00% GC)
  mean time:        9.088 μs (0.00% GC)
  maximum time:     57.478 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

LoopVectorization handles small matrices really well and has dominated single-threaded benchmarks for small matrices for quite a while. So all we need is low-overhead threading to win multithreaded benchmarks. But just spawning a single task takes 9 microseconds on average, over 50% longer than it takes to multiply two 144x144 matrices, which -- as you can tell from the fact that it's hitting over 1 teraflop -- already requires more than one thread. @spawn is just way too slow. The memory it allocates and the high variance in how long it takes hurt too.

Are you getting this from Hwloc.jl?

Yes. VectorizationBase.jl gets it from Hwloc.jl. On the latest release, it defines a global constant NUM_CORES, but to get around issues of building VectorizationBase into a sysimage, it's changing to a generated function very soon:

julia> VectorizationBase.num_cores()
4

julia> @code_typed VectorizationBase.num_cores()
CodeInfo(
1 ─     return 4
) => Int64

that should still be constant, but just not compile until called (and thus hopefully avoid getting baked into a sysimage).
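
A standalone sketch of that pattern (a hypothetical example of mine, not VectorizationBase's actual code): the generator body runs at first compilation, so the result folds to a constant, but nothing gets baked in until the function is actually called.

@generated function num_cores_sketch()
    n = 4  # the real package would query this via Hwloc at generation time
    return :($n)
end

num_cores_sketch()  # returns 4; @code_typed shows a bare `return 4`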

mcabbott commented 3 years ago

Oh nice, I should see if I can understand ThreadingUtilities.jl (at some point). From my much cruder tuning, I think I delayed multi-threading until size 70 or so, because of this @spawn overhead. But you're clearly getting a benefit long before that.

AriMKatz commented 3 years ago

Does ThreadingUtilities.jl threading lose composability with task based threaded code ?

chriselrod commented 3 years ago

Does ThreadingUtilities.jl threading lose composability with task based threaded code ?

"Yes." Note that if you call Octavian.matmul! from a thread, it will run single-threaded. This is probably what you want anyway, in the case of matrix multiply. I think more broadly, in the cases where you want low-overheaded threading, you're probably happy to not thread the code if it's nested inside other threaded code.

I put yes in quotes because ThreadingUtilities.jl doesn't implement a scheduler. Currently, it gives you the ability to run code on specific threads. So you control the behavior, and can make it do what you want.

Oh nice, I should see if I can understand ThreadingUtilities.jl (at some point). From my much cruder tuning, I think I delayed multi-threading until size 70 or so, because of this @spawn overhead. But you're clearly getting a benefit long before that.

Octavian currently stays single-threaded while

(M*K*N < (13824 * W))

If the matrices are square and their eltype is Float64, then that is

julia> cbrt(13824 * 8)
48.0

julia> cbrt(13824 * 4)
38.097625247236785

48x48 and 39x39 for AVX512 and AVX, respectively.

It could turn on sooner than that. There is a sizeable jump in all the charts when it turns on.
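
Spelled out, the check is roughly this (W is the number of Float64 elements per SIMD vector: 8 with AVX512, 4 with AVX2; the exact comparison here is my paraphrase):

W = 8
multithread(M, K, N) = M * K * N ≥ 13824 * W
multithread(48, 48, 48), multithread(47, 47, 47)  # (true, false) with AVX512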

stillyslalom commented 3 years ago

I ended up with the following after 8 hours:

julia> opt.minimizer
4-element Vector{Float64}:
 0.1
 0.993489411720157
 0.6052218809954467
 0.7594052633561165

julia> opt.minimum
-233.16960038079367

...compared to ~209 GFLOPS before tuning. Strange that it's so different from @MasonProtter's optimum on close to the same architecture.

MasonProtter commented 3 years ago

Out of curiosity, what was the difference in the objective between initial_x and opt.minimizer (i.e., by how much did the gflops improve)?

@chriselrod Here you go:

julia> matmul_objective(opt.initial_x)
Params: [0.1, 0.15989396641218157, 0.4203583148344484, 0.6344856142604789]; 137.2842774319363
-137.2842774319363

julia> matmul_objective(opt.minimizer)
Params: [0.053918949422353986, 0.3013238122374886, 0.6077103834481342, 0.8775382433240162]; 141.2232512813757
-141.2232512813757

Squeezed out an extra 4 GFLOPS, not bad.

Strange that it's so different from MasonProtter's optimum on close to the same architecture.

Nah, my CPU is Zen+ which is basically Zen 1. Architecturally, it's very different from Zen 2 and Zen 3.

chriselrod commented 3 years ago
julia> opt.minimizer
4-element Vector{Float64}:
 0.1

The upper bound of 0.1 isn't a hard bound. The first two parameters are just > 0, while the last two are between 0 and 1. If you feel like running it again, you could increase the upper bound on the first parameter to 0.2, or maybe even higher.

...compared to ~209 gflops before tuning. Strange that it's so different from @MasonProtter's optimum on close to the same architecture.

That's quite the improvement, but even 233 looks worse than what we see in the graph, where it seemed to be at 240 for the larger sizes.

Strange that it's so different from MasonProtter's optimum on close to the same architecture.

The major difference between Zen1 and Zen2 is that Zen1 only had 128-bit SIMD units and emulated 256-bit operations by having two 128-bit units work together, meaning it had half the throughput for 256-bit instructions. Zen2 can do two 128-bit FMAs/cycle or two 256-bit FMAs/cycle; Zen1 can do two 128-bit FMAs/cycle, or one 256-bit FMA/cycle.

So in terms of tuning, that means Zen1 is bottlenecked more by execution units than by memory.

Low values of the first two numbers mean that Zen1 likes larger values of K_c; it wants to spend more time in each particular microkernel call. This comes at the cost of iterating over the matrices more often, but if you're bottlenecked mostly by execution units, perhaps that doesn't matter.
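
Back-of-the-envelope peak Float64 throughput per core implied by that description (256 bits = 4 lanes, an FMA counts as 2 flops per lane):

zen1_flops_per_cycle = 1 * 4 * 2  # one 256-bit FMA per cycle  -> 8
zen2_flops_per_cycle = 2 * 4 * 2  # two 256-bit FMAs per cycle -> 16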

stillyslalom commented 3 years ago

I poked through the benchmark code, and it seems to be averaging over a cache-size-dependent range of matrix sizes. Depending on how those are distributed, it's easy enough to imagine the small, relatively low-throughput matrices dragging the average down.

I restarted it from the prior optimum with the first parameter's upper bound set to 0.3. Will update tomorrow morning.

edit: the optimizer seems to get hung up on local optima when using the default of 9 particles. I'll try bumping the number of particles to 4^3 so a bigger chunk of the search space gets seeded.

stillyslalom commented 3 years ago

Last night's run arrived at a very different optimum. I typo'd the initialization, so it ran with only three particles (I think?). It's hard to be too confident in these results when there's so much noise in the objective.

julia> opt2.minimizer
4-element Vector{Float64}:
 0.2559526669806126
 0.01983770991788305
 0.8217366097415788
 0.7832307773033691

julia> opt2.minimum
-228.20806080690062