ChrisRackauckas closed this issue 10 months ago.
OpenBLAS looks drunk.
OpenBLAS is just multithreading at way too small sizes. RFLU looks like a good option. Is the (pre)compile time impact reasonable?
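One rough way to gauge the (pre)compile-time impact is to time the load and the first solve in a fresh session. This is just a sketch; the exact latency split depends on the LinearSolve version and whether RecursiveFactorization is a direct dependency or an extension.

```julia
# Run in a fresh Julia session so compile caches are cold.
@time using LinearSolve                 # package load time

A = rand(100, 100); b = rand(100)
prob = LinearProblem(A, b)

@time solve(prob, RFLUFactorization())  # first call: compilation dominates
@time solve(prob, RFLUFactorization())  # second call: runtime only
```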
A Mac version:
```julia
using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve

nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
                         isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                             innerflop
                         end
                     end
        invflop + scaleflop + updateflop
    end
end

algs = [LUFactorization(), GenericLUFactorization(), RFLUFactorization(),
        AppleAccelerateLUFactorization(), FastLUFactorization(), SimpleLUFactorization()]
res = [Float64[] for i in 1:length(algs)]

ns = 4:8:500
for i in 1:length(ns)
    n = ns[i]
    @info "$n × $n"
    rng = MersenneTwister(123)
    global A = rand(rng, n, n)
    global b = rand(rng, n)
    global u0 = rand(rng, n)

    for j in 1:length(algs)
        bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A = true, alias_b = true))
        push!(res[j], luflop(n) / bt / 1e9)
    end
end

using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)

p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N",
         title = "GFLOPs for NxN LU Factorization",
         label = string(Symbol(parameterless_type(algs[1]))), legend = :outertopright)
for i in 2:length(res)
    plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p

savefig("lubench.png")
savefig("lubench.pdf")
```
```
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
```
It's interesting that RFLU does poorly on Mac. Does it not know about the lower vectorization width?
Ideally you should have a `tune` or `plan` API, which can do the tuning on a user's system. Applications that care about it can opt into a tuning run and save these preferences. If something like a switchable BLAS package gets done, this sort of tuning can be even easier.
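A minimal sketch of what such a `tune` step could look like (all names here are hypothetical, not an existing LinearSolve API): benchmark the candidates on the user's machine at a few sizes and record the winner, so a solve front end could consult the saved plan.

```julia
using LinearAlgebra

# Hypothetical tune-style helper (made-up names, not a LinearSolve API):
# time candidate factorizations at a few sizes on this machine and
# record the fastest per size.
function tune_factorizations(ns = (16, 64, 256))
    candidates = (lu = A -> lu(A), qr = A -> qr(A))  # stand-ins for real algs
    plan = Dict{Int,Symbol}()
    for n in ns
        A = rand(n, n)
        best, bestt = :none, Inf
        for (name, f) in pairs(candidates)
            f(A)  # warm up / compile before timing
            t = minimum(@elapsed(f(A)) for _ in 1:3)
            if t < bestt
                best, bestt = name, t
            end
        end
        plan[n] = best
    end
    return plan
end

plan = tune_factorizations()  # Dict mapping size => fastest candidate
```

A real version would persist `plan` via Preferences.jl so the tuning run only happens once per machine.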
Running the Mac-specific version from the [above comment]()
```
julia +beta --startup-file=no --proj
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.0-beta1 (2023-07-25)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
```
```
(tmp) pkg> st
Status `~/.julia/dev/tmp/Project.toml`
  [13e28ba4] AppleAccelerate v0.4.0
  [6e4b80f9] BenchmarkTools v1.3.2
  [7ed4a6bd] LinearSolve v2.5.0
  [dde4c033] Metal v0.5.0
  [91a5bcdd] Plots v1.38.17
  [3d5dd08c] VectorizationBase v0.21.64
```
```
julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_NUM_PRECOMPILE_TASKS = 8
  JULIA_DEPOT_PATH = /Users/vp/.julia
  JULIA_PKG_DEVDIR = /Users/vp/.julia/dev
```
```
Julia Version 1.11.0-DEV.235
Commit 9f9e989f24 (2023-08-06 04:35 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 6 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4,1
```
Looks like RFLU is really good for small sizes but isn't doing that well once all your threads have non-trivial sized problems.
I wonder if that's a trend of Intel vs AMD. Need more data.
Here is:
```
julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
```
Might make sense to remove the two lowest-performing methods here to speed up the benchmarks, given they probably won't be chosen?
Yeah, maybe, though it doesn't change the plot much, and it confirms at what point the BLAS matters.
I wonder if that's a trend of Intel vs AMD. Need more data.
I'm pretty sure it's the number of threads. OpenBLAS is generally pretty bad at ramping up the number of threads as size increases and often jumps straight from 1 thread to full multithreading. As such, on CPUs with lots of cores it performs incredibly badly in the region where it should be using 2-4 cores and is instead using 16.
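That behavior is easy to see directly by pinning OpenBLAS to one thread versus letting it use everything. A rough sketch using only stdlib timing (illustrative, not rigorous benchmarking):

```julia
using LinearAlgebra

# Compare OpenBLAS LU with 1 thread vs all threads across sizes. In the
# mid-size region the 1-thread run often wins, because OpenBLAS jumps
# straight to full multithreading instead of ramping up gradually.
for n in (32, 64, 128, 256, 512)
    A = rand(n, n)

    BLAS.set_num_threads(1)
    lu(copy(A))  # warm up
    t1 = minimum(@elapsed(lu(copy(A))) for _ in 1:5)

    BLAS.set_num_threads(Sys.CPU_THREADS)
    lu(copy(A))  # warm up
    tN = minimum(@elapsed(lu(copy(A))) for _ in 1:5)

    println("n = $n:  1 thread $(round(1e3t1; digits = 3)) ms,  ",
            "$(Sys.CPU_THREADS) threads $(round(1e3tN; digits = 3)) ms")
end
```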
But that doesn't explain what was actually mentioned, which involves no OpenBLAS: it's RecursiveFactorization vs MKL and where the cutoff is. From the looks so far, I'd say:
and never doing OpenBLAS.
When I tried to run this, the code errored on the 348 × 348 problem size (twice). Not sure what's up with that since I made a clean tmp environment. Could just be me, but thought I'd share.
```
MethodError: no method matching add_bisecting_if_branches!(::Expr, ::Int64, ::Int64, ::Int64, ::Bool)
The applicable method may be too new: running in world age 35110, while current world is 37210.
```
```
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 80 × Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 40 on 80 virtual cores
```
```
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7800X3D 8-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 16 on 16 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
```
That's interesting. There are now two of us who see a fairly noticeable performance bump in the 350-380 region for RFLU. Any ideas as to what could be causing it? Maybe we aren't using more threads soon enough?
Considering mine crashes in the RFLU at 350, something is definitely going on there.
@chriselrod
Multithreaded and single-threaded are going to look very different.
MKL does very well multithreaded, while OpenBLAS is awful (as has been discussed).
RF does not scale well with multiple threads. That is a known issue, but Yingbo and I never had time to address it.
@ejmeitz, you aren't doing something weird like not using precompiled modules, are you?
When I added the packages it spammed the message below. I started from a clean env so I just kind of ignored the messages, but that is probably the issue. I did run `precompile` after adding all the packages just to be sure, though.
```
┌ Warning: Module VectorizationBase with build ID fafbfcfd-2196-d9ff-0000-9410f7322d5a is missing from the cache.
│ This may mean VectorizationBase [3d5dd08c-fd9d-11e8-17fa-ed2836048c2f] does not support precompilation but is imported by a module that does.
```
Nuke your precompile cache and try again.
```
$ rm -rf ~/.julia/compiled/
```
That fixed it, thanks! RFLU seems better on my machine for longer than some of the others.
```
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 80 × Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 40 on 80 virtual cores
```
```
julia> versioninfo()
Julia Version 1.11.0-DEV.238
Commit 8b8da91ad7 (2023-08-08 01:11 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 53 on 36 virtual cores
```
@ejmeitz your GFLOPs are generally very low, getting stomped by much cheaper CPUs. Maybe it's the clock rate?
Likely, because of the poor multithreaded scaling.
The big winner here is the Ryzen 7800X3D. It probably wants more multithreading to kick in at a much smaller size.
I noticed that too. I also thought it would be the clocks (max turbo is 4 GHz), but it still felt low to me. Probably a combo of poor multithreaded scaling and the low clocks. I can run it on a 128-core AMD CPU if you'd be curious to see that data.
Definitely curious now.
I have run (sorry for the delay, had to install/upgrade some dependencies):
```
Julia Version 1.10.0-alpha1
Commit f8ad15f7b16 (2023-07-06 10:36 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 5 on 6 virtual cores
Environment:
  LD_LIBRARY_PATH =
  JULIA_NUM_THREADS = 4
```
```
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 PRO 6850U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 16 on 16 virtual cores
```
This is instead what I get running julia with `-t 1`.
Note that you would have to separately set the OpenBLAS number of threads and the MKL number of threads to 1.
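For reference, a sketch of forcing both to one thread. The OpenBLAS part is standard; the MKL part assumes MKL_jll is loaded and uses the same `ccall` pattern as the snippet later in this thread.

```julia
using LinearAlgebra
BLAS.set_num_threads(1)  # OpenBLAS (or whatever BLAS libblastrampoline forwards to)

using MKL_jll
# MKL keeps its own thread pool independent of Julia's, so set it separately.
ccall((:MKL_Set_Num_Threads, MKL_jll.libmkl_rt), Cvoid, (Int32,), Int32(1))
```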
```
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 64 on 64 virtual cores
```
Same architecture as @ChrisRackauckas's 5950X, but a lot of threads with low clocks means bad performance, like @ejmeitz's Xeon(R) Gold 5218R.
EDIT: Actually, around the 200x200 mark, it isn't that much behind the 5950x. Still, this isn't a benchmark that really seems to benefit from more than a few threads.
The butterfly lu Yingbo and I worked on like a year ago should be easier to parallelize, but we'd need to finish it.
On a last-gen Xeon workstation:
```
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 20 × Intel(R) Xeon(R) W-2255 CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 20 on 20 virtual cores
```
I'll try this on my Zen 4 machine in a few hours.
I think this only ran on 1 CPU (64 cores) instead of the full 128 cores (2 CPUs), but I saw similar scaling to the 40-core Xeon and @chriselrod's results. VectorizationBase only saw the 64 cores from the one CPU.
```
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 128 × AMD EPYC 7713 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 128 on 128 virtual cores
Environment:
  JULIA_NUM_THREADS = 128
```
(threads should probably be 64 if I only got 1 CPU)
```
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 6 on 6 virtual cores
```
See also https://github.com/JuliaPackaging/Yggdrasil/pull/7189, which should make the OpenBLAS-based ones perform slightly less badly. @ViralBShah pointed out to me that OpenBLAS's cutoff for when to start threading was set 10 years ago and not tweaked since, and AVX2+FMA have significantly changed the optimal size for it.
```
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 PRO 6850U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 8 on 16 virtual cores
Environment:
  JULIA_NUM_THREADS = 8
```
From a Zen4:
```
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 24 on 24 virtual cores
```
(also, the color scheme on these plots isn't really friendly for those of us who are partially colorblind, too many close reds and blues on the legend).
Zen 2 laptop (2 years old):
```
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 16 on 16 virtual cores
Environment:
  LD_LIBRARY_PATH = :/usr/local/cuda-12.1/lib64:/home/joe/local/lib:/home/joe/local/lib64:/usr/local/cuda-12.1/lib64:/home/joe/local/lib:/home/joe/local/lib64
  JULIA_HOME = /home/joe/local
```
The SciMLBenchmarks server wanted to participate:
https://docs.sciml.ai/SciMLBenchmarksOutput/dev/LinearSolve/LUFactorization/
```
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7502 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 128 on 128 virtual cores
Environment:
  JULIA_CPU_THREADS = 128
  JULIA_DEPOT_PATH = /cache/julia-buildkite-plugin/depots/5b300254-1738-4989-ae0a-f4d2d937f953
  JULIA_IMAGE_THREADS = 1
```
I was hoping that these benchmarks would show that we should drop RecursiveFactorization from the defaults now that we have MKL, but they just don't show that. I think the scheme I'd go with is:
But that tuning would only matter for the high end. Because RecursiveFactorization is still in the picture, we cannot cut out its precompilation. But the results are very clear that we consistently want it for the low end.
RFLU's threading heuristics are quite bad.
Someone should see if we can use StaticCompiler.jl and BinaryBuilder to compile a shared library for each architecture feature level (i.e., for x64, we'd want generic, AVX, AVX+FMA, and AVX512). Then, we can load the appropriate one and `ccall` it. Users get shipped binaries and don't have to deal with precompiling it, or even directly depending on the LoopVectorization.jl stack at all.
You'd have to compile it in single threaded-only mode, but I think that is for the best anyway. MKL does threading so much better, so the point where we want multiple threads is the point where we should be switching to MKL.
> Someone should see if we can use StaticCompiler.jl and BinaryBuilder to compile a shared library for each architecture feature level (i.e., for x64, we'd want generic, AVX, AVX+FMA, and AVX512). Then, we can load the appropriate one and `ccall` it. Users get shipped binaries and don't have to deal with precompiling it, or even directly depending on the LoopVectorization.jl stack at all.
That would be an interesting GSoC
Do we know how MKL is doing threading? It should be easy enough to observe (e.g., by watching htop).
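Besides watching htop, you can also just ask MKL directly. This assumes MKL_jll is available; `MKL_Get_Max_Threads` is MKL's C API query for the thread count it will use in the next parallel region.

```julia
using MKL_jll

# Query how many threads MKL intends to use.
mkl_get_max_threads() = ccall((:MKL_Get_Max_Threads, MKL_jll.libmkl_rt), Int32, ())

@show mkl_get_max_threads()
```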
Mac with [1,4,8] threads:
The single-threaded case for real-time applications stacks up nicely. A potential lockless multi-threaded version might not be worth the overhead. Also interested to see if Metal GPU is beneficial for real-time.
```
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
```
> see also https://github.com/JuliaPackaging/Yggdrasil/pull/7189 which should make the OpenBLAS-based ones perform slightly less badly. @ViralBShah pointed out to me that OpenBLAS's cutoff for when to start threading was set 10 years ago and not tweaked since and AVX2+FMA have significantly changed the optimal size for it.
Julia master now has a much better GEMM threading threshold: https://github.com/JuliaLang/julia/pull/50844
Using a single Julia thread:
```
julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
```
Using `julia -t auto`:
```
julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 11 on 8 virtual cores
```
Single thread:
```
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
```
`julia -t auto`:
```
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
```
@ChrisRackauckas Where can I find out what each of the factorizations does?
I think we may want
```julia
@static if Sys.ARCH === :x86_64
    const libMKL = MKL_jll.libmkl_rt # more convenient name
    function mkl_set_num_threads(N::Integer)
        ccall((:MKL_Set_Num_Threads, libMKL), Cvoid, (Int32,), N % Int32)
    end
    mkl_set_num_threads(Threads.nthreads())
end
```
for single-threaded comparisons.
Single-threaded performance, where we actually restrict the solves to a single thread, would likely be useful for ensemble solves.
This is a thread for investigating changes to the LU defaults, based off of benchmarks like https://github.com/SciML/LinearSolve.jl/pull/356 .
(Note: there's a Mac-specific version 3 posts down)
lubench.pdf
From the looks of this, the justification for RecursiveFactorization.jl still seems very strong.
Needs examples on other systems.