SciML / LinearSolve.jl

LinearSolve.jl: High-Performance Unified Interface for Linear Solvers in Julia. Easily switch between factorization and Krylov methods and add preconditioners, all in one interface.
https://docs.sciml.ai/LinearSolve/stable/

Changing the default LU? #357

Closed (ChrisRackauckas closed this issue 10 months ago)

ChrisRackauckas commented 1 year ago

This is a thread for investigating changes to the LU defaults, based off of benchmarks like https://github.com/SciML/LinearSolve.jl/pull/356 .

(Note: there's a Mac-specific version 3 posts down)

using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve, MKL_jll
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
            isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                innerflop
            end
        end
        invflop + scaleflop + updateflop
    end
end

algs = [LUFactorization(), GenericLUFactorization(), RFLUFactorization(), MKLLUFactorization(), FastLUFactorization(), SimpleLUFactorization()]
res = [Float64[] for i in 1:length(algs)]

ns = 4:8:500
for i in 1:length(ns)
    n = ns[i]
    @info "$n × $n"
    rng = MersenneTwister(123)
    global A = rand(rng, n, n)
    global b = rand(rng, n)
    global u0 = rand(rng, n)

    for j in 1:length(algs)
        bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
        push!(res[j], luflop(n) / bt / 1e9)
    end
end

using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)

p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
    plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p

savefig("lubench.png")
savefig("lubench.pdf")

[plot: lubench.pdf attached]
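For reference, the GFLOPs values plotted here are luflop(n) / time / 1e9. The estimator itself can be sanity-checked at tiny sizes; the block below reproduces the luflop function from the script above unchanged and checks it against hand counts:

```julia
# Same flop estimator as in the benchmark script above.
function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
            isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                innerflop
            end
        end
        invflop + scaleflop + updateflop
    end
end

# Hand counts: for n = 1 only the pivot reciprocal is counted. For n = 2,
# step k = 1 gives 1 (inv) + 2 (scale term, sum(2:2)) + 2 (rank-1 update),
# and step k = 2 gives 1, so 6 total.
@assert luflop(1) == 1
@assert luflop(2) == 6
@assert luflop(3) == 21
```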

From the looks of this, the justification for RecursiveFactorization.jl still looks very strong.

julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 32 on 32 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 32

Needs examples on other systems.

ChrisRackauckas commented 1 year ago

OpenBLAS looks drunk.

oscardssmith commented 1 year ago

OpenBLAS is just multithreading at way too small a size. RFLU looks like a good option. Is the (pre)compile time impact reasonable?

ChrisRackauckas commented 1 year ago

A Mac version:

using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
            isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                innerflop
            end
        end
        invflop + scaleflop + updateflop
    end
end

algs = [LUFactorization(), GenericLUFactorization(), RFLUFactorization(), AppleAccelerateLUFactorization(), FastLUFactorization(), SimpleLUFactorization()]
res = [Float64[] for i in 1:length(algs)]

ns = 4:8:500
for i in 1:length(ns)
    n = ns[i]
    @info "$n × $n"
    rng = MersenneTwister(123)
    global A = rand(rng, n, n)
    global b = rand(rng, n)
    global u0 = rand(rng, n)

    for j in 1:length(algs)
        bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
        push!(res[j], luflop(n) / bt / 1e9)
    end
end

using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)

p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
    plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p

savefig("lubench.png")
savefig("lubench.pdf")

[plot: lubench.pdf attached]

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
oscardssmith commented 1 year ago

It's interesting that RFLU does poorly on Mac. Does it not know about the lower vectorization width?

ViralBShah commented 1 year ago

Ideally you'd have a tune or plan API that can run on a user's system. Applications that care about performance can opt into a tuning run and save the resulting preferences. If something like a switchable-BLAS package gets done, this sort of tuning becomes even easier.
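A dependency-free sketch of what such an opt-in tuning run could look like; tune_lu and the candidate set are hypothetical names, and a real version would persist the winners with something like Preferences.jl rather than returning them:

```julia
using LinearAlgebra

# Hypothetical tuning run: time each candidate solve at a few representative
# sizes and record the fastest per size. A real implementation would save
# the result as a preference instead of returning a Dict.
function tune_lu(candidates::Dict{Symbol,<:Function}; sizes = (50, 200))
    winners = Dict{Int,Symbol}()
    for n in sizes
        A = rand(n, n) + n * I   # diagonally shifted, so well-conditioned
        b = rand(n)
        times = Dict(name => @elapsed(f(copy(A), b)) for (name, f) in candidates)
        winners[n] = argmin(times)   # key with the smallest timing
    end
    return winners
end

candidates = Dict{Symbol,Function}(
    :lu_backslash => (A, b) -> lu!(A) \ b,
    :qr_backslash => (A, b) -> qr!(A) \ b,
)
winners = tune_lu(candidates)
```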

vpuri3 commented 1 year ago

Running the Mac-specific version from the comment above:

 julia +beta --startup-file=no --proj
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.0-beta1 (2023-07-25)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(tmp) pkg> st
Status `~/.julia/dev/tmp/Project.toml`
  [13e28ba4] AppleAccelerate v0.4.0
  [6e4b80f9] BenchmarkTools v1.3.2
  [7ed4a6bd] LinearSolve v2.5.0
  [dde4c033] Metal v0.5.0
  [91a5bcdd] Plots v1.38.17
  [3d5dd08c] VectorizationBase v0.21.64

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_NUM_PRECOMPILE_TASKS = 8
  JULIA_DEPOT_PATH = /Users/vp/.julia
  JULIA_PKG_DEVDIR = /Users/vp/.julia/dev

[plot: lubench.pdf attached]

oscardssmith commented 1 year ago
Julia Version 1.11.0-DEV.235
Commit 9f9e989f24 (2023-08-06 04:35 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 6 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4,1

[plot attached]

Looks like RFLU is really good at small sizes but isn't doing that well once all your threads have non-trivially sized problems.

ChrisRackauckas commented 1 year ago

I wonder if that's a trend of Intel vs AMD. Need more data.

nilshg commented 1 year ago

Here is:

julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1

[plot attached]

Might make sense to remove the two lowest-performing methods here to speed up the benchmarks, given they probably won't be chosen?

ChrisRackauckas commented 1 year ago

Yeah, maybe, though it doesn't change the plot much, and it confirms at what point the BLAS matters.

oscardssmith commented 1 year ago

I wonder if that's a trend of Intel vs AMD. Need more data.

I'm pretty sure it's the number of threads. OpenBLAS is generally pretty bad at ramping up the number of threads as size increases, and often goes straight from 1 to full multithreading. As such, on CPUs with lots of cores it performs incredibly badly in the region where it should be using 2-4 cores and is instead using 16.
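The kind of ramp-up being described would look something like this toy heuristic (the divisor and cap are invented for illustration; they are not OpenBLAS's or MKL's actual policy):

```julia
# Toy size-based thread ramp: grow the thread count gradually with n
# instead of jumping straight from 1 thread to full multithreading.
ramped_threads(n::Integer; maxthreads::Integer = 16) = clamp(cld(n, 100), 1, maxthreads)
```

Under this heuristic a 350 x 350 problem would get 4 threads rather than all 16.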

ChrisRackauckas commented 1 year ago

But that gives no explanation for what was actually mentioned, which has no OpenBLAS in it: RecursiveFactorization vs MKL and where the cutoff is. From the looks so far, I'd say:

  1. On Mac, default to always using Accelerate
  2. On Intel, default to RecursiveFactorization with a cutoff at n = 150, switching to MKL
  3. On AMD, default to RecursiveFactorization with no cutoff

and never use OpenBLAS.
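That decision table can be written down as a small pure function. This is only a sketch of the proposal above (the real default in LinearSolve also weighs element type and which backends are loaded), with the OS and CPU vendor passed explicitly so it is testable:

```julia
# Sketch of the proposed LU default. The n = 150 Intel crossover comes from
# the benchmarks in this thread; :macos / :intel / :amd are illustrative tags.
function default_lu(os::Symbol, vendor::Symbol, n::Integer)
    os === :macos && return :AppleAccelerateLUFactorization
    if vendor === :intel && n > 150
        return :MKLLUFactorization
    end
    return :RFLUFactorization   # Intel small sizes, and AMD at every size
end
```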

ejmeitz commented 1 year ago

When I tried to run this the code errored on the 348 x 348 problem size (twice). Not sure what's up with that since I made a clean tmp environment. Could just be me but thought I'd share.

MethodError: no method matching add_bisecting_if_branches!(::Expr, ::Int64, ::Int64, ::Int64, ::Bool)
The applicable method may be too new: running in world age 35110, while current world is 37210.
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 80 × Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 40 on 80 virtual cores
DaniGlez commented 1 year ago
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7800X3D 8-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 16 on 16 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1

[plot attached]

oscardssmith commented 1 year ago

That's interesting. There are now two of us with a fairly noticeable performance bump in the 350 to 380 region for RFLU. Any ideas as to what could be causing it? Maybe we aren't using more threads soon enough?

ejmeitz commented 1 year ago

Considering mine crashes in the RFLU at 350, something is definitely going on there.

ChrisRackauckas commented 1 year ago

@chriselrod

chriselrod commented 1 year ago

Multithreaded and singlethreaded are going to look very different.

MKL does very well multithreaded, while OpenBLAS is awful (as has been discussed).

RF does not scale well with multiple threads. That is a known issue, but Yingbo and I never had time to address it.

chriselrod commented 1 year ago

@ejmeitz, you aren't doing something weird like not using precompiled modules, are you?

ejmeitz commented 1 year ago

When I added the packages it spammed the message below. I started from a clean environment, so I kind of ignored the messages, but that is probably the issue. I did run precompile after adding all the packages just to be sure, though.

┌ Warning: Module VectorizationBase with build ID fafbfcfd-2196-d9ff-0000-9410f7322d5a is missing from the cache.
│ This may mean VectorizationBase [3d5dd08c-fd9d-11e8-17fa-ed2836048c2f] does not support precompilation but is imported by a module that does.
chriselrod commented 1 year ago

Nuke your precompile cache and try again.

$ rm -rf ~/.julia/compiled/
ejmeitz commented 1 year ago

That fixed it, thanks! RFLU seems to stay ahead on my machine out to larger sizes than on some of the others.

[plot attached]

Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 80 × Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 40 on 80 virtual cores
chriselrod commented 1 year ago

[plot attached]

julia> versioninfo()
Julia Version 1.11.0-DEV.238
Commit 8b8da91ad7 (2023-08-08 01:11 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 53 on 36 virtual cores
ChrisRackauckas commented 1 year ago

@ejmeitz your GFLOPs are generally very low, getting stomped by much cheaper CPUs. Maybe it's the clock rate?

chriselrod commented 1 year ago

Likely, because of the poor multithreaded scaling.

The big winner here is the Ryzen 7800X3D. It probably wants more multithreading to kick in at a much smaller size.

ejmeitz commented 1 year ago

I noticed that too. I also thought it would be the clocks (max turbo is 4 GHz), but it still felt low to me. Probably a combo of poor multithreaded scaling and the low clocks. I can run it on a 128-core AMD CPU if you'd be curious to see that data.

ChrisRackauckas commented 1 year ago

Definitely curious now.

Leticia-maria commented 1 year ago

I have run (sorry for the delay, had to install/upgrade some dependencies):

Julia Version 1.10.0-alpha1
Commit f8ad15f7b16 (2023-07-06 10:36 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 5 on 6 virtual cores
Environment:
  LD_LIBRARY_PATH = 
  JULIA_NUM_THREADS = 4
[screenshot of benchmark plot attached]
mastrof commented 1 year ago
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 PRO 6850U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 16 on 16 virtual cores

[plot attached]

This is instead what I get running julia with -t 1:

[plot attached]

chriselrod commented 1 year ago

Note that you would have to separately set OpenBLAS number of threads and MKL number of threads to 1.
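Concretely, julia -t 1 only limits Julia's task threads; each BLAS keeps its own pool. Something like the following (the MKL lines assume MKL_jll is loaded, as the MKLLUFactorization path does):

```julia
using LinearAlgebra

# Limit whichever LBT backend is active (OpenBLAS by default):
BLAS.set_num_threads(1)

# MKL loaded directly through MKL_jll keeps a separate pool and must be
# limited on its own, e.g.:
#   using MKL_jll
#   ccall((:MKL_Set_Num_Threads, MKL_jll.libmkl_rt), Cvoid, (Int32,), Int32(1))
```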

chriselrod commented 1 year ago

[plot attached]

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 64 on 64 virtual cores

Same architecture as @ChrisRackauckas's 5950X, but a lot of threads with low clocks means bad performance, like @ejmeitz's Xeon(R) Gold 5218R.

EDIT: Actually, around the 200x200 mark, it isn't that much behind the 5950x. Still, this isn't a benchmark that really seems to benefit from more than a few threads.

The butterfly LU Yingbo and I worked on about a year ago should be easier to parallelize, but we'd need to finish it.

imciner2 commented 1 year ago

[plot: lubench_Xeon20thread.pdf attached]

On a last-gen Xeon workstation:

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 20 × Intel(R) Xeon(R) W-2255 CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 20 on 20 virtual cores

I'll try this on my Zen 4 machine in a few hours.

ejmeitz commented 1 year ago

I think this only ran on 1 CPU (64 cores) instead of the full 128 cores (2 CPUs), but I saw similar scaling to the 40-core Xeon and @chriselrod's results. VectorizationBase only saw the 64 cores from the one CPU.

[plot attached]

Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 128 × AMD EPYC 7713 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 128 on 128 virtual cores
Environment:
  JULIA_NUM_THREADS = 128

(threads should probably be 64 if I only got 1 CPU)

rmsrosa commented 1 year ago

[plot attached]

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 6 on 6 virtual cores
oscardssmith commented 1 year ago

See also https://github.com/JuliaPackaging/Yggdrasil/pull/7189, which should make the OpenBLAS-based ones perform slightly less badly. @ViralBShah pointed out to me that OpenBLAS's cutoff for when to start threading was set 10 years ago and not tweaked since, and AVX2+FMA have significantly changed the optimal size for it.

nkoch1 commented 1 year ago
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 PRO 6850U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 8 on 16 virtual cores
Environment:
  JULIA_NUM_THREADS = 8

[plot attached]

imciner2 commented 1 year ago

[plot: lubench_Zen424thread.pdf attached]

From a Zen4:

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 24 on 24 virtual cores

(Also, the color scheme on these plots isn't really friendly for those of us who are partially colorblind; there are too many close reds and blues in the legend.)

joelandman commented 1 year ago

Zen 2 laptop (2 years old):

[plot attached]

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 16 on 16 virtual cores
Environment:
  LD_LIBRARY_PATH = :/usr/local/cuda-12.1/lib64:/home/joe/local/lib:/home/joe/local/lib64:/usr/local/cuda-12.1/lib64:/home/joe/local/lib:/home/joe/local/lib64
  JULIA_HOME = /home/joe/local

[plot: lubench.pdf attached]

ChrisRackauckas commented 1 year ago

The SciMLBenchmarks server wanted to participate:

https://docs.sciml.ai/SciMLBenchmarksOutput/dev/LinearSolve/LUFactorization/

[plot attached]

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7502 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 128 on 128 virtual cores
Environment:
  JULIA_CPU_THREADS = 128
  JULIA_DEPOT_PATH = /cache/julia-buildkite-plugin/depots/5b300254-1738-4989-ae0a-f4d2d937f953
  JULIA_IMAGE_THREADS = 1
ChrisRackauckas commented 1 year ago

I was hoping that these benchmarks would show that we should drop RecursiveFactorization from the defaults now that we have MKL, but they just don't show that. I think the scheme I'd go with is:

But that tuning would only be for evenly high performance. Because RecursiveFactorization is still in the picture, we cannot cut out its precompilation. But the results are very clear that we consistently want it at the low end.

chriselrod commented 1 year ago

RFLU's threading heuristics are quite bad.

chriselrod commented 1 year ago

Someone should see if we can use StaticCompiler.jl and BinaryBuilder to compile a shared library for each architecture feature level (i.e., for x64, we'd want generic, AVX, AVX+FMA, and AVX512). Then, we can load the appropriate one and ccall it. Users get shipped binaries and don't have to deal with precompiling it, or even directly depending on the LoopVectorization.jl stack at all.

You'd have to compile it in single threaded-only mode, but I think that is for the best anyway. MKL does threading so much better, so the point where we want multiple threads is the point where we should be switching to MKL.
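The ship-a-binary-and-ccall pattern being proposed looks like the sketch below, which uses LAPACK's dgetrf (already shipped with Julia through libblastrampoline) as a stand-in for a hypothetical precompiled RecursiveFactorization kernel:

```julia
using LinearAlgebra
using LinearAlgebra: BlasInt, libblastrampoline
using LinearAlgebra.BLAS: @blasfunc

# Stand-in for ccall-ing a shipped LU kernel: factor A in place through the
# dgetrf symbol from the BLAS/LAPACK trampoline library.
function lu_via_ccall!(A::Matrix{Float64})
    m, n = size(A)
    ipiv = Vector{BlasInt}(undef, min(m, n))
    info = Ref{BlasInt}(0)
    ccall((@blasfunc(dgetrf_), libblastrampoline), Cvoid,
          (Ref{BlasInt}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt},
           Ptr{BlasInt}, Ptr{BlasInt}),
          m, n, A, max(1, m), ipiv, info)
    info[] == 0 || error("dgetrf failed with info = $(info[])")
    return A, ipiv
end
```

A compiled RF kernel would be called the same way, just against a different library handle loaded per architecture feature level.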

ChrisRackauckas commented 1 year ago

Someone should see if we can use StaticCompiler.jl and BinaryBuilder to compile a shared library for each architecture feature level (i.e., for x64, we'd want generic, AVX, AVX+FMA, and AVX512). Then, we can load the appropriate one and ccall it. Users get shipped binaries and don't have to deal with precompiling it, or even directly depending on the LoopVectorization.jl stack at all.

That would be an interesting GSoC

oscardssmith commented 1 year ago

Do we know how MKL is doing its threading? It should be easy enough to observe (e.g., by watching htop).

ggkountouras commented 1 year ago

Mac with [1,4,8] threads:

[plots attached: 1, 4, and 8 threads]

The single-threaded case stacks up nicely for real-time applications. A potential lockless multithreaded version might not be worth the overhead. I'm also interested to see whether the Metal GPU is beneficial for real-time.

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
ViralBShah commented 1 year ago

see also https://github.com/JuliaPackaging/Yggdrasil/pull/7189 which should make the OpenBlas based ones perform slightly less badly. @ViralBShah pointed out to me that OpenBlas's cutoff for when to start threading was set 10 years ago and not tweaked since and AVX2+FMA have significantly changed the optimal size for it.

Julia master now has a much better GEMM threading threshold: https://github.com/JuliaLang/julia/pull/50844

albheim commented 1 year ago

Using single julia thread

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

[plot attached]

Using julia -t auto

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 11 on 8 virtual cores

[plot attached]

lgoettgens commented 1 year ago

single thread

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

[plot attached]

julia -t auto

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores

[plot attached]

ViralBShah commented 1 year ago

@ChrisRackauckas Where can I find out what each of the factorizations does?

chriselrod commented 1 year ago

I think we may want

using MKL_jll

@static if Sys.ARCH === :x86_64
    const libMKL = MKL_jll.libmkl_rt # more convenient name
    function mkl_set_num_threads(N::Integer)
        ccall((:MKL_Set_Num_Threads, libMKL), Cvoid, (Int32,), N % Int32)
    end
    mkl_set_num_threads(Threads.nthreads())
end

for single-threaded comparisons.

Single threaded performance where we actually restrict the solves to a single thread would likely be useful for ensemble solves.