JuliaLinearAlgebra / Octavian.jl

Multi-threaded BLAS-like library that provides pure Julia matrix multiplication
https://julialinearalgebra.github.io/Octavian.jl/stable/

M1 #185

Closed chriselrod closed 11 months ago

chriselrod commented 11 months ago

@JohannesNaegele, this should improve things and fix the tilesearch.

codecov[bot] commented 11 months ago

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (8bfdecb) 89.90% compared to head (be5ca9e) 89.63%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #185      +/-   ##
==========================================
- Coverage   89.90%   89.63%   -0.28%
==========================================
  Files          14       14
  Lines        1139     1138       -1
==========================================
- Hits         1024     1020       -4
- Misses        115      118       +3
```

| [Files](https://app.codecov.io/gh/JuliaLinearAlgebra/Octavian.jl/pull/185?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaLinearAlgebra) | Coverage Δ | |
|---|---|---|
| [src/matmul.jl](https://app.codecov.io/gh/JuliaLinearAlgebra/Octavian.jl/pull/185?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaLinearAlgebra#diff-c3JjL21hdG11bC5qbA==) | `92.50% <100.00%> (-0.28%)` | :arrow_down: |
| [src/funcptrs.jl](https://app.codecov.io/gh/JuliaLinearAlgebra/Octavian.jl/pull/185?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaLinearAlgebra#diff-c3JjL2Z1bmNwdHJzLmps) | `97.29% <85.71%> (-2.71%)` | :arrow_down: |
| [src/global_constants.jl](https://app.codecov.io/gh/JuliaLinearAlgebra/Octavian.jl/pull/185?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaLinearAlgebra#diff-c3JjL2dsb2JhbF9jb25zdGFudHMuamw=) | `72.22% <57.14%> (ø)` | |


JohannesNaegele commented 11 months ago

Sadly, it doesn't improve on the prior version:

(...) pkg> status Octavian
Status `~/.../Project.toml`
  [6fd5a793] Octavian v0.3.26 `https://github.com/JuliaLinearAlgebra/Octavian.jl#m1`

julia> Octavian.block_sizes(Val(Float64))
(static(400), static(586), static(174))

julia> @btime Octavian.matmul!($C, $A, $B) # previously around or less than 340 ms
  353.531 ms (0 allocations: 0 bytes)

Note that I am on an M2, though.

chriselrod commented 11 months ago

Hmm, with current settings I have

julia> Octavian.block_sizes(Val(Float64))
(static(240), static(818), static(13878))

julia> A = rand(2000,2000);

julia> B = rand(2000,2000);

julia> C = similar(A);

julia> @time matmul!(C,A,B);
  4.674307 seconds (12.73 M allocations: 714.998 MiB, 2.55% gc time, 92.70% compilation time)

julia> @time matmul!(C,A,B);
  0.340958 seconds

julia> @benchmark matmul!($C,$A,$B)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  340.878 ms … 340.911 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     340.882 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   340.890 ms ±  18.047 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █     █                                                     █  
  █▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  341 ms           Histogram: frequency by time          341 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

So 340 ms on my M1, which should be slower than your M2.

It still loses to OpenBLAS

julia> using LinearAlgebra; BLAS.set_num_threads(1);

julia> @benchmark mul!($C,$A,$B)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  332.992 ms … 336.010 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     335.081 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   334.694 ms ±   1.546 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                         █                 █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  333 ms           Histogram: frequency by time          336 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

but it isn't too far off.

Octavian wins at that size for me when using 4 threads:

julia> @benchmark matmul!($C,$A,$B)

BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max):  93.273 ms …  93.667 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     93.507 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.502 ms ± 122.899 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █             █     █      ██      █         █ █ █     █   █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁██▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁█▁█▁▁▁▁▁█▁▁▁█ ▁
  93.3 ms         Histogram: frequency by time         93.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> 

julia> @benchmark mul!($C,$A,$B)
BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max):  96.795 ms … 97.115 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     96.908 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   96.927 ms ± 94.513 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁      ▁     ▁    ▁ █    █▁                        ▁      ▁  
  █▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁█▁█▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█ ▁
  96.8 ms         Histogram: frequency by time        97.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
chriselrod commented 11 months ago

You could try setting some number of threads (less than or equal to the number of performance cores) and including tilesearch.jl.

Note, it'll take hours to run. You can decide how long to run it (e.g., it currently says 8 * hours for 8 hours), and then check Optim.minimizer(opt).

Note that you'd need the version of the file from this PR for it to work correctly. If it gets a big improvement in GFLOPS over where it started, you could report the numbers and I'll set them as the default.
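As a sketch, that workflow might look like the following (the exact location of tilesearch.jl inside the repository and the name `opt` for the optimization result are assumptions on my part, based on this thread):

```julia
# Start Julia with no more threads than performance cores, e.g. `julia -t 4`,
# from a checkout of this PR's branch.
using Octavian, Optim

# tilesearch.jl sets up and runs the blocking-parameter search;
# its path within the package is assumed here.
include(joinpath(pkgdir(Octavian), "benchmark", "tilesearch.jl"))

# After letting it run (it takes hours), inspect the best parameters found:
Optim.minimizer(opt)
```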

JohannesNaegele commented 11 months ago

It still loses to OpenBLAS


julia> using LinearAlgebra; BLAS.set_num_threads(1);

As it turns out, I was under the false impression that JULIA_NUM_THREADS = 1 would imply that BLAS uses only 1 thread as well (yikes).
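Indeed, the two settings are independent: `JULIA_NUM_THREADS` controls Julia's own task threads, while OpenBLAS's thread pool is configured separately. This is easy to check:

```julia
using LinearAlgebra

# Julia's task threads, set via JULIA_NUM_THREADS or `julia -t`:
Threads.nthreads()

# OpenBLAS's thread pool, set independently via BLAS.set_num_threads:
BLAS.get_num_threads()
```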

julia> using Octavian; using BenchmarkTools

julia> A = rand(2000,2000); B = rand(2000,2000); C = rand(2000,2000);

julia> using LinearAlgebra; BLAS.set_num_threads(1);

julia> @benchmark mul!($C,$A,$B)
BenchmarkTools.Trial: 16 samples with 1 evaluation.
 Range (min … max):  315.903 ms … 359.456 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     328.384 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   327.535 ms ±  11.794 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                              
  █▁▆▁▆▆▁▁▁▁▁▁▁▁▁▆▁▁▆▆▆▁▁▁▁▆▆▆▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  316 ms           Histogram: frequency by time          359 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Octavian.matmul!($C, $A, $B)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  353.367 ms … 367.733 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     363.146 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   361.958 ms ±   4.533 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁      ▁       ▁ ▁            ▁         █ ▁▁     ▁▁  ▁    ▁ ▁  
  █▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁██▁▁▁▁▁██▁▁█▁▁▁▁█▁█ ▁
  353 ms           Histogram: frequency by time          368 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Nevertheless, this version seems to be slower than main. I will run the tile search overnight though.

chriselrod commented 11 months ago

Also, if you're really interested in BLAS performance on Apple Silicon, you should try AppleAccelerate. Simply loading the library should swap OpenBLAS for Accelerate, which uses the chip's undocumented matrix instructions instead of the NEON instructions that OpenBLAS and Octavian use.

These run on a separate unit, of which I believe there is only one per chip. But that lets a single core get peak performance while the others do something else.
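For reference, that swap can be sketched like this (assuming AppleAccelerate.jl is installed; loading it replaces the default BLAS backend):

```julia
using LinearAlgebra, BenchmarkTools

A = rand(2000, 2000); B = rand(2000, 2000); C = similar(A);

@benchmark mul!($C, $A, $B)   # dispatches to OpenBLAS

using AppleAccelerate          # loading this swaps the BLAS backend

@benchmark mul!($C, $A, $B)   # now dispatches to Accelerate
```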

JohannesNaegele commented 11 months ago

I ran it for 7 hours and now it is around the speed of main:

julia> Optim.minimizer(opt)
4-element Vector{Float64}:
 0.1464967933382345
 0.07243228432052636
 0.5024723443788641
 0.9018940596921994

julia> @btime matmul_pack_ab!(C, A, B,
           Val(Octavian.StaticFloat64{0.1464967933382345}()),
           Val(Octavian.StaticFloat64{0.07243228432052636}()),
           Val(Octavian.StaticFloat64{0.5024723443788641}()), 
           Val(Octavian.StaticFloat64{0.9018940596921994}()),
       )
  682.623 ms (1 allocation: 16 bytes)
0.347407875

julia> 682.623/2
341.3115
JohannesNaegele commented 11 months ago

Also, if you're really interested in BLAS performance on Apple Silicon, you should try AppleAccelerate. Simply loading the library should swap OpenBLAS for Accelerate, which uses the chip's undocumented matrix instructions instead of the NEON instructions that OpenBLAS and Octavian use.

I am just generally interested in matmul acceleration (I have some experience with GPU acceleration using custom CUTLASS kernels, but little understanding of CPU acceleration, especially of how exactly the L1, L2, and L3 caches influence the choice of parameters...). Maybe I will check out what happens if one uses an AppleAccelerate micro-kernel with custom blocking.

These run on a separate unit, of which I believe there is only one per chip. But that lets a single core get peak performance while the others do something else.

Interesting 🤔