The plan was to have LBT as a way to pick a different BLAS than the default OpenBLAS for now. That requires you to load a package every time you start Julia to change the default. Eventually, once the Preferences mechanism becomes standard, we want to use that so that users can pick a different BLAS by default.
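For reference, the runtime switch itself is only a couple of calls with the LBT API on a recent Julia (a sketch; the library path is a placeholder):

using LinearAlgebra

BLAS.lbt_get_config()     # list the BLAS/LAPACK libraries LBT currently forwards to
BLAS.lbt_forward("/path/to/otherblas.dylib"; clear=true)  # replace them with another library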
I don't think we want to depend on the Apple provided BLAS by default for the M1 for now.
Some (anecdotal) benchmark scenarios that might illustrate why Accelerate makes sense (at least for gemm-type operations). I've written a machine learning framework and benchmark suite, syncognite/bench, in C++ using eigen3 (so neither Julia nor OpenBLAS), but it might serve as a hint for the kind of performance improvements that are possible.
The following table compares a benchmark run with native eigen3 on ARM64 against a second run using Apple's Accelerate within eigen (via the USE_SYSTEM_BLAS variable). The percentage columns show the speed-up for the forward (fw) and backward (bw) pass of neural network training when using Accelerate; and all of that with very low energy usage!
Layer fw(ms) fw/N(ms) % bw(ms) bw/N(ms) %
OneHot 1.068 0.0107 0.1 0.000 0.0000 66.6
Affine 0.464 0.0050 620.1 1.210 0.0130 485.4
AffineRelu 0.538 0.0056 506.4 1.294 0.0135 424.4
Relu 0.023 0.0002 5.8 0.423 0.0042 -1.8
Nonlin-Relu 0.064 0.0006 -1.9 0.067 0.0007 0.7
Nonlin-Sgmd 0.126 0.0012 -1.2 0.067 0.0007 -1.8
Nonlin-Tanh 0.099 0.0010 -0.0 0.070 0.0007 -0.5
Nonlin-Selu 0.746 0.0075 -0.1 0.649 0.0065 -0.4
Nonlin-Resilu 0.202 0.0020 0.2 0.243 0.0024 -0.6
Dropout 0.713 0.0072 -0.8 0.030 0.0003 1.5
BatchNorm 0.159 0.0016 -3.1 0.217 0.0022 -4.6
SpatialBtchNrm 2.610 0.0260 -1.0 3.936 0.0393 -0.2
Convolution 54.388 0.5470 20.3 58.559 0.5869 45.1
Pooling 6.421 0.0643 -0.0 22.589 0.2262 1.7
LSTM 27.616 0.2778 299.8 78.172 0.7677 198.0
RNN 16.929 0.1660 110.4 33.648 0.3297 114.6
TemporalAff 5.704 0.0569 105.7 5.545 0.0554 209.3
WordEmbed 5.379 0.0541 129.1 4.227 0.0419 116.0
SVM 0.229 0.0023 0.5 0.225 0.0022 0.9
Softmax 0.228 0.0023 0.1 0.022 0.0002 1.5
TemporalSM 4.356 0.0434 -0.8 1.401 0.0140 0.1
TwoLayerNet 1.776 0.0177 300.5 2.555 0.0255 425.1
I think @chriselrod has played around with this a little bit.
The first thing is to create an MKL.jl-like package for Apple Accelerate. We already have some support in LBT for Accelerate, so this should be fairly quick.
A difficulty is that Accelerate uses the 32-bit (LP64) API, which is why AppleAccelerateLinAlgWrapper manually defines the methods it uses (and is based on Elliot's code).
(Also, AppleAccelerateLinAlgWrapper has a deliberately cumbersome name to avoid stealing/squatting on potentially valuable names for a future such package that supersedes it.)
Since Accelerate only has an LP64 BLAS and Julia wants ILP64, this is quite difficult, unless we can somehow make this choice dynamic as discussed in https://github.com/JuliaLang/julia/issues/43304.
It should be possible to have a separate wrapper like @chriselrod discusses above that packages can directly invoke, but swapping it in as the default BLAS in Julia is fairly non-trivial.
I'm no expert, but wouldn't it be quite easy to write some kind of wrapper libblas that just redirects level-3 BLAS calls to the Apple Accelerate BLAS and all other calls to OpenBLAS? I mean, ILP64 does not really play a role for level-3 BLAS imho anyway. On the other hand, level-3 BLAS routines are probably the only routines that benefit from Apple's AMX extension…
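For what it's worth, a rough sketch of that redirect at the Julia level rather than as a libblas shim (acc_dgemm! is a made-up name; 102 and 111 are the standard CBLAS column-major/no-transpose enum values). The point is that level-3 dimension arguments fit comfortably in Int32, which is what makes calling the LP64 interface safe here:

const libacc = "/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate"

# C = A*B via Accelerate's classic LP64 cblas_dgemm; all dimension arguments
# are 32-bit, which is fine for any level-3 call whose sizes fit in Int32.
function acc_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    m, k = size(A); n = size(B, 2)
    ccall((:cblas_dgemm, libacc), Cvoid,
          (Cint, Cint, Cint, Cint, Cint, Cint, Float64,
           Ptr{Float64}, Cint, Ptr{Float64}, Cint,
           Float64, Ptr{Float64}, Cint),
          102, 111, 111, m, n, k, 1.0, A, m, B, k, 0.0, C, m)
    return C
end

All other calls would keep going to OpenBLAS untouched.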
Yes, we could have a wrapper library that redirects all 64-bit ILP64 calls to a 32-bit BLAS. It seems like it would be easier to have Apple just provide ILP64 support with mangled names; Intel is doing that in MKL now.
Do we have a way to ask Apple to do this? @Keno @staticfloat Perhaps one of you has a contact at Apple?
We can ask.
Nice that you guys have the appropriate contacts ;-) However, from what I heard in other discussions, Apple currently seems to assign very few resources to their BLAS/LAPACK development. So, I wouldn't bet on them... Nevertheless, I keep my fingers crossed ^^ The other proper solution would be to have AMX kernels implemented in OpenBLAS. It seems that (Apple M1) AMX has now been (unofficially) decrypted more or less completely: https://github.com/corsix/amx. I guess with this knowledge the folks from OpenBLAS should be able to do their job (if no legal restrictions apply).
Please do file the pointer to the Apple AMX kernels in an issue on the OpenBLAS GitHub repo. Yes, it would be great for OpenBLAS to have those kernels.
I tried my best and opened a new issue there (see https://github.com/xianyi/OpenBLAS/issues/3789). Let's see what they think about that.
Might be relevant: https://github.com/mlpack/mlpack/pull/3308
I just got a shiny new Mac Mini with an M2 Pro, so I thought I'd see how Apple Accelerate scales. I timed gemm and lu with both OpenBLAS and Accelerate. It seems that Accelerate's advantage declines as the problem size increases; this is worse with lu than with gemm.
It's also interesting, at least to me, that Accelerate does so well on the single-precision matrix multiply.
This is far from a definitive analysis, but it makes me nervous about swapping OpenBLAS for Accelerate.
I ran this on 1.9-beta3:
julia> using Random, LinearAlgebra, AppleAccelerateLinAlgWrapper, BenchmarkTools
julia> function testme(T = Float64)
           Random.seed!(46071)
           for p = 8:13
               N = 2^p
               A = rand(T, N, N)
               tblasmm = @belapsed $A * $A
               taccmm = @belapsed AppleAccelerateLinAlgWrapper.gemm($A, $A)
               println("MM: Dim= $N. BLAS time = $tblasmm. Apple time = $taccmm")
               tblaslu = @belapsed lu($A)   # LinearAlgebra.lu, i.e. the default BLAS/LAPACK
               tacclu = @belapsed AppleAccelerateLinAlgWrapper.lu($A)
               println("LU: Dim= $N. BLAS time = $tblaslu. Apple time = $tacclu")
           end
       end
testme (generic function with 2 methods)
The results for double precision:
julia> testme()
MM: Dim= 256. BLAS time = 1.84333e-04. Apple time = 9.31660e-05
LU: Dim= 256. BLAS time = 7.17958e-04. Apple time = 1.82250e-04
MM: Dim= 512. BLAS time = 8.25958e-04. Apple time = 4.33709e-04
LU: Dim= 512. BLAS time = 1.27475e-03. Apple time = 1.05083e-03
MM: Dim= 1024. BLAS time = 6.32771e-03. Apple time = 3.39408e-03
LU: Dim= 1024. BLAS time = 4.43283e-03. Apple time = 3.46121e-03
MM: Dim= 2048. BLAS time = 4.78387e-02. Apple time = 2.98090e-02
LU: Dim= 2048. BLAS time = 2.33295e-02. Apple time = 1.78365e-02
MM: Dim= 4096. BLAS time = 3.94479e-01. Apple time = 2.33508e-01
LU: Dim= 4096. BLAS time = 1.61876e-01. Apple time = 1.50505e-01
MM: Dim= 8192. BLAS time = 3.11572e+00. Apple time = 1.82149e+00
LU: Dim= 8192. BLAS time = 1.17175e+00. Apple time = 2.54976e+00
and for single precision:
julia> testme(Float32)
MM: Dim= 256. BLAS time = 1.34667e-04. Apple time = 2.45840e-05
LU: Dim= 256. BLAS time = 6.53583e-04. Apple time = 1.25709e-04
MM: Dim= 512. BLAS time = 4.42458e-04. Apple time = 1.05875e-04
LU: Dim= 512. BLAS time = 1.26879e-03. Apple time = 5.36500e-04
MM: Dim= 1024. BLAS time = 3.32025e-03. Apple time = 8.74250e-04
LU: Dim= 1024. BLAS time = 3.40737e-03. Apple time = 2.66488e-03
MM: Dim= 2048. BLAS time = 2.44754e-02. Apple time = 9.16629e-03
LU: Dim= 2048. BLAS time = 1.38886e-02. Apple time = 1.42406e-02
MM: Dim= 4096. BLAS time = 1.94998e-01. Apple time = 7.03759e-02
LU: Dim= 4096. BLAS time = 8.70666e-02. Apple time = 8.09671e-02
MM: Dim= 8192. BLAS time = 1.54402e+00. Apple time = 5.09572e-01
LU: Dim= 8192. BLAS time = 6.15488e-01. Apple time = 6.45579e-01
So the decision might depend on your application scenario.
For machine learning, the decision would be clear (tested on a MacBook Pro M2 Max, Julia head from 2023-01-26):
testme(Float32)
MM: Dim= 256. BLAS time = 5.1542e-5. Apple time = 2.4458e-5 factor=2.107367732439284
MM: Dim= 512. BLAS time = 0.000364542. Apple time = 0.0001065 factor=3.4229295774647883
MM: Dim= 1024. BLAS time = 0.003256166. Apple time = 0.000854958 factor=3.8085683741189627
MM: Dim= 2048. BLAS time = 0.024809042. Apple time = 0.008663458 factor=2.8636419776029385
MM: Dim= 4096. BLAS time = 0.205842959. Apple time = 0.067481875 factor=3.0503443924757576
MM: Dim= 8192. BLAS time = 1.73104175. Apple time = 0.503544333 factor=3.4377146887680294
A pretty consistent 3x speed advantage of Accelerate over OpenBLAS for matrix sizes relevant for machine learning operations.
I'd expect OpenBLAS sgemm to take at least 1.3 seconds with 8 cores for the 8192x8192 matrices:
julia> 2 * 8192^3 / (4*4*2*3.25e9*8)
1.3215283987692308
It requires 8192^3 fused multiply-adds, i.e. 2 * 8192^3 FLOPs. The CPU has 4 execution units with 4 Float32 lanes each, doing 2 FLOPs per fma instruction, running at around 3.25e9 clock cycles/second, and there are 8 cores.
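The same estimate written as a function (a sketch; the parameters are just the machine assumptions above):

# 2N^3 FLOPs for an N x N sgemm, divided by the peak Float32 FLOP rate
expected_sgemm_seconds(N; units=4, lanes=4, flops_per_fma=2, clock=3.25e9, cores=8) =
    2N^3 / (units * lanes * flops_per_fma * clock * cores)

expected_sgemm_seconds(8192)  # ≈ 1.32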
So, the 0.615s reported by @ctkelley sounds too fast, and @domschl's 1.73s sounds realistic.
Odd that @domschl's Accelerate time was faster (0.5 vs 0.65s).
My 0.615 was for LU. My MM numbers are pretty close to what @domschl got. So the OpenBLAS LU time is roughly 1/3 of the OpenBLAS MM time, as you would expect. The Apple LU times are hard for me to understand as the dimension grows. For dim = 8192, LU takes more time than MM.
This might be an interesting simplification: new support for an ILP64 interface:
Nice!
Okay, I spun up a VM and tried it out. The good news is, many things work! The bad news is, it requires a hack to LBT to use their symbol names, since they don't use a simple suffix on the F77 symbols; they drop the trailing underscore from the symbol name (e.g. dgemm_ -> dgemm$NEWLAPACK$ILP64). I've requested that they keep that trailing underscore (both after dgemm and after ILP64, to maintain compatibility with gfortran compilers, which require a trailing underscore for all symbol names) but we'll see what they say. Another good piece of news is that their LAPACK implementation has been updated from 3.2.x to 3.9.x, so I think we're seeing a significant increase in Accelerate functionality!
I was running inside of a VM so benchmarks are basically useless, so all I'll say is that Accelerate (in the VM) was faster than OpenBLAS (in the VM) by about a factor of 3x when running peakflops().
I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build, or default to openblas (which we would continue to ship for a long time). This saves the effort of making our BLAS runtime switchable. Apple was one of the last holdouts.
OpenBLAS does have multi-threaded solvers (it patches LAPACK), so I am curious how the LU and Cholesky factorization performance stacks up.
> I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build
Yes, such a switch is actually quite easy to implement; we can even just try loading ILP64 Accelerate, and if it fails, we load OpenBLAS instead. It would go right here: https://github.com/JuliaLang/julia/blob/7ba7e326293bd3eddede81567bbe98078e81f775/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L645. We could also have it set via a Preference, or something like that.
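Something like this minimal sketch, assuming the suffix_hint keyword that newer LBT versions expose for Accelerate's mangled names, and assuming a zero return value signals that no symbols were forwarded:

using LinearAlgebra

function choose_default_blas()
    acc = "/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate"
    # Forward only the new ILP64 symbols; if none are found (macOS < 13.3),
    # fall back to the bundled OpenBLAS.
    if BLAS.lbt_forward(acc; clear=true, suffix_hint="\$NEWLAPACK\$ILP64") == 0
        BLAS.lbt_forward("libopenblas64_.dylib"; clear=true)
    end
end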
> I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build, or default to openblas (which we would continue to ship for a long time). This saves the effort of making our BLAS runtime switchable. Apple was one of the last holdouts.
> OpenBLAS does have multi-threaded solvers (it patches LAPACK), so I am curious how the LU and Cholesky factorization performance stacks up.
I'd like to see how the matrix factorizations do as well. Things were a bit strange (see my post above) with the old version. If everything is 3X faster now, everyone wins.
From my anecdotal experience with ICA (running in Python via NumPy), I found that Accelerate is between 5x and 15x faster than OpenBLAS. The OpenBLAS implementation is so slow that even a 10-year-old Intel Mac has pretty much the same performance.
Again, this is only anecdotal evidence, and I am certainly not trying to bash OpenBLAS. However, I think this underscores the importance of being able to use an optimized BLAS on every supported platform. Currently, BLAS-dependent calculations are much slower than they could/should be on Apple Silicon, to the point that (depending on the algorithm, of course) Julia might no longer be the best choice for extremely fast performance.
Are you seeing this speedup for factorizations (LU, Cholesky, SVD, QR, ...)? I am not, with the older (pre OS 12.3) version of Accelerate.
To be honest, I don't know which operations are involved in those ICA algorithms, but I'm guessing that SVD is very likely part of it. I am on macOS 13.2.1.
In case I misunderstood your question about factorization, I ran parts of the NumPy benchmark suite with OpenBLAS and Accelerate. Depending on the benchmark, I get a speedup of between 2x and 5x (note that I only ran a very small subset of benchmarks, bench_core and bench_linalg).
I'm attaching the results for the SVD portion of the benchmark.
With macOS 13.3 introducing 64-bit BLAS, we should be able to use Accelerate.
> In case I misunderstood your question about factorization, I ran parts of the NumPy benchmark suite with OpenBLAS and Accelerate. Depending on the benchmark, I get a speedup of between 2x and 5x (note that I only ran a very small subset of benchmarks, bench_core and bench_linalg). I'm attaching the results for the SVD portion of the benchmark.
Accelerate:
svd int16       898±0.9μs
svd float16     n/a
svd int32       893±5μs
svd float32     895±5μs
svd int64       887±2μs
svd float64     888±4μs
svd complex64   2.29±0.01ms
svd longfloat   888±4μs
svd complex128  2.27±0.02ms
OpenBLAS:
svd int16       1.65±0.3ms
svd float16     n/a
svd int32       1.40±0.3ms
svd float32     1.53±0.3ms
svd int64       1.41±0.2ms
svd float64     1.53±0.2ms
svd complex64   3.11±0.2ms
svd longfloat   1.57±0.3ms
svd complex128  2.88±0.1ms
What was the dimension of this problem?
I ran this file:
https://github.com/numpy/numpy/blob/main/benchmarks/benchmarks/bench_linalg.py
So I think it should be a 150 × 400 array.
FYI macOS 13.3 is out.
I installed the new OS this morning and redid my experiments. I got similar results, and I'm not sure I'm getting the new Accelerate. I did not change the line
BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate")
in AppleAccelerateLinAlgWrapper.jl.
If I did get the new Accelerate, then there is no change in performance that I can see.
I tried changing that line to
BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Versions/A/AccelerateNew")
and using AppleAccelerateLinAlgWrapper did not complain. However, when I tried to run the experiment I got several lines of
Error: no BLAS/LAPACK library loaded!
Any ideas out there?
I don't have anything called AccelerateNew in that folder.
I don't see Accelerate either.
I see
➜ vecLib.framework ls
Resources libLAPACK.dylib libvDSP.dylib
Versions libLinearAlgebra.dylib libvMisc.dylib
libBLAS.dylib libQuadrature.dylib vecLib
libBNNS.dylib libSparse.dylib
libBigNum.dylib libSparseBLAS.dylib
but those are all symlinks that don't point to anything. Not sure how all that works on macOS.
I'm trying to pattern-match @chriselrod's package with very little understanding of what is happening.
I can do this, but it all looks like it is LP64. So one needs to figure out how to get to the ILP64 interface:
julia> BLAS.lbt_forward("/System/Library/Frameworks/vecLib.framework/libLAPACK.dylib")
1705
julia> BLAS.lbt_get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] libopenblas64_.dylib
└ [ LP64] libLAPACK.dylib
The release notes at https://developer.apple.com/documentation/macos-release-notes/macos-13_3-release-notes say:
The BLAS and LAPACK libraries under the Accelerate framework are now inline with reference version 3.9.1. These new interfaces provide additional functionality and a new ILP64 interface. To use the new interfaces, define ACCELERATE_NEW_LAPACK before including the Accelerate or vecLib headers. For ILP64 interfaces, also define ACCELERATE_LAPACK_ILP64. For Swift projects, specify ACCELERATE_NEW_LAPACK=1 and ACCELERATE_LAPACK_ILP64=1 as preprocessor macros in Xcode build settings. (105572917)
I guess we could build a small project with those flags in Xcode and check what it links to?
Xcode 14.3 doesn't seem to be out yet.
The issue is that the new ILP64 symbols aren't just a suffix of the LP64 symbols; dgemm_ (LP64) gets changed to dgemm$NEWLAPACK$ILP64 (note the missing trailing underscore). So LBT can't find the symbols, because it only looks for symbols that have prefixes/suffixes on the typical name. It doesn't know how to remove pieces of the symbol name. I'll have to add that to LBT properly.
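You can see the problem from the Julia side with a quick probe (a sketch using Libdl):

using Libdl

h = dlopen("/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate")
dlsym(h, "dgemm_"; throw_error=false)                    # found: the classic LP64 symbol
dlsym(h, "dgemm_\$NEWLAPACK\$ILP64"; throw_error=false)  # nothing: a plain suffix on dgemm_ doesn't exist
dlsym(h, "dgemm\$NEWLAPACK\$ILP64"; throw_error=false)   # found on macOS >= 13.3: the underscore is dropped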
I've tried to create a minimal Xcode test program to check whether ILP64 is working, using the instructions communicated by Apple. So far, I have been unable to make it work. Maybe I'm doing something wrong, or maybe Apple still needs to deliver an Xcode update. You can paste the attached code into your command line to try it yourself (at least 16GB RAM needed). test_apple_ilp64.sh.txt
> export ACCELERATE_NEW_LAPACK=1
> export ACCELERATE_LAPACK_ILP64=1

Those should be -DACCELERATE_NEW_LAPACK=1 -DACCELERATE_LAPACK_ILP64=1 on your clang command line.
Many thanks for your suggestion; however, adding these macros doesn't change the result on my Mac. Do I have to call a special function? I assumed that with these macros/variables, Apple automatically links the correct version of the function. test_apple_ilp64.sh.txt
> Maybe I'm doing something wrong, or maybe Apple still needs to deliver an Xcode update. You can paste the attached code into your command line to try it yourself (at least 16GB RAM needed).

The release candidate of Xcode 14.3 is ready for download at https://xcodereleases.com
I got it and ran your script. Here's what happened:
% sh test_apple_ilp64.sh-2.txt
test_apple_ilp64.c:17:11: warning: 'cblas_ddot' is only available on macOS 13.3 or newer [-Wunguarded-availability-new]
dotaa=cblas_ddot(sa, a, inca, a, inca);
^~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas_new.h:219:8: note: 'cblas_ddot' has been marked as being introduced in macOS 13.3 here, but the deployment target is macOS 13.0.0
double cblas_ddot(const __LAPACK_int N, const double * _Nullable X, const __LAPACK_int INCX,
^
test_apple_ilp64.c:17:11: note: enclose 'cblas_ddot' in a __builtin_available check to silence this warning
dotaa=cblas_ddot(sa, a, inca, a, inca);
^~~~~~~~~~
1 warning generated.
DDOT ILP64 CHECK: ARRAY SIZE=2147483648, CORRECT RESULT=8589934592.00000, BLAS RESULT=8589934592.000000
I don't understand this. I am running 13.3 and did a restart to make sure that the new Xcode was running. I got the (very tedious) license BS you get when starting Xcode for the first time, so I think that's right. However, your script generates a warning telling me that something is out of date.
That looks better! Use nm ./test_apple_ilp64 to see what symbols it's linking against; you want to see the $NEWLAPACK$ILP64 suffixes on your ddot symbols.
> Maybe I'm doing something wrong, or maybe Apple still needs to deliver an Xcode update. You can paste the attached code into your command line to try it yourself (at least 16GB RAM needed).
>
> The release candidate of Xcode 14.3 is ready for download at https://xcodereleases.com
Many thanks, this resolved the issue :) In addition to macOS 13.3, you obviously also need Xcode 14.3 to make it work. I think the warning you get (I get the same one) just informs you that this code will not run on macOS versions < 13.3.
> That looks better! Use nm ./test_apple_ilp64 to see what symbols it's linking against; you want to see the $NEWLAPACK$ILP64 suffixes on your ddot symbols.
Yes, you were right. These variables need to be set as preprocessor directives. Just as an idea: would it be possible to add the very same preprocessor variables to the Julia build environment? That way, the macros defined by Apple (see, e.g., lapack_version.h) should automatically pick the correct symbol for the user environment. However, this is probably not compatible with the logic of LBT and the rest...
There may be a hardware explanation for why Accelerate is consistently slower than OpenBLAS (for everything except SGEMM, DGEMM, or other functions that basically wrap GEMM in a different interface):
https://github.com/corsix/amx/issues/6#issuecomment-1477091144
> Also disappointing: AMX vector throughput is less than CPU NEON vector throughput. Perhaps that's why Apple's BLAS library consistently underperforms OpenBLAS by a factor of two. Instead of using the NEON units in a multithreaded setting, the CPUs all fight for the same AMX block with less theoretical FLOPS.
Most of linear algebra is O(n^3), but few O(n^3) algorithms can have all 3 n dimensions parallelized. In a sense, this is why very few algorithms run faster on the AMX.
> There may be a hardware explanation for why Accelerate is consistently slower than OpenBLAS (for everything except SGEMM, DGEMM, or other functions that basically wrap GEMM in a different interface):
So, would it make sense to use Apple's BLAS and the LAPACK from OpenBLAS? Can LBT do that?
> So, would it make sense to use Apple's BLAS and the LAPACK from OpenBLAS? Can LBT do that?
We make a wrapper. Call into Accelerate for GEMM, OpenBLAS for everything else.
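A crude package-level prototype of that split (a sketch assuming AppleAccelerateLinAlgWrapper from earlier in the thread; fastmul is a made-up name):

using LinearAlgebra, AppleAccelerateLinAlgWrapper

# Dense Float32/Float64 matmul goes to Accelerate (and thus the AMX);
# everything else, including the factorizations, stays on the default BLAS.
fastmul(A::Matrix{T}, B::Matrix{T}) where {T<:Union{Float32,Float64}} =
    AppleAccelerateLinAlgWrapper.gemm(A, B)
fastmul(A, B) = A * B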
There are quite a few functions in Accelerate that are faster than the current implementation, e.g. vDSP_maxvD vs maximum, though they are not BLAS.
The default BLAS Julia uses is OpenBLAS. Apple's M1 has proprietary dedicated matrix hardware that is only accessible via Apple's Accelerate BLAS implementation. That proprietary interface can provide 2x to 4x speedups for some linear algebra use cases (see https://discourse.julialang.org/t/does-mac-m1-in-multithreads-is-slower-that-in-single-thread/61114/12?u=kristoffer.carlsson for some benchmarks and discussion.)
Since Julia 1.7 there's a BLAS multiplexer:
So in theory, it should be possible to extend this so that depending on a given platform either OpenBLAS or other BLAS solutions are used transparently by default.
So this issue discusses what needs to be done to make Apple's Accelerate, and with it access to M1 hardware acceleration, available by default in Julia.