JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

BLAS support for M1 ARM64 via Apple's Accelerate #42312

Closed domschl closed 1 year ago

domschl commented 3 years ago

The default BLAS Julia uses is OpenBLAS. Apple's M1 has proprietary dedicated matrix hardware that is only accessible via Apple's Accelerate BLAS implementation. That proprietary interface can provide 2x to 4x speedups for some linear algebra use cases (see https://discourse.julialang.org/t/does-mac-m1-in-multithreads-is-slower-that-in-single-thread/61114/12?u=kristoffer.carlsson for some benchmarks and discussion.)

Since Julia 1.7 there's a BLAS multiplexer, libblastrampoline (LBT).

So in theory, it should be possible to extend this so that, depending on the platform, either OpenBLAS or another BLAS implementation is used transparently by default.

So this issue discusses what needs to be done to make Apple's Accelerate, and with it M1 hardware acceleration, available by default in Julia.

ViralBShah commented 3 years ago

The plan, for now, is to have LBT as a way to pick a different BLAS than the default OpenBLAS. That requires you to load a package every time you start Julia to change the default. Eventually, once the Preferences mechanism becomes standard, we want to use it so that users can pick a different BLAS by default.
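
For illustration, a minimal sketch of what switching through LBT looks like at runtime (the same lbt_forward mechanism used later in this thread; the path is the standard macOS location of the Accelerate framework):

```
using LinearAlgebra

# Forward BLAS/LAPACK calls to Accelerate (LP64 interface only, which is
# exactly the limitation discussed below).
BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate")

# Inspect which backing libraries LBT is now using.
BLAS.lbt_get_config()
```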

I don't think we want to depend on the Apple provided BLAS by default for the M1 for now.

domschl commented 3 years ago

Some (anecdotal) benchmark scenarios that might illustrate why Accelerate makes sense (at least for gemm-type operations): I've written a machine-learning framework and benchmark suite, syncognite/bench, in C++ using eigen3 (so neither Julia nor OpenBLAS), but it might serve as a hint for the kind of performance improvements that are possible.

The following table compares a benchmark run with native eigen3 on ARM64 against a second run using Apple's Accelerate within eigen (via the USE_SYSTEM_BLAS variable). The percentage columns show the speed-up for the forward (fw) and backward (bw) passes of neural-network training when using Accelerate, all with very low energy usage:

Layer             fw(ms) fw/N(ms)   %  bw(ms) bw/N(ms)   %
OneHot             1.068  0.0107  0.1   0.000  0.0000 66.6
Affine             0.464  0.0050 620.1  1.210  0.0130 485.4
AffineRelu         0.538  0.0056 506.4  1.294  0.0135 424.4
Relu               0.023  0.0002  5.8   0.423  0.0042 -1.8
Nonlin-Relu        0.064  0.0006 -1.9   0.067  0.0007  0.7
Nonlin-Sgmd        0.126  0.0012 -1.2   0.067  0.0007 -1.8
Nonlin-Tanh        0.099  0.0010 -0.0   0.070  0.0007 -0.5
Nonlin-Selu        0.746  0.0075 -0.1   0.649  0.0065 -0.4
Nonlin-Resilu      0.202  0.0020  0.2   0.243  0.0024 -0.6
Dropout            0.713  0.0072 -0.8   0.030  0.0003  1.5
BatchNorm          0.159  0.0016 -3.1   0.217  0.0022 -4.6
SpatialBtchNrm     2.610  0.0260 -1.0   3.936  0.0393 -0.2
Convolution       54.388  0.5470 20.3  58.559  0.5869 45.1
Pooling            6.421  0.0643 -0.0  22.589  0.2262  1.7
LSTM              27.616  0.2778 299.8 78.172  0.7677 198.0
RNN               16.929  0.1660 110.4 33.648  0.3297 114.6
TemporalAff        5.704  0.0569 105.7  5.545  0.0554 209.3
WordEmbed          5.379  0.0541 129.1  4.227  0.0419 116.0
SVM                0.229  0.0023  0.5   0.225  0.0022  0.9
Softmax            0.228  0.0023  0.1   0.022  0.0002  1.5
TemporalSM         4.356  0.0434 -0.8   1.401  0.0140  0.1
TwoLayerNet        1.776  0.0177 300.5  2.555  0.0255 425.1
DilumAluthge commented 3 years ago

I think @chriselrod has played around with this a little bit.

ViralBShah commented 3 years ago

The first thing is to create an MKL.jl-like package for Apple Accelerate. We already have some support in LBT for Accelerate, so this should be fairly quick.

chriselrod commented 3 years ago

A difficulty is that Accelerate uses the 32-bit (LP64) API, which is why AppleAccelerateLinAlgWrapper manually defines the methods it uses (it is based on Elliot's code).

(Also, AppleAccelerateLinAlgWrapper has a deliberately cumbersome name to avoid stealing/squatting on potentially valuable names for a future such package that supersedes it.)
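
To make the 32-bit-API point concrete, here is a minimal sketch (not AppleAccelerateLinAlgWrapper's actual code) of wrapping Accelerate's LP64 dgemm_ by hand; note the Int32 dimension arguments, where Julia's own ILP64 OpenBLAS build would use Int64:

```
const libacc = "/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate"

# C = A*B via Accelerate's Fortran dgemm_; all integer arguments are 32-bit.
function acc_gemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    m, k = size(A); n = size(B, 2)
    ccall((:dgemm_, libacc), Cvoid,
          (Ref{UInt8}, Ref{UInt8},                  # transA, transB
           Ref{Int32}, Ref{Int32}, Ref{Int32},      # m, n, k (32-bit!)
           Ref{Float64}, Ptr{Float64}, Ref{Int32},  # alpha, A, lda
           Ptr{Float64}, Ref{Int32},                # B, ldb
           Ref{Float64}, Ptr{Float64}, Ref{Int32}), # beta, C, ldc
          UInt8('N'), UInt8('N'), m, n, k,
          1.0, A, m, B, k, 0.0, C, m)
    return C
end
```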

ViralBShah commented 2 years ago

Since Accelerate only has an LP64 BLAS and Julia wants ILP64, this is quite difficult, unless we can somehow make this choice dynamic as discussed in https://github.com/JuliaLang/julia/issues/43304.

It should be possible to have a separate wrapper like @chriselrod discusses above that packages can directly invoke, but swapping it in as the default BLAS in Julia is fairly non-trivial.

zinphi commented 2 years ago

I'm no expert, but wouldn't it be quite easy to write some kind of wrapper libblas that just redirects level-3 BLAS calls to Apple's Accelerate BLAS and all other calls to OpenBLAS? I mean, ILP64 does not really play a role for level-3 BLAS imho anyway. On the other hand, level-3 BLAS routines are probably the only routines which benefit from Apple's AMX extension…
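
One concrete piece of bookkeeping such a shim would need, sketched here under the assumption that callers pass 64-bit integers: each dimension must be downcast to Int32 with an overflow check, which for level-3 operand sizes will essentially never fire (the matrices wouldn't fit in memory):

```
# Sketch: per-call dimension downcast in an ILP64 -> LP64 shim.
function checked_i32(n::Integer)
    0 <= n <= typemax(Int32) ||
        throw(ArgumentError("dimension $n out of range for an LP64 BLAS"))
    return Int32(n)
end
```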

ViralBShah commented 2 years ago

Yes, we could have a wrapper library that redirects all 64-bit ILP64 calls to a 32-bit BLAS. It seems like it would be easier to have Apple just provide ILP64 support with mangled names; Intel is doing that in MKL now.

Do we have a way to ask Apple to do this? @Keno @staticfloat Perhaps one of you has a contact at Apple?

Keno commented 2 years ago

We can ask.

zinphi commented 2 years ago

Nice that you guys have the appropriate contacts ;-) However, from what I've heard in other discussions, Apple currently seems to assign very few resources to their BLAS/LAPACK development, so I wouldn't bet on them... Nevertheless, I keep my fingers crossed ^^ The other proper solution would be to have AMX kernels implemented in OpenBLAS. It seems that (Apple M1) AMX has now been (unofficially) more or less completely reverse-engineered: https://github.com/corsix/amx. I guess with this knowledge the OpenBLAS folks should be able to do their job (if no legal restrictions apply).

ViralBShah commented 2 years ago

Please do file the pointer to the Apple AMX kernels in an issue on the OpenBLAS GitHub repo. Yes, it would be great for OpenBLAS to have those kernels.

zinphi commented 2 years ago

I tried my best and opened a new issue there (see https://github.com/xianyi/OpenBLAS/issues/3789). Let's see what they think about that.

mzy2240 commented 1 year ago

Might be relevant: https://github.com/mlpack/mlpack/pull/3308

ctkelley commented 1 year ago

I just got a shiny new Mac Mini with an M2 Pro, so I thought I'd see how Apple Accelerate scales. I timed gemm and lu with both OpenBLAS and Accelerate. It seems that Accelerate's advantage declines as the problem size increases; this is worse with lu than with gemm.

It's also interesting, at least to me, that Accelerate does so well for the single-precision matrix multiply.

This is far from a definitive analysis, but it makes me nervous about swapping OpenBLAS for Accelerate.

I ran this on 1.9-beta3:

julia> using Random, LinearAlgebra, AppleAccelerateLinAlgWrapper, BenchmarkTools

julia> function testme(T = Float64)
           Random.seed!(46071)
           for p = 8:13
               N = 2^p
               A = rand(T, N, N)

               tblasmm = @belapsed $A * $A
               taccmm = @belapsed AppleAccelerateLinAlgWrapper.gemm($A, $A)
               println("MM: Dim= $N. BLAS time = $tblasmm. Apple time = $taccmm")

               tblaslu = @belapsed lu($A)
               tacclu = @belapsed AppleAccelerateLinAlgWrapper.lu($A)
               println("LU: Dim= $N. BLAS time = $tblaslu. Apple time = $tacclu")
           end
       end
testme (generic function with 2 methods)

The results for double precision:

julia> testme()
MM: Dim= 256. BLAS time = 1.84333e-04. Apple time = 9.31660e-05
LU: Dim= 256. BLAS time = 7.17958e-04. Apple time = 1.82250e-04
MM: Dim= 512. BLAS time = 8.25958e-04. Apple time = 4.33709e-04
LU: Dim= 512. BLAS time = 1.27475e-03. Apple time = 1.05083e-03
MM: Dim= 1024. BLAS time = 6.32771e-03. Apple time = 3.39408e-03
LU: Dim= 1024. BLAS time = 4.43283e-03. Apple time = 3.46121e-03
MM: Dim= 2048. BLAS time = 4.78387e-02. Apple time = 2.98090e-02
LU: Dim= 2048. BLAS time = 2.33295e-02. Apple time = 1.78365e-02
MM: Dim= 4096. BLAS time = 3.94479e-01. Apple time = 2.33508e-01
LU: Dim= 4096. BLAS time = 1.61876e-01. Apple time = 1.50505e-01
MM: Dim= 8192. BLAS time = 3.11572e+00. Apple time = 1.82149e+00
LU: Dim= 8192. BLAS time = 1.17175e+00. Apple time = 2.54976e+00

and for single precision:

julia> testme(Float32)
MM: Dim= 256. BLAS time = 1.34667e-04. Apple time = 2.45840e-05
LU: Dim= 256. BLAS time = 6.53583e-04. Apple time = 1.25709e-04
MM: Dim= 512. BLAS time = 4.42458e-04. Apple time = 1.05875e-04
LU: Dim= 512. BLAS time = 1.26879e-03. Apple time = 5.36500e-04
MM: Dim= 1024. BLAS time = 3.32025e-03. Apple time = 8.74250e-04
LU: Dim= 1024. BLAS time = 3.40737e-03. Apple time = 2.66488e-03
MM: Dim= 2048. BLAS time = 2.44754e-02. Apple time = 9.16629e-03
LU: Dim= 2048. BLAS time = 1.38886e-02. Apple time = 1.42406e-02
MM: Dim= 4096. BLAS time = 1.94998e-01. Apple time = 7.03759e-02
LU: Dim= 4096. BLAS time = 8.70666e-02. Apple time = 8.09671e-02
MM: Dim= 8192. BLAS time = 1.54402e+00. Apple time = 5.09572e-01
LU: Dim= 8192. BLAS time = 6.15488e-01. Apple time = 6.45579e-01
domschl commented 1 year ago

So the decision might depend on your application scenario.

For machine learning, the decision would be clear (tested on MacBook Pro M2 Max, Julia head from 2023-01-26):

testme(Float32)
MM: Dim= 256. BLAS time = 5.1542e-5. Apple time = 2.4458e-5 factor=2.107367732439284
MM: Dim= 512. BLAS time = 0.000364542. Apple time = 0.0001065 factor=3.4229295774647883
MM: Dim= 1024. BLAS time = 0.003256166. Apple time = 0.000854958 factor=3.8085683741189627
MM: Dim= 2048. BLAS time = 0.024809042. Apple time = 0.008663458 factor=2.8636419776029385
MM: Dim= 4096. BLAS time = 0.205842959. Apple time = 0.067481875 factor=3.0503443924757576
MM: Dim= 8192. BLAS time = 1.73104175. Apple time = 0.503544333 factor=3.4377146887680294

A pretty consistent 3x speed advantage of Accelerate over OpenBLAS for matrix sizes relevant for machine learning operations.

chriselrod commented 1 year ago

I'd expect OpenBLAS sgemm to take at least 1.3 seconds with 8 cores for the 8192x8192 matrices:

julia> 2 * 8192^3 / (4*4*2*3.25e9*8)
1.3215283987692308

It requires 2 × 8192^3 FLOPs (8192^3 fused multiply-adds). The CPU has 4 execution units with 4 Float32 lanes each, doing 2 FLOPs per fma instruction, running at around 3.25e9 clock cycles/second, and there are 8 cores.

So the 0.615 s reported by @ctkelley sounds too fast, and @domschl's 1.73 s realistic.

Odd that @domschl's Accelerate time was faster (0.5 vs 0.65 s).

ctkelley commented 1 year ago

My 0.615 was for LU. My MM numbers are pretty close to what @domschl got. So the OpenBLAS LU time is roughly 1/3 of the OpenBLAS MM time, as you would expect (LU needs about (2/3)n^3 FLOPs versus 2n^3 for MM). The Apple LU times are hard for me to understand as the dimension grows: for dim = 8192, LU takes more time than MM.

domschl commented 1 year ago

Might be an interesting simplification: the Release Notes for Ventura 13.3 Beta 3 announce new support for an ILP64 interface under Accelerate's New Features.

ViralBShah commented 1 year ago

Nice!

staticfloat commented 1 year ago

Okay, I spun up a VM and tried it out. The good news is, many things work! The bad news is, it requires a hack to LBT to use their symbol names since they don't use a simple suffix on the F77 symbols, they drop the trailing underscore from the symbol name (e.g. dgemm_ -> dgemm$NEWLAPACK$ILP64). I've requested that they keep that trailing underscore (both after dgemm and after ILP64, to maintain compatibility with gfortran compilers which require a trailing underscore for all symbol names) but we'll see what they say. Another good piece of news is that their LAPACK implementation has been updated from 3.2.x to 3.9.x, so I think we're seeing a significant increase in Accelerate functionality!

I was running inside of a VM, so benchmarks are basically useless; all I'll say is that Accelerate (in the VM) was faster than OpenBLAS (in the VM) by about a factor of 3x when running peakflops().

ViralBShah commented 1 year ago

I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build, or default to OpenBLAS (which we would continue to ship for a long time). This saves the effort of making our BLAS runtime switchable. Apple was one of the last holdouts.

OpenBLAS does have multi-threaded solvers (it patches LAPACK), so I am curious how the LU and Cholesky factorization performance stacks up.

staticfloat commented 1 year ago

I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build

Yes, such a switch is actually quite easy to implement; we can even just try loading ILP64 Accelerate, and if it fails, we load OpenBLAS instead. It would go right here: https://github.com/JuliaLang/julia/blob/7ba7e326293bd3eddede81567bbe98078e81f775/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L645. We could also have it set via a Preference, or something like that.
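
A hedged sketch of that fallback (not actual stdlib code; it assumes the dgemm$NEWLAPACK$ILP64 symbol scheme described earlier in this thread):

```
using LinearAlgebra, Libdl

const ACCELERATE = "/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate"

# Prefer ILP64 Accelerate when present; otherwise keep the shipped OpenBLAS.
function pick_default_blas()
    handle = Libdl.dlopen_e(ACCELERATE)
    if handle != C_NULL && Libdl.dlsym_e(handle, "dgemm\$NEWLAPACK\$ILP64") != C_NULL
        BLAS.lbt_forward(ACCELERATE; clear = true)
    end  # on older macOS the symbol is missing and the default stays in place
end
```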

ctkelley commented 1 year ago

I suppose that LBT can pick Accelerate if we are on the right macOS version in the default Julia build, or default to OpenBLAS (which we would continue to ship for a long time). This saves the effort of making our BLAS runtime switchable. Apple was one of the last holdouts.

OpenBLAS does have multi-threaded solvers (it patches LAPACK), so I am curious how the LU and Cholesky factorization performance stacks up.

I'd like to see how the matrix factorizations do as well. Things were a bit strange (see my post above) with the old version. If everything is 3X faster now, everyone wins.

cbrnr commented 1 year ago

From my anecdotal experience with ICA (running in Python via NumPy), I found that Accelerate is between 5x and 15x faster than OpenBLAS. The OpenBLAS implementation is so slow that even a 10-year-old Intel Mac has pretty much the same performance.

Again, this is only anecdotal evidence, and I am certainly not trying to bash OpenBLAS. However, I think this underscores the importance of being able to use an optimized BLAS on every supported platform. Currently, BLAS-dependent calculations are much slower than they could/should be on Apple Silicon, to the point that (depending on the algorithm, of course) Julia might no longer be the best choice for extremely fast performance.

ctkelley commented 1 year ago

Are you seeing this speedup for factorizations (LU, Cholesky, SVD, QR, ...)? I am not, for the older (pre OS 12.3) version of Accelerate.

cbrnr commented 1 year ago

To be honest, I don't know which operations are involved in those ICA algorithms, but I'm guessing that SVD is very likely part of it. I am on macOS 13.2.1.

cbrnr commented 1 year ago

In case I misunderstood your question about factorization, I ran parts of the NumPy benchmark suite with OpenBLAS and Accelerate. Depending on the benchmark, I get a speedup of between 2x and 5x (note that I only ran a very small subset of benchmarks, bench_core and bench_linalg).

I'm attaching the results for the SVD portion of the benchmark.

Accelerate:

```
svd      int16       898±0.9μs
svd     float16         n/a
svd      int32        893±5μs
svd     float32       895±5μs
svd      int64        887±2μs
svd     float64       888±4μs
svd    complex64    2.29±0.01ms
svd    longfloat      888±4μs
svd    complex128   2.27±0.02ms
```

OpenBLAS:

```
svd      int16      1.65±0.3ms
svd     float16        n/a
svd      int32      1.40±0.3ms
svd     float32     1.53±0.3ms
svd      int64      1.41±0.2ms
svd     float64     1.53±0.2ms
svd    complex64    3.11±0.2ms
svd    longfloat    1.57±0.3ms
svd    complex128   2.88±0.1ms
```

ViralBShah commented 1 year ago

With macOS 13.3 introducing a 64-bit (ILP64) BLAS, we should be able to use Accelerate.

ctkelley commented 1 year ago

In case I misunderstood your question about factorization, I ran parts of the NumPy benchmark suite with OpenBLAS and Accelerate. Depending on the benchmark, I get a speedup of between 2x and 5x (note that I only ran a very small subset of benchmarks, bench_core and bench_linalg).

I'm attaching the results for the SVD portion of the benchmark.

Accelerate:

               svd      int16       898±0.9μs   
               svd     float16         n/a      
               svd      int32        893±5μs    
               svd     float32       895±5μs    
               svd      int64        887±2μs    
               svd     float64       888±4μs    
               svd    complex64    2.29±0.01ms  
               svd    longfloat      888±4μs    
               svd    complex128   2.27±0.02ms  

OpenBLAS:

               svd      int16      1.65±0.3ms 
               svd     float16        n/a     
               svd      int32      1.40±0.3ms 
               svd     float32     1.53±0.3ms 
               svd      int64      1.41±0.2ms 
               svd     float64     1.53±0.2ms 
               svd    complex64    3.11±0.2ms 
               svd    longfloat    1.57±0.3ms 
               svd    complex128   2.88±0.1ms 

What was the dimension of this problem?

cbrnr commented 1 year ago

I ran this file:

https://github.com/numpy/numpy/blob/main/benchmarks/benchmarks/bench_linalg.py

So I think it should be a 150 × 400 array.

cbrnr commented 1 year ago

FYI macOS 13.3 is out.

ctkelley commented 1 year ago

I installed the new OS this morning and redid my experiments, and I got similar results. I'm not sure whether I'm actually reaching the new Accelerate. I did not change the line

BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate") 

in AppleAccelerateLinAlgWrapper.jl

If I did get the new Accelerate, then there is no change in performance that I can see.

I tried changing that line to

BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Versions/A/AccelerateNew") 

and using AppleAccelerateLinAlgWrapper did not complain. However, when I tried to run the experiment, I got several lines of

Error: no BLAS/LAPACK library loaded!

Any ideas out there?

ViralBShah commented 1 year ago

I don't have anything called AccelerateNew in that folder.

ctkelley commented 1 year ago

I don't see Accelerate either.

ViralBShah commented 1 year ago

I see

➜  vecLib.framework ls 
Resources              libLAPACK.dylib        libvDSP.dylib
Versions               libLinearAlgebra.dylib libvMisc.dylib
libBLAS.dylib          libQuadrature.dylib    vecLib
libBNNS.dylib          libSparse.dylib
libBigNum.dylib        libSparseBLAS.dylib

but those are all symlinks that don't point to anything. Not sure how all that works on macOS.

ctkelley commented 1 year ago

I'm trying to pattern-match @chriselrod's package with very little understanding of what is happening.

ViralBShah commented 1 year ago

I can do this, but it all looks like it is LP64. So one needs to figure out how to get at the ILP64 interface:

julia> BLAS.lbt_forward("/System/Library/Frameworks/vecLib.framework/libLAPACK.dylib")
1705

julia> BLAS.lbt_get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
├ [ILP64] libopenblas64_.dylib
└ [ LP64] libLAPACK.dylib

The release notes at https://developer.apple.com/documentation/macos-release-notes/macos-13_3-release-notes say:

The BLAS and LAPACK libraries under the Accelerate framework are now inline with reference version 3.9.1. These new interfaces provide additional functionality and a new ILP64 interface. To use the new interfaces, define ACCELERATE_NEW_LAPACK before including the Accelerate or vecLib headers. For ILP64 interfaces, also define ACCELERATE_LAPACK_ILP64. For Swift projects, specify ACCELERATE_NEW_LAPACK=1 and ACCELERATE_LAPACK_ILP64=1 as preprocessor macros in Xcode build settings. (105572917)

gbaraldi commented 1 year ago

I guess one could build a small project with those flags using Xcode and check what it links to?

ViralBShah commented 1 year ago

Xcode 14.3 doesn't seem to be out yet.

staticfloat commented 1 year ago

The issue is that the new ILP64 symbols aren't just a suffix of the LP64 symbols; dgemm_ (LP64) gets changed to dgemm$NEWLAPACK$ILP64 (note the missing ending underscore). So LBT can't find the symbols, because it only looks for symbols that have prefixes/suffixes on the typical name; it doesn't know how to remove pieces of the symbol name. I'll have to add that to LBT properly.

zinphi commented 1 year ago

I've tried to create a minimalistic Xcode test program to check whether ILP64 is working, using the instructions communicated by Apple. So far, I have been unable to make it work. Maybe I'm doing something wrong or maybe Apple still needs to deliver an Xcode update. You can paste the attached code into your command line to try it yourself (at least 16 GB RAM needed). test_apple_ilp64.sh.txt

staticfloat commented 1 year ago

export ACCELERATE_NEW_LAPACK=1
export ACCELERATE_LAPACK_ILP64=1

Those should be -DACCELERATE_NEW_LAPACK=1 -DACCELERATE_LAPACK_ILP64=1 on your clang command line; they are preprocessor macros, not environment variables.

zinphi commented 1 year ago

Many thanks for your suggestion; however, adding these macros isn't changing the result on my Mac. Do I have to call a special function? I assumed that with these macros/variables, Apple automatically links the correct version of the function. test_apple_ilp64.sh.txt

ctkelley commented 1 year ago

Maybe I'm doing something wrong or maybe Apple still needs to deliver an Xcode update. You can paste the attached code into your command line to try it yourself (at least 16 GB RAM needed).

The release candidate of Xcode 14.3 is ready for download at https://xcodereleases.com

I got it and ran your script. Here's what happened

% sh test_apple_ilp64.sh-2.txt
test_apple_ilp64.c:17:11: warning: 'cblas_ddot' is only available on macOS 13.3 or newer [-Wunguarded-availability-new]
    dotaa=cblas_ddot(sa, a, inca, a, inca);
          ^~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas_new.h:219:8: note: 'cblas_ddot' has been marked as being introduced in macOS 13.3 here, but the deployment target is macOS 13.0.0
double cblas_ddot(const __LAPACK_int N, const double * _Nullable X, const __LAPACK_int INCX,
       ^
test_apple_ilp64.c:17:11: note: enclose 'cblas_ddot' in a __builtin_available check to silence this warning
    dotaa=cblas_ddot(sa, a, inca, a, inca);
          ^~~~~~~~~~
1 warning generated.
DDOT ILP64 CHECK: ARRAY SIZE=2147483648, CORRECT RESULT=8589934592.00000, BLAS RESULT=8589934592.000000

I don't understand this. I am running 13.3 and did a restart to make sure that the new Xcode was running. I got the (very tedious) license BS you get when starting Xcode for the first time, so I think that's right. However, your script generates a warning telling me that something is out of date.

staticfloat commented 1 year ago

That looks better! Use nm ./test_apple_ilp64 to see what symbols it's linking against; you want to see the $NEWLAPACK$ILP64 suffixes on your ddot symbols.
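
A Julia-side equivalent of that nm check, assuming the cblas symbols follow the same $NEWLAPACK$ILP64 naming scheme as the Fortran ones:

```
using Libdl

handle = Libdl.dlopen("/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate")

# Returns a pointer on macOS >= 13.3; C_NULL means the ILP64 symbol is absent.
Libdl.dlsym_e(handle, "cblas_ddot\$NEWLAPACK\$ILP64")
```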

zinphi commented 1 year ago

Maybe I'm doing something wrong or maybe Apple still needs to deliver an Xcode update. You can paste the attached code into your command line to try it yourself (at least 16 GB RAM needed).

The release candidate of Xcode 14.3 is ready for download at https://xcodereleases.com

Many thanks, this resolved the issue :) In addition to macOS 13.3, you obviously also need Xcode 14.3 to make it work. I think the warning you get (I get the same) should just inform you that this code will not run on macOS versions < 13.3.

That looks better! Use nm ./test_apple_ilp64 to see what symbols it's linking against; you want to see the $NEWLAPACK$ILP64 suffixes on your ddot symbols.

Yes, you were right: these variables need to be set as preprocessor directives. Just as an idea: would it be possible to add these same preprocessor variables to the Julia build environment? That way the macros defined by Apple (see, e.g., lapack_version.h) would automatically pick the correct symbol for the user's environment. However, this is probably not compatible with the logic of LBT and the rest...

philipturner commented 1 year ago

There may be a hardware explanation for why Accelerate is consistently slower than OpenBLAS (for everything except SGEMM, DGEMM, or other functions that basically wrap GEMM in a different interface):

https://github.com/corsix/amx/issues/6#issuecomment-1477091144

Also disappointing: AMX vector throughput is less than CPU NEON vector throughput. Perhaps that's why Apple's BLAS library consistently underperforms OpenBLAS by a factor of two: instead of using the NEON units in a multithreaded setting, the CPUs all fight over the same AMX block, which has less theoretical FLOPS.

Most of linear algebra is O(n^3), but few O(n^3) algorithms can have all three dimensions parallelized. In a sense, this is why very few algorithms run faster on the AMX.

ctkelley commented 1 year ago

There may be a hardware explanation for why Accelerate is consistently slower than OpenBLAS (for everything except SGEMM, DGEMM, or other functions that basically wrap GEMM in a different interface):

So, would it make sense to use Apple's BLAS and the LAPACK from OpenBLAS? Can LBT do that?

philipturner commented 1 year ago

So, would it make sense to use Apple's BLAS and the LAPACK from OpenBLAS? Can LBT do that?

We make a wrapper. Call into Accelerate for GEMM, OpenBLAS for everything else.
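
At the Julia level that routing could look like the sketch below (using the wrapper package named earlier in this thread; a production solution would live inside LBT instead):

```
using LinearAlgebra
import AppleAccelerateLinAlgWrapper

# Level-3 GEMM goes to Accelerate (AMX); everything else stays on the
# default BLAS/LAPACK (OpenBLAS).
fast_mul(A, B) = AppleAccelerateLinAlgWrapper.gemm(A, B)
fast_lu(A)     = lu(A)
```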

mzy2240 commented 1 year ago

There are quite a few functions in Accelerate that are faster than the current implementations, e.g. vDSP_maxvD vs maximum; they are not BLAS though.
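
For example, a hedged sketch of calling vDSP_maxvD directly, following the signature in Apple's vDSP documentation (input pointer, stride, output pointer, length):

```
const libacc = "/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate"

# vDSP_maxvD(const double *A, vDSP_Stride IA, double *C, vDSP_Length N)
function vdsp_max(x::Vector{Float64})
    out = Ref{Float64}()
    ccall((:vDSP_maxvD, libacc), Cvoid,
          (Ptr{Float64}, Clong, Ref{Float64}, Culong),
          x, 1, out, length(x))
    return out[]
end

vdsp_max(rand(10^6))  # compare against maximum
```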