Closed pablosanjose closed 10 months ago
Can you try it with AppleAccelerate.jl and also 1.11-dev (which uses a new openblas)?
OpenBLAS on 1.11-dev, AppleAccelerate on 1.9.4 and AppleAccelerate on 1.10.0 all work fine.
So no need to bisect then, the fix should be JuliaLang/julia#52762, right? Could that be backported to 1.10?
EDIT: not so sure... I just tried and it still fails on JuliaLang/julia#52762. Trying JuliaLang/julia#52928 now.
Does it work if you start julia with
OPENBLAS_CORETYPE=NEOVERSEN1 julia
?
@giordano, on what version? On 1.10 the OPENBLAS_CORETYPE=NEOVERSEN1
doesn't fix it, it still fails
On JuliaLang/julia#52928 there is no need to set OPENBLAS_CORETYPE
, it works perfectly out of the box
on what version? On 1.10 the OPENBLAS_CORETYPE=NEOVERSEN1 doesn't fix it, it still fails
On 1.10.0, because it ships OpenBLAS v0.3.23 which still falls back on the generic armv8 kernels on Apple Silicon, but in v0.3.26 (what's used now in master
, or 1.11-DEV) it automatically uses the neoversen1 kernels, which are much faster (so that setting OPENBLAS_CORETYPE=NEOVERSEN1
on julia v1.10 should usually be a bit faster than the default behaviour). In Julia v1.11 you don't need to set it because you already get that (as you can see with OPENBLAS_VERBOSE=2
).
On JuliaLang/julia#52928 there is no need to set OPENBLAS_CORETYPE, it works perfectly out of the box
I'm a bit confused, the build in JuliaLang/julia#52928 shouldn't theoretically affect Apple Silicon, it doesn't have SVE, unless the problem was in the specific version of the compiler used.
On 1.10.0, because it ships OpenBLAS v0.3.23 which still falls back on the generic armv8 kernels on Apple Silicon, but in v0.3.26 (what's used now in master, or 1.11-DEV) it automatically uses the neoversen1 kernels, which are much faster (so that setting OPENBLAS_CORETYPE=NEOVERSEN1 on julia v1.10 should usually be a bit faster than the default behaviour). In Julia v1.11 you don't need to set it because you already get that (as you can see with OPENBLAS_VERBOSE=2).
Ok, so your hunch is that setting the OPENBLAS_CORETYPE
env var should be enough in 1.10? I can confirm that it is not, it still fails.
I'm a bit confused, the build in https://github.com/JuliaLang/julia/pull/52928 shouldn't theoretically affect Apple Silicon, it doesn't have SVE, unless the problem was in the specific version of the compiler used.
Me too. So I've tried JuliaLang/julia#52762 again, which failed for me before, but now it works (!). I might have forgotten to do a make cleanall
the first time.
So at this point my understanding is that something in OpenBLAS v0.3.26 fixed the issue, but that it was still present with OpenBLAS v0.3.23, even with the OPENBLAS_CORETYPE=NEOVERSEN1
envvar.
Ok, so your hunch is that setting the
OPENBLAS_CORETYPE
env var should be enough in 1.10?
Not really, I was wondering if you were hitting a bug in the generic armv8 kernels, and switching to the neoversen1 ones would work around it, but apparently it isn't the case. Using OPENBLAS_CORETYPE=NEOVERSEN1
is still useful on 1.10 to get better speed, regardless of whether it fixes your bug.
So I've tried https://github.com/JuliaLang/julia/pull/52762 again, which failed for me before, but now it works (!). I might have forgotten to do a make cleanall the first time.
That's good, switching to neoversen1 kernels wasn't the only change between v0.3.23 and v0.3.26, the diff is pretty large, mine was a low-effort attempt to guess a possible solution, but perhaps there was some other bug somewhere else.
@giordano will the openblas bump that is making its way through Yggdrasil fix the perf issue you mention here - even on 1.10?
So I managed to bisect this (took a while!) and it turns out the fix was JuliaLang/julia#51660.
EDIT: FWIW the corresponding Julia version for the fix is 1.11.0-DEV.632
Perhaps it is this? https://github.com/OpenMathLib/OpenBLAS/pull/4003
That sounds very suspicious
@pablosanjose if you have a fortran compiler on your M1 (I don't) would you be able to bisect openblas between 0.3.23 and 0.3.24? You don't need to touch Julia, can use 1.10, you only need to use BLAS.lbt_forward to run blas functions using the just built library, building openblas takes relatively little.
@pablosanjose A script like this (untested!) should help for bisecting:
#!/bin/bash
make clean || true
make DYNAMIC_ARCH=1 LIBPREFIX=libopenblas64_ INTERFACE64=1 SYMBOLSUFFIX=64_ -j || exit 125
julia --startup-file=no -e '
using LinearAlgebra
BLAS.lbt_forward(realpath("./libopenblas64_.dylib"); clear=true)
function test()
C = ComplexF32
mat = zeros(C, 1, 1)
for _ in 1:100
v = [C(1-0.2im) C(2+0.3im)]
mul!(mat, v, adjoint(v), C(1+im), 1)
end
return mat
end
exit(Int(!allequal(test() for _ in 1:100_000)))
'
You can run it with
git bisect start
git bisect bad v0.3.23
git bisect good v0.3.24
git bisect run ./script.sh
Side note, besides not having a Fortran compiler at hand, I can't even reproduce the issue on my MacBook Pro (13-inch, M1, 2020), macOS 12.6:
julia> using LinearAlgebra
julia> function test()
C = ComplexF32
mat = zeros(C, 1, 1)
for _ in 1:100
v = [C(1-0.2im) C(2+0.3im)]
mul!(mat, v, v', C(1+im), 1)
end
return mat
end
test (generic function with 1 method)
julia> allequal(test() for _ in 1:1_000_000)
true
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 8 × Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 1 on 4 virtual cores
Same in Julia v1.9.4.
FWIW, OpenBLAS 0.3.24 release notes do say there was a bug in CGEMM for Apple aarch64. Perhaps https://github.com/OpenMathLib/OpenBLAS/pull/4003 can be backported to 0.3.23 and we can test if this issue is still present. The issue is https://github.com/OpenMathLib/OpenBLAS/issues/3995 and seems like is only on M2 and higher.
@giordano, thanks for the pointers, I'll try. It's a bit unsettling that you cannot reproduce on your machine, but your CPU is actually different. Mine's an M1 Max (macos 14.2.1) @ViralBShah , I've no idea, but https://github.com/OpenMathLib/OpenBLAS/pull/4003 includes M1 in the title, so it might still be the fix, let's see
I'm having trouble compiling. It must be some stupid path or flag issue, but I'm stuck, so I'd appreciate some advice from anyone with some make
experience. The invocation suggesed by @giordano
make clean
make DYNAMIC_ARCH=1 LIBPREFIX=libopenblas64_ INTERFACE64=1 SYMBOLSUFFIX=64_ -j
fails for me with these last lines
ranlib ../../../libopenblas64_p-r0.3.23.a
./gensymbol objconv arm64 _ 0 0 0 0 0 0 "" "64_" 1 0 1 1 1 1 > objconv.def
./gensymbol osx arm64 _ 0 0 0 0 0 0 "" "64_" 1 0 1 1 1 1 > osx.def
objconv @objconv.def ../libopenblas64_p-r0.3.23.a ../libopenblas64_p-r0.3.23.a.osx.renamed
make[1]: objconv: No such file or directory
make[1]: *** [../libopenblas64_p-r0.3.23.a.osx.renamed] Error 1
make: *** [shared] Error 2
Gah, it was staring me in the face, objconv
was not installed. Sorry for the noise.
Perhaps OpenMathLib/OpenBLAS#4003 can be backported to 0.3.23 and we can test if this issue is still present
That does look like a good candidate, but I believe having a successful bisect to directly identify the fix would be better than guessing, as well educated the guess is 🙂
To save some time I focused directly on the two subsequent commits 51dd1339e and efcf71255 (the actual https://github.com/OpenMathLib/OpenBLAS/pull/4003), and I can confirm that in v1.10, the first one fails and the second one doesn't.
Awesome, thanks for confirming that's it! Now we can backport that single patch.
@pablosanjose Will you be able to make a PR in Yggdrasil for OpenBLAS and OpenBLAS32 0.3.23?
I'd love to try! Great learning experience for me, never dipped my toes into the Tree of Life itself :-D.
I'll do my best, and I'll ask for help if required.
In this case all you have to do is to place https://github.com/OpenMathLib/OpenBLAS/commit/efcf71255aaf0c636fc76e5235dde8bb9cf76469.patch in the O/OpenBLAS/OpenBLAS@0.3.23/bundled/patches
and O/OpenBLAS/OpenBLAS32@0.3.23/bundled/patches
directories.
That simple? Ok, done!
Fixed by JuliaLang/julia#53074.
mul!
ofComplexF32
arrays is subtly broken under Apple Silicon in Julia 1.9 and 1.10Here is a MWE
This should give always
true
, but it randomly givesfalse
in the version/OSs above. The deviations of "bad" runs oftest()
(a small percentage of the 10000) is small, but well within the precision ofComplexF32
.This was tested on a Macbook Pro M1 (fails) and a Xeon server under Linux (does not fail). Can someone reproduce?
Notably, current master does not have this problem on macos. I'll try to bisect to see when it was fixed, in case we can backport the patch.