JuliaLang / LinearAlgebra.jl

Julia Linear Algebra standard library
Other
17 stars 4 forks source link

Non-deterministic `mul!` of `ComplexF32` arrays under Apple Silicon #1051

Closed pablosanjose closed 10 months ago

pablosanjose commented 10 months ago

mul! of ComplexF32 arrays is subtly broken under Apple Silicon in Julia 1.9 and 1.10

Here is a MWE

using LinearAlgebra

function test()
    C = ComplexF32
    mat = zeros(C, 1, 1)
    for _ in 1:100
        v = [C(1-0.2im) C(2+0.3im)]
        mul!(mat, v, v', C(1+im), 1)
    end
    return mat
end

allequal(test() for _ in 1:10000)

This should give always true, but it randomly gives false in the version/OSs above. The deviations of "bad" runs of test() (a small percentage of the 10000) is small, but well within the precision of ComplexF32.

This was tested on a Macbook Pro M1 (fails) and a Xeon server under Linux (does not fail). Can someone reproduce?

Notably, current master does not have this problem on macos. I'll try to bisect to see when it was fixed, in case we can backport the patch.

ViralBShah commented 10 months ago

Can you try it with AppleAccelerate.jl and also 1.11-dev (which uses a new openblas)?

pablosanjose commented 10 months ago

OpenBLAS on 1.11-dev, AppleAccelerate on 1.9.4 and AppleAccelerate on 1.10.0 all work fine.

So no need to bisect then, the fix should be JuliaLang/julia#52762, right? Could that be backported to 1.10?

EDIT: not so sure... I just tried and it still fails on JuliaLang/julia#52762. Trying JuliaLang/julia#52928 now.

giordano commented 10 months ago

Does it work if you start julia with

OPENBLAS_CORETYPE=NEOVERSEN1 julia

?

pablosanjose commented 10 months ago

@giordano, on what version? On 1.10 the OPENBLAS_CORETYPE=NEOVERSEN1 doesn't fix it, it still fails

On JuliaLang/julia#52928 there is no need to set OPENBLAS_CORETYPE, it works perfectly out of the box

giordano commented 10 months ago

on what version? On 1.10 the OPENBLAS_CORETYPE=NEOVERSEN1 doesn't fix it, it still fails

On 1.10.0, because it ships OpenBLAS v0.3.23 which still falls back on the generic armv8 kernels on Apple Silicon, but in v0.3.26 (what's used now in master, or 1.11-DEV) it automatically uses the neoversen1 kernels, which are much faster (so that setting OPENBLAS_CORETYPE=NEOVERSEN1 on julia v1.10 should usually be a bit faster than the default behaviour). In Julia v1.11 you don't need to set it because you already get that (as you can see with OPENBLAS_VERBOSE=2).

On JuliaLang/julia#52928 there is no need to set OPENBLAS_CORETYPE, it works perfectly out of the box

I'm a bit confused, the build in JuliaLang/julia#52928 shouldn't theoretically affect Apple Silicon, it doesn't have SVE, unless the problem was in the specific version of the compiler used.

pablosanjose commented 10 months ago

On 1.10.0, because it ships OpenBLAS v0.3.23 which still falls back on the generic armv8 kernels on Apple Silicon, but in v0.3.26 (what's used now in master, or 1.11-DEV) it automatically uses the neoversen1 kernels, which are much faster (so that setting OPENBLAS_CORETYPE=NEOVERSEN1 on julia v1.10 should usually be a bit faster than the default behaviour). In Julia v1.11 you don't need to set it because you already get that (as you can see with OPENBLAS_VERBOSE=2).

Ok, so your hunch is that setting the OPENBLAS_CORETYPE env var should be enough in 1.10? I can confirm that it is not, it still fails.

I'm a bit confused, the build in https://github.com/JuliaLang/julia/pull/52928 shouldn't theoretically affect Apple Silicon, it doesn't have SVE, unless the problem was in the specific version of the compiler used.

Me too. So I've tried JuliaLang/julia#52762 again, which failed for me before, but now it works (!). I might have forgotten to do a make cleanall the first time.

So at this point my understanding is that something in OpenBLAS v0.3.26 fixed the issue, but that it was still present with OpenBLAS v0.3.23, even with the OPENBLAS_CORETYPE=NEOVERSEN1 envvar.

giordano commented 10 months ago

Ok, so your hunch is that setting the OPENBLAS_CORETYPE env var should be enough in 1.10?

Not really, I was wondering if you were hitting a bug in the generic armv8 kernels, and switching to the neoversen1 ones would work around it, but apparently it isn't the case. Using OPENBLAS_CORETYPE=NEOVERSEN1 is still useful on 1.10 to get better speed, regardless of whether it fixes your bug.

So I've tried https://github.com/JuliaLang/julia/pull/52762 again, which failed for me before, but now it works (!). I might have forgotten to do a make cleanall the first time.

That's good, switching to neoversen1 kernels wasn't the only change between v0.3.23 and v0.3.26, the diff is pretty large, mine was a low-effort attempt to guess a possible solution, but perhaps there was some other bug somewhere else.

ViralBShah commented 10 months ago

@giordano will the openblas bump that is making its way through Yggdrasil fix the perf issue you mention here - even on 1.10?

pablosanjose commented 10 months ago

So I managed to bisect this (took a while!) and it turns out the fix was JuliaLang/julia#51660.

EDIT: FWIW the corresponding Julia version for the fix is 1.11.0-DEV.632

ViralBShah commented 10 months ago

Perhaps it is this? https://github.com/OpenMathLib/OpenBLAS/pull/4003

gbaraldi commented 10 months ago

That sounds very suspicious

giordano commented 10 months ago

@pablosanjose if you have a fortran compiler on your M1 (I don't) would you be able to bisect openblas between 0.3.23 and 0.3.24? You don't need to touch Julia, can use 1.10, you only need to use BLAS.lbt_forward to run blas functions using the just built library, building openblas takes relatively little.

giordano commented 10 months ago

@pablosanjose A script like this (untested!) should help for bisecting:

#!/bin/bash

make clean || true
make DYNAMIC_ARCH=1 LIBPREFIX=libopenblas64_ INTERFACE64=1 SYMBOLSUFFIX=64_ -j || exit 125

julia --startup-file=no -e '
using LinearAlgebra
BLAS.lbt_forward(realpath("./libopenblas64_.dylib"); clear=true)
function test()
    C = ComplexF32
    mat = zeros(C, 1, 1)
    for _ in 1:100
        v = [C(1-0.2im) C(2+0.3im)]
        mul!(mat, v, adjoint(v), C(1+im), 1)
    end
    return mat
end
exit(Int(!allequal(test() for _ in 1:100_000)))
'

You can run it with

git bisect start
git bisect bad v0.3.23
git bisect good v0.3.24
git bisect run ./script.sh

Side note, besides not having a Fortran compiler at hand, I can't even reproduce the issue on my MacBook Pro (13-inch, M1, 2020), macOS 12.6:

julia> using LinearAlgebra

julia> function test()
           C = ComplexF32
           mat = zeros(C, 1, 1)
           for _ in 1:100
               v = [C(1-0.2im) C(2+0.3im)]
               mul!(mat, v, v', C(1+im), 1)
           end
           return mat
       end
test (generic function with 1 method)

julia> allequal(test() for _ in 1:1_000_000)
true

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

Same in Julia v1.9.4.

ViralBShah commented 10 months ago

FWIW, OpenBLAS 0.3.24 release notes do say there was a bug in CGEMM for Apple aarch64. Perhaps https://github.com/OpenMathLib/OpenBLAS/pull/4003 can be backported to 0.3.23 and we can test if this issue is still present. The issue is https://github.com/OpenMathLib/OpenBLAS/issues/3995 and seems like is only on M2 and higher.

pablosanjose commented 10 months ago

@giordano, thanks for the pointers, I'll try. It's a bit unsettling that you cannot reproduce on your machine, but your CPU is actually different. Mine's an M1 Max (macos 14.2.1) @ViralBShah , I've no idea, but https://github.com/OpenMathLib/OpenBLAS/pull/4003 includes M1 in the title, so it might still be the fix, let's see

pablosanjose commented 10 months ago

I'm having trouble compiling. It must be some stupid path or flag issue, but I'm stuck, so I'd appreciate some advice from anyone with some make experience. The invocation suggesed by @giordano

make clean
make DYNAMIC_ARCH=1 LIBPREFIX=libopenblas64_ INTERFACE64=1 SYMBOLSUFFIX=64_ -j

fails for me with these last lines

ranlib ../../../libopenblas64_p-r0.3.23.a
./gensymbol objconv arm64 _ 0 0  0 0 0 0 "" "64_" 1 0 1 1 1 1 > objconv.def
./gensymbol osx arm64 _ 0 0  0 0 0 0 "" "64_" 1 0 1 1 1 1 > osx.def
objconv @objconv.def ../libopenblas64_p-r0.3.23.a ../libopenblas64_p-r0.3.23.a.osx.renamed
make[1]: objconv: No such file or directory
make[1]: *** [../libopenblas64_p-r0.3.23.a.osx.renamed] Error 1
make: *** [shared] Error 2
pablosanjose commented 10 months ago

Gah, it was staring me in the face, objconv was not installed. Sorry for the noise.

giordano commented 10 months ago

Perhaps OpenMathLib/OpenBLAS#4003 can be backported to 0.3.23 and we can test if this issue is still present

That does look like a good candidate, but I believe having a successful bisect to directly identify the fix would be better than guessing, as well educated the guess is 🙂

pablosanjose commented 10 months ago

To save some time I focused directly on the two subsequent commits 51dd1339e and efcf71255 (the actual https://github.com/OpenMathLib/OpenBLAS/pull/4003), and I can confirm that in v1.10, the first one fails and the second one doesn't.

giordano commented 10 months ago

Awesome, thanks for confirming that's it! Now we can backport that single patch.

ViralBShah commented 10 months ago

@pablosanjose Will you be able to make a PR in Yggdrasil for OpenBLAS and OpenBLAS32 0.3.23?

pablosanjose commented 10 months ago

I'd love to try! Great learning experience for me, never dipped my toes into the Tree of Life itself :-D.

I'll do my best, and I'll ask for help if required.

giordano commented 10 months ago

In this case all you have to do is to place https://github.com/OpenMathLib/OpenBLAS/commit/efcf71255aaf0c636fc76e5235dde8bb9cf76469.patch in the O/OpenBLAS/OpenBLAS@0.3.23/bundled/patches and O/OpenBLAS/OpenBLAS32@0.3.23/bundled/patches directories.

pablosanjose commented 10 months ago

That simple? Ok, done!

giordano commented 10 months ago

Fixed by JuliaLang/julia#53074.