dlfivefifty opened this issue 11 months ago

Here's a subtle change that is breaking LazyArrays.jl: note that the second-to-last number of the result differs, which was not the case with Julia v1.9. The discrepancy is not present when I call using AppleAccelerate, so it's presumably a change in OpenBLAS. It's probably not so critical, since I can modify the tests in LazyArrays.jl, but I wanted to make sure it is recorded somewhere.
I can't reproduce this on my Intel PC with 1.11-DEV.1605 and a fresh start (no additional packages loaded at startup). I'm not sure whether the issue is Apple-specific or has been resolved in the meantime:
julia> versioninfo()
Julia Version 1.11.0-DEV.1605
Commit 61c3521613 (2024-02-15 10:26 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, tigerlake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
julia> include("deepcopy.jl")
DONE
The crux of the difference, I think, is that when Base.unalias needs to make a copy of a view, it'll do so in an efficient manner (while still preserving type stability). This means that the stride of x changes from 2 to 1, and using similar(y) will give you a 1-strided Array instead of the 2-strided view. I suspect the numerical difference is due to a different codepath in BLAS.gemv! depending upon (at least one of) these strides. But it's surprising to me (and seems to be quite problematic) that it's different by 72 ULPs! (See the stride check sketched after the output below.)
julia> x
5-element view(::Vector{Float64}, 1:2:9) with eltype Float64:
-0.2064556821796735
1.6006631775985414
-1.653859231044385
-1.1958166751129422
-2.0804917307469575
julia> Base.unalias(y, x)
5-element view(::Vector{Float64}, 1:1:5) with eltype Float64:
-0.2064556821796735
1.6006631775985414
-1.653859231044385
-1.1958166751129422
-2.0804917307469575
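For concreteness, the stride change can be checked directly. A quick sketch, using a stand-in parent vector p rather than the data above:
julia> p = randn(9);  # stand-in parent vector
julia> x = view(p, 1:2:9);  # the same kind of 2-strided view as above
julia> strides(x)
(2,)
julia> strides(Base.unalias(p, x))  # the unaliasing copy is a 1-strided view
(1,)
julia> strides(similar(x))  # similar returns a plain, 1-strided Array
(1,)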
I can reproduce, but of course I got a different random seed — in the session below I have one value different by 1 ULP and another by 8.
Here's a simpler reproducer. The difference is whether b is 1- or 2-strided:
julia> using LinearAlgebra
julia> A = [-0.23190906957695134 1.1722448627021573 -0.26988492037793593 0.13457571047366457 -1.0745769949749375; 0.940390210248594 -1.696813019281842 0.5780036022342506 0.16554255091922548 -1.3819748518823003; 0.5967616711335215 -2.1161459685604527 1.1603442420756227 -1.0720974154467635 0.9518527642027133; 1.9978240930577937 0.5583656057835097 0.28788799558526296 -1.2601642953877346 -0.2922912657969977; -0.05156184220322609 -0.8647295755028067 -0.44119282303422414 0.196408003082906 0.23759838081412232];
julia> b_in = view([1.3433020156405961, -0.024246556222684942, 0.07889209530976744, 0.5786789477077278, 1.4676494758572833, -1.7664296632523324, 0.6087554476370254, -0.1401993137927991, 0.9637377932632719], 1:2:9);
julia> BLAS.gemv!('T', 1.0, A, b_in, 0.0, Array{Float64}(undef, 5))
5-element Vector{Float64}:
1.804996591385985
-2.1584112377138487
1.136100764385711
-1.9574734522725015
-0.10447348570774548
julia> BLAS.gemv!('T', 1.0, A, copy(b_in), 0.0, Array{Float64}(undef, 5))
5-element Vector{Float64}:
1.8049965913859851
-2.1584112377138487
1.136100764385711
-1.9574734522725015
-0.10447348570774537
julia> BLAS.gemv!('T', 1.0, A, view(copy(b_in), 1:1:5), 0.0, Array{Float64}(undef, 5))
5-element Vector{Float64}:
1.8049965913859851
-2.1584112377138487
1.136100764385711
-1.9574734522725015
-0.10447348570774537
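For what it's worth, the per-entry discrepancy can be quantified roughly in ULPs with a small helper (the name ulps is made up here; eps(x) is the Float64 spacing at magnitude x), reusing A and b_in from the reproducer above:

ulps(a, b) = abs(a - b) / eps(max(abs(a), abs(b)))  # approximate ULP distance between a and b

y_strided = BLAS.gemv!('T', 1.0, A, b_in, 0.0, Array{Float64}(undef, 5))        # 2-strided input
y_contig  = BLAS.gemv!('T', 1.0, A, copy(b_in), 0.0, Array{Float64}(undef, 5))  # contiguous input
ulps.(y_strided, y_contig)  # with the values printed above: about 1 ULP in the first entry and 8 in the last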
With the above reproducer, I can see that OpenBLAS has been doing this since Julia v1.8 — the above session is consistent across v1.8, v1.9, v1.10, and master.
Unfortunately, again, I cannot reproduce it even with the isolated code above. My CPU is "11th Gen Intel(R) Core(TM) i7-1165G7." I suspect that this is an Apple-only issue as both the above inconsistencies are on Apple M1.
Stride is a very good hint. It would also be interesting to see if this might be an order-of-evaluation issue, e.g., by restricting the number of BLAS threads to one.
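For reference, pinning OpenBLAS to a single thread before re-running the reproducer is just:
julia> using LinearAlgebra
julia> BLAS.set_num_threads(1)
julia> BLAS.get_num_threads()
1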
Yeah, I'm sure this is happening somewhere inside the OpenBLAS assembly kernels, which will of course vary by arch. I still see the issue regardless of the number of threads, as I'd expect, since a 5x5 matrix is too small to hit the multithreaded routines.
The real question is: is it a problem to have different numerical results depending upon array layout? It kinda makes sense at larger scales, no? Any sufficiently smart BLAS would want to block things (and re-associate floating-point arithmetic) differently if there were efficiencies to be had in doing so, yeah? And some arbitrarily large re-striding of the same values certainly could cause such a thing.
Of course, I'm not sure if OpenBLAS is necessarily doing something smart or advantageous here, but it could be.
I think your intuition is perfectly right. I am no expert in arm64, but I just had a quick look at this file+line: OpenBLAS/kernel/arm64/gemv_t.S#L244 and it does seem like the driver uses vectorised code KERNEL_F32+KERNEL_F4+KERNEL_F1 when INC_X is one and scalar code KERNEL_S1 for other values of INC_X. The order of evaluation differs between the two, which leads to different rounding and may explain the difference. Sadly, I don't have arm64 at my disposal to experiment further and validate my hunch.
On the other hand, the x86_64 version of gemv seems to go to great lengths to ensure the same order of evaluation irrespective of the underlying instructions.
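To illustrate the effect with a minimal sketch in plain Julia (not the actual OpenBLAS kernels): accumulating the same products with one running sum versus with four partial sums that are combined at the end can round differently, which is exactly the kind of divergence a scalar KERNEL_S1 path versus a vectorised path would produce.

# dot product in strictly sequential order, like a scalar kernel
function dot_scalar(a, x)
    s = 0.0
    for i in eachindex(a, x)
        s += a[i] * x[i]
    end
    return s
end

# dot product with four partial sums combined at the end, mimicking a
# 4-wide vectorised kernel (assumes length(a) is a multiple of 4)
function dot_lanes4(a, x)
    acc = zeros(4)
    for i in 1:4:length(a)
        for k in 0:3
            acc[k + 1] += a[i + k] * x[i + k]
        end
    end
    return (acc[1] + acc[2]) + (acc[3] + acc[4])
end

a, x = randn(512), randn(512)
dot_scalar(a, x) == dot_lanes4(a, x)  # frequently false: the two orders can differ by an ULP or two

Both compute a valid dot product of the same data; only the association of the additions differs.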
As to whether it's a problem to have different numerical results depending upon array layout: unfortunately, I'm not sure what the right option is here for Julia. As you're aware, a lot of native Julia code was written to ensure such consistency, but in this case my guess is that enforcing it might lead to performance issues?
I think the fact that x86 preserves the evaluation order suggests it is indeed a bug, though I agree it's not so important.
Here's another surprising inconsistency where LU solves seem to use a different implementation of forward/backward substitution than triangular solves:
julia> A = [0.1 1 1;
2 4 8;
1 4 9];
julia> L,U,σ = lu(A);
julia> b = [10,11,12];
julia> U\(L\b[σ]) == A\b
false
julia> using AppleAccelerate
julia> U\(L\b[σ]) == A\b
true
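If I understand the dispatch correctly (an assumption worth double-checking), A\b for a square dense Float64 matrix solves through the LU object, which calls LAPACK's getrs, while U\(L\b[σ]) performs two explicit triangular solves, so the two paths can associate the arithmetic differently. A sketch comparing both paths from a single factorization:

using LinearAlgebra

A = [0.1 1 1; 2 4 8; 1 4 9]
b = [10.0, 11.0, 12.0]

F = lu(A)  # one factorization, two solve paths
x_getrs = LinearAlgebra.LAPACK.getrs!('N', copy(F.factors), F.ipiv, copy(b))  # presumably the path A\b takes
x_tri = F.U \ (F.L \ b[F.p])  # explicit forward/backward substitution, as in the example above
x_getrs == x_tri  # reportedly false with OpenBLAS on the reporter's machine, true with AppleAccelerate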