dlfivefifty opened this issue 11 months ago

Here's a subtle change that is breaking LazyArrays.jl: note that the second-to-last number of the result differs, which was not the case with Julia v1.9. The discrepancy is not present when I call using AppleAccelerate, so it's presumably a change in OpenBLAS. It's probably not so critical, since I can modify the tests in LazyArrays.jl, but I wanted to make sure it is recorded somewhere.
I can't reproduce this on my Intel PC with 1.11-DEV.1605 and a fresh start (no additional packages loaded at startup). I'm not sure whether the issue is Apple-specific or has been resolved in the meantime:
julia> versioninfo()
Julia Version 1.11.0-DEV.1605
Commit 61c3521613 (2024-02-15 10:26 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, tigerlake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
julia> include("deepcopy.jl")
DONE
The crux of the difference, I think, is that when Base.unalias needs to make a copy of a view, it'll do so in an efficient manner (while still preserving type stability). This means that the stride of x changes from 2 to 1, and using similar(y) will give you a 1-strided Array instead of the 2-strided view. I suspect the numerical difference is due to a different codepath in BLAS.gemv! depending upon (at least one of) these strides. But it's surprising to me (and seems to be quite problematic) that it's different by 72 ULPs! (See the stride check sketched after the output below.)
julia> x
5-element view(::Vector{Float64}, 1:2:9) with eltype Float64:
-0.2064556821796735
1.6006631775985414
-1.653859231044385
-1.1958166751129422
-2.0804917307469575
julia> Base.unalias(y, x)
5-element view(::Vector{Float64}, 1:1:5) with eltype Float64:
-0.2064556821796735
1.6006631775985414
-1.653859231044385
-1.1958166751129422
-2.0804917307469575
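For concreteness, the stride change can be checked directly. A quick sketch, using a stand-in parent vector p rather than the data above:
julia> p = randn(9);  # stand-in parent vector
julia> x = view(p, 1:2:9);  # the same kind of 2-strided view as above
julia> strides(x)
(2,)
julia> strides(Base.unalias(p, x))  # the unaliasing copy is a 1-strided view
(1,)
julia> strides(similar(x))  # similar returns a plain, 1-strided Array
(1,)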
I can reproduce, but of course I got a different random seed — in the session below I have one value different by 1 ULP and another by 8.
Here's a simpler reproducer. The difference is whether b is 1- or 2-strided:
julia> using LinearAlgebra
julia> A = [-0.23190906957695134 1.1722448627021573 -0.26988492037793593 0.13457571047366457 -1.0745769949749375; 0.940390210248594 -1.696813019281842 0.5780036022342506 0.16554255091922548 -1.3819748518823003; 0.5967616711335215 -2.1161459685604527 1.1603442420756227 -1.0720974154467635 0.9518527642027133; 1.9978240930577937 0.5583656057835097 0.28788799558526296 -1.2601642953877346 -0.2922912657969977; -0.05156184220322609 -0.8647295755028067 -0.44119282303422414 0.196408003082906 0.23759838081412232];
julia> b_in = view([1.3433020156405961, -0.024246556222684942, 0.07889209530976744, 0.5786789477077278, 1.4676494758572833, -1.7664296632523324, 0.6087554476370254, -0.1401993137927991, 0.9637377932632719], 1:2:9);
julia> BLAS.gemv!('T', 1.0, A, b_in, 0.0, Array{Float64}(undef, 5))
5-element Vector{Float64}:
1.804996591385985
-2.1584112377138487
1.136100764385711
-1.9574734522725015
-0.10447348570774548
julia> BLAS.gemv!('T', 1.0, A, copy(b_in), 0.0, Array{Float64}(undef, 5))
5-element Vector{Float64}:
1.8049965913859851
-2.1584112377138487
1.136100764385711
-1.9574734522725015
-0.10447348570774537
julia> BLAS.gemv!('T', 1.0, A, view(copy(b_in), 1:1:5), 0.0, Array{Float64}(undef, 5))
5-element Vector{Float64}:
1.8049965913859851
-2.1584112377138487
1.136100764385711
-1.9574734522725015
-0.10447348570774537
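For what it's worth, the per-entry discrepancy can be quantified roughly in ULPs with a small helper (the name ulps is made up here; eps(x) is the Float64 spacing at magnitude x), reusing A and b_in from the reproducer above:

ulps(a, b) = abs(a - b) / eps(max(abs(a), abs(b)))  # approximate ULP distance between a and b

y_strided = BLAS.gemv!('T', 1.0, A, b_in, 0.0, Array{Float64}(undef, 5))        # 2-strided input
y_contig  = BLAS.gemv!('T', 1.0, A, copy(b_in), 0.0, Array{Float64}(undef, 5))  # contiguous input
ulps.(y_strided, y_contig)  # with the values printed above: about 1 ULP in the first entry and 8 in the last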
With the above reproducer, I can see that OpenBLAS has been doing this since Julia v1.8 — the above session is consistent across v1.8, v1.9, v1.10, and master.
Unfortunately, again, I cannot reproduce it even with the isolated code above. My CPU is "11th Gen Intel(R) Core(TM) i7-1165G7." I suspect that this is an Apple-only issue as both the above inconsistencies are on Apple M1.
Stride is a very good hint. It would also be interesting to see if this might be an order-of-evaluation issue, e.g., by restricting the number of BLAS threads to one.
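For reference, pinning OpenBLAS to a single thread before re-running the reproducer is just:
julia> using LinearAlgebra
julia> BLAS.set_num_threads(1)
julia> BLAS.get_num_threads()
1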
Yeah, I'm sure this is happening somewhere inside the OpenBLAS assembly kernels, which will of course vary by arch. I still see the issue regardless of the number of threads, as I'd expect, since a 5x5 matrix is too small to hit the multithreaded routines.
The real question is: is it a problem to have different numerical results depending upon array layout? It kinda makes sense at larger scales, no? Any sufficiently smart BLAS would want to block things (and re-associate floating-point arithmetic) differently if there were efficiencies to be had in doing so, yeah? And some arbitrarily large re-striding of the same values certainly could cause such a thing.
Of course, I'm not sure if OpenBLAS is necessarily doing something smart or advantageous here, but it could be.
I think your intuition is perfectly right. I am no expert in arm64, but I just had a quick look at this file+line: OpenBLAS/kernel/arm64/gemv_t.S#L244 and it does seem like the driver uses vectorised code KERNEL_F32+KERNEL_F4+KERNEL_F1 when INC_X is one and scalar code KERNEL_S1 for other values of INC_X. The order of evaluation differs between the two, which leads to different rounding and may explain the difference. Sadly, I don't have arm64 at my disposal to experiment further and validate my hunch.
On the other hand, the x86_64 version of gemv seems to go to great lengths to ensure the same order of evaluation irrespective of the underlying instructions.
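To illustrate the effect with a minimal sketch in plain Julia (not the actual OpenBLAS kernels): accumulating the same products with one running sum versus with four partial sums that are combined at the end can round differently, which is exactly the kind of divergence a scalar KERNEL_S1 path versus a vectorised path would produce.

# dot product in strictly sequential order, like a scalar kernel
function dot_scalar(a, x)
    s = 0.0
    for i in eachindex(a, x)
        s += a[i] * x[i]
    end
    return s
end

# dot product with four partial sums combined at the end, mimicking a
# 4-wide vectorised kernel (assumes length(a) is a multiple of 4)
function dot_lanes4(a, x)
    acc = zeros(4)
    for i in 1:4:length(a)
        for k in 0:3
            acc[k + 1] += a[i + k] * x[i + k]
        end
    end
    return (acc[1] + acc[2]) + (acc[3] + acc[4])
end

a, x = randn(512), randn(512)
dot_scalar(a, x) == dot_lanes4(a, x)  # frequently false: the two orders can differ by an ULP or two

Both compute a valid dot product of the same data; only the association of the additions differs.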
As to whether it's a problem to have different numerical results depending upon array layout: unfortunately, I'm not sure what the right option is here for Julia. As you're aware, a lot of native Julia code was written to ensure such consistency, but in this case my guess is that enforcing it might lead to performance issues?
I think the fact that x86 preserves the evaluation order suggests it is indeed a bug, though I agree it's not so important.
Here's another surprising inconsistency where LU solves seem to use a different implementation of forward/backward substitution than triangular solves:
julia> A = [0.1 1 1;
2 4 8;
1 4 9];
julia> L,U,σ = lu(A);
julia> b = [10,11,12];
julia> U\(L\b[σ]) == A\b
false
julia> using AppleAccelerate
julia> U\(L\b[σ]) == A\b
true
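If I understand the dispatch correctly (an assumption worth double-checking), A\b for a square dense Float64 matrix solves through the LU object, which calls LAPACK's getrs, while U\(L\b[σ]) performs two explicit triangular solves, so the two paths can associate the arithmetic differently. A sketch comparing both paths from a single factorization:

using LinearAlgebra

A = [0.1 1 1; 2 4 8; 1 4 9]
b = [10.0, 11.0, 12.0]

F = lu(A)  # one factorization, two solve paths
x_getrs = LinearAlgebra.LAPACK.getrs!('N', copy(F.factors), F.ipiv, copy(b))  # presumably the path A\b takes
x_tri = F.U \ (F.L \ b[F.p])  # explicit forward/backward substitution, as in the example above
x_getrs == x_tri  # reportedly false with OpenBLAS on the reporter's machine, true with AppleAccelerate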