augusew1 opened 10 months ago
Somewhat surprising, as the three CPU generations would be using different optimized implementations of the GEMV BLAS kernel.
At first glance this appears to be some FMA-related effect from letting the compiler use AVX instructions. It is possible to obtain the netlib result by building with TARGET=GENERIC, but if I reconfigure any TARGET to use the same unoptimized, plain-C GEMV kernel without changing the compiler options in Makefile.x86_64, I end up with an "intermediate" result, although there is no other BLAS function in the call graph. (Unfortunately, removing -mavx breaks some code that uses intrinsics, so it will take more time to confirm that suspicion - later.)
It is one bit of precision off, a very normal occurrence when computing in a different order.
I don't think that's right; this is roughly -2^-51, and machine precision should be 2^-53. Even if that were "one bit off", the result is exactly 0.0. Surely that's a bug.
If you do:

```cpp
auto* c_blas = new double[16]();
c_blas[15] = 1.0e-18;
CBlasMatrixMultiply(A, b, c_blas);
printf("%s", c_blas[15] == 1.0e-18 ? "YES" : "NO");
```
Then it will return YES. It's not even changing the value. Our initial thought was that this was some sort of memory alignment issue due to the matrix size. We couldn't fix it with manual alignment to 8 bytes, though.
You exercise machine rounding 32 times (or 16 with FMA); allow at least 5 bits of slack in your check. It is not magic symbolic computation that gives an exact polynomial result every time.
I do not expect an identical result, but this result is exactly the starting value of c_blas[15], every time, for every value. 32 (or 16) floating-point operations and they all cancel out? Every time? For any starting value? Exactly? On multiple CPUs? This is the puzzle here, not why it doesn't exactly match netlib/MKL.
Every 1 or 2 FLOPs the intermediate result is rounded to output precision to be stored in a register, 50% up and 50% down, and so the lottery continues until the end of the computation. Yes, workload splitting affects the result.
Just a guess: Intel uses generic code for small inputs, then gradually jumps to vector code and adds CPU threads as the input grows. OpenBLAS uses vector code always and switches to all CPUs at one point.
As far as I can tell, all terms cancel with the "right" evaluation order and y[15] evaluates to "exact" zero within the limits of precision. As this is added to what the c_blas array initially contained (the "beta times y" of the GEMV equation, beta being one in your case), you see no change. There appears to be loss of precision in the AVX2 "microkernel" used for Haswell and newer, due to operand size limitations in the instruction set. Certainly not ideal, but not a catastrophic failure either (which would certainly have shown up in testsuites during the almost ten years this code has been in place).
I can't understand this conclusion, given that I can reproduce this with an OPENBLAS_CORETYPE env that doesn't seem to use AVX2. Am I misunderstanding what this env does? Is AVX2 always checked even when an older CPU is manually set?
Here's what I'm testing on: Ubuntu 22.04, with the libopenblas-pthread-dev package. I can also recreate this with OpenBLAS 0.3.24 on MSYS2, but that machine is Coffee Lake and can't use SkylakeX instructions. I'm using OPENBLAS_VERBOSE=5 to print the core type at the top.
In my testing, I'm seeing that neither AVX2 nor AVX matters, as Nehalem and SandyBridge give identical results. AVX2 on Haswell gives a slightly different result, but one that's well within machine precision. The biggest difference I see is with SkylakeX and, presumably, AVX512, which gives a different result for ddot only. This value is "correct" to us as it passes our test case and matches very closely what other BLAS libraries give.
Please help me understand how the AVX2 microkernel is the issue here. If OPENBLAS_CORETYPE does not actually control this, does OpenBLAS provide a way to disable AVX2 at runtime?
It will fall back to older compute kernels if you do not have AVX2 in CPUID. The difference is in the one or two least significant bits of the significand and is expected. If you want the same result always, you have to use an unoptimized netlib build.
The issue here is actually really simple. OpenBLAS for gemv isn't using FMA, which would also be faster.
For some Julia code demonstrating this, see:
```julia
A = [ 1.1 0 -1.1 0
     -1.1 0  1.1 0]
v = fill(1.33, 4)

A * v  # returns [0.0, 0.0]

using MKL
A * v  # returns [1.0391687510491465e-16, -1.0391687510491465e-16]

function simple_gemv_fma(A, v)
    result = zeros(eltype(A), size(A, 1))
    for i = 1:size(A, 1)
        for j = 1:size(A, 2)
            result[i] = fma(A[i, j], v[j], result[i])
        end
    end
    return result
end

function simple_gemv(A, v)
    result = zeros(eltype(A), size(A, 1))
    for i = 1:size(A, 1)
        for j = 1:size(A, 2)
            result[i] = A[i, j] * v[j] + result[i]
        end
    end
    return result
end

simple_gemv(A, v)      # returns [0.0, 0.0]
simple_gemv_fma(A, v)  # returns [1.0391687510491465e-16, -1.0391687510491465e-16]
```
> The issue here is actually really simple. OpenBLAS for gemv isn't using FMA which would also be faster.
I don't think it is that simple, unless you meant to write "the reference BLAS isn't using FMA", which is trivially true. The OpenBLAS GEMV kernels for the CPUs mentioned here all use FMA instructions; the question is whether they could or should be rewritten to minimize the difference seen in this particular case.
Hmm. If OpenBLAS GEMV is using FMA, in what order is it evaluating so as to match the results of the naive loop without FMA? I saw that the results matched those of the obvious algorithm and assumed from there.
We recently switched to testing OpenBLAS on a project and are noticing some test case failures due to a matrix multiplication operation returning an incorrect result.
This issue has been observed on a variety of platforms (Ubuntu 22.04, RHEL7, RHEL9, MSYS2 mingw), a variety of compilers (clang-15, mingw-13, gcc-12, gcc-11, gcc-9), a variety of OpenBLAS versions (0.3.3, 0.3.20, 0.3.21, 0.3.24), and a variety of CPUs:
Reproduction
I have attached a minimal reproducible example (in C++) showing the problem.
Reproduction Code
```cpp
#include
```

Compile this code with:
And observe the following output:
OpenBLAS result
```
BLAS                        MAT
1.175201193643801378e+00    1.175201193643801822e+00
1.103638323514327002e+00    1.103638323514327224e+00
3.578143506473725477e-01    3.578143506473724922e-01
7.045563366848892062e-02    7.045563366848883735e-02
9.965128148869371871e-03    9.965128148869309421e-03
1.099586127207577424e-03    1.099586127207556390e-03
9.945433911373591229e-05    9.945433911360671605e-05
7.620541308983597162e-06    7.620541308896932986e-06
5.064719745540013918e-07    5.064719744437437483e-07
2.971814122565419325e-08    2.971814140421127963e-08
1.560886642160141946e-09    1.560886564391797127e-09
7.419920233786569952e-11    7.419902813631537848e-11
3.221201083647429186e-12    3.221406076314455686e-12
1.292299600663682213e-13    1.289813102624490508e-13
4.440892098500626162e-15    4.440892098500626162e-15
0.000000000000000000e+00    -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
```

It is worth noting that only the last value differs outside of acceptable numerical precision, and that every other value agrees within 1e-16. Furthermore, a value of exactly 0.0 is, in itself, suspicious, as there's no real circumstance under which the value could be that.

Change the compile command to:
and observe this result:
netlib result
```
BLAS                        MAT
1.175201193643801822e+00    1.175201193643801822e+00
1.103638323514327224e+00    1.103638323514327224e+00
3.578143506473724922e-01    3.578143506473724922e-01
7.045563366848883735e-02    7.045563366848883735e-02
9.965128148869309421e-03    9.965128148869309421e-03
1.099586127207556390e-03    1.099586127207556390e-03
9.945433911360671605e-05    9.945433911360671605e-05
7.620541308896932986e-06    7.620541308896932986e-06
5.064719744437437483e-07    5.064719744437437483e-07
2.971814140421127963e-08    2.971814140421127963e-08
1.560886564391797127e-09    1.560886564391797127e-09
7.419902813631537848e-11    7.419902813631537848e-11
3.221406076314455686e-12    3.221406076314455686e-12
1.289813102624490508e-13    1.289813102624490508e-13
4.440892098500626162e-15    4.440892098500626162e-15
-4.440892098500626162e-16   -4.440892098500626162e-16
ddot: -4.440892098500626162e-16
dgemv: -4.440892098500626162e-16
ddot == dgemv? YES
dgemv is 0.0? NO
```

Here is the binary file that contains a 16x16 matrix and a 16x1 vector (NOTE: this is a binary data file, extension changed to make GitHub happy): reproduction.txt
Other Notes
We have done extensive testing in other BLAS-like environments to get a result close to the expected -4e-16 result, which passes our test. Both MATLAB (2023a) and numpy (1.26 w/ MKL) return a result very close to what we expect, and pass our test. And, obviously, our naive matrix multiplication in the reproduction code gives this value as well. The matrix in question is not overly ill-conditioned; it has a condition number of ~10.