OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Incorrect result with `cblas_dgemv` vs reference netlib and other libraries #4324

Open · augusew1 opened this issue 10 months ago

augusew1 commented 10 months ago

We recently switched to testing OpenBLAS on a project and are noticing some test case failures due to a matrix multiplication operation returning an incorrect result.

This issue has been observed on a variety of platforms (Ubuntu 22.04, RHEL7, RHEL9, MSYS2 mingw), a variety of compilers (clang-15, mingw-13, gcc-12, gcc-11, gcc-9), a variety of OpenBLAS versions (0.3.3, 0.3.20, 0.3.21, 0.3.24), and a variety of CPUs.

Reproduction

I have attached a minimal reproducible example (in C++) showing the problem.

Reproduction Code

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <cblas.h>

constexpr static size_t size = 16;

void get_values(double* A, double* b, const char* binfile) {
    std::fstream binaryReader;
    binaryReader.open(binfile, std::ios::in | std::ios::binary);
    assert(binaryReader.is_open());
    for (size_t i = 0; i < (size * size); i++) {
        binaryReader.read(reinterpret_cast<char*>(&A[i]), sizeof(double));
    }
    for (size_t i = 0; i < size; i++) {
        binaryReader.read(reinterpret_cast<char*>(&b[i]), sizeof(double));
    }
}

void matrixMultiply(const double* A, const double* b, double* c) {
    const size_t N = size;
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            c[i] += A[(i * N) + j] * b[j];
        }
    }
}

void CBlasMatrixMultiply(const double* a, const double* b, double* c) {
    int32_t M = size;
    int32_t N = size;
    cblas_dgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0, a, N, b, 1, 1.0, c, 1);
}

int main(int argc, char* argv[]) {
    auto* A = new double[256]();
    auto* b = new double[16]();
    const char* binfile = argc > 1 ? argv[1] : BIN_FILE;
    get_values(A, b, binfile);

    auto* c_blas = new double[16]();
    CBlasMatrixMultiply(A, b, c_blas);

    auto* c_mat = new double[16]();
    matrixMultiply(A, b, c_mat);

    printf(" BLAS MAT\n");
    for (size_t i = 0; i < size; i++) {
        printf("%0.18e\t%0.18e\n", c_blas[i], c_mat[i]);
    }

    // dot product?
    auto c_ddot = cblas_ddot(size, A + (size * 15), 1, b, 1);
    printf("ddot: %0.18e\n", c_ddot);
    printf("dgemv: %0.18e\n", c_blas[15]);
    printf("ddot == dgemv? %s\n", c_ddot == c_blas[15] ? "YES" : "NO");
    printf("dgemv is 0.0? %s\n", c_blas[15] == 0.0 ? "YES" : "NO");

    delete[] A;
    delete[] b;
    delete[] c_blas;
    delete[] c_mat;
    return 0;
}
```

Compile this code with:

```
g++ -o blas_test blas_test.cpp -DBIN_FILE=\"path/to/bin\" $(pkg-config --libs --cflags openblas)
```

And observe the following output:

OpenBLAS result

```
 BLAS                        MAT
1.175201193643801378e+00    1.175201193643801822e+00
1.103638323514327002e+00    1.103638323514327224e+00
3.578143506473725477e-01    3.578143506473724922e-01
7.045563366848892062e-02    7.045563366848883735e-02
9.965128148869371871e-03    9.965128148869309421e-03
1.099586127207577424e-03    1.099586127207556390e-03
9.945433911373591229e-05    9.945433911360671605e-05
7.620541308983597162e-06    7.620541308896932986e-06
5.064719745540013918e-07    5.064719744437437483e-07
2.971814122565419325e-08    2.971814140421127963e-08
1.560886642160141946e-09    1.560886564391797127e-09
7.419920233786569952e-11    7.419902813631537848e-11
3.221201083647429186e-12    3.221406076314455686e-12
1.292299600663682213e-13    1.289813102624490508e-13
4.440892098500626162e-15    4.440892098500626162e-15
0.000000000000000000e+00    -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
```

It is worth noting that only the last value differs by more than acceptable numerical precision; every other value agrees to within 1e-16. Furthermore, a result of exactly 0.0 is in itself suspicious, as there is no plausible way the computation could produce exactly that value.

Change the compile command to:

```
# cblas here is netlib
g++ -o blas_test blas_test.cpp -DBIN_FILE=\"path/to/bin\" $(pkg-config --libs --cflags cblas)
```

and observe this result:

netlib result

```
 BLAS                        MAT
1.175201193643801822e+00    1.175201193643801822e+00
1.103638323514327224e+00    1.103638323514327224e+00
3.578143506473724922e-01    3.578143506473724922e-01
7.045563366848883735e-02    7.045563366848883735e-02
9.965128148869309421e-03    9.965128148869309421e-03
1.099586127207556390e-03    1.099586127207556390e-03
9.945433911360671605e-05    9.945433911360671605e-05
7.620541308896932986e-06    7.620541308896932986e-06
5.064719744437437483e-07    5.064719744437437483e-07
2.971814140421127963e-08    2.971814140421127963e-08
1.560886564391797127e-09    1.560886564391797127e-09
7.419902813631537848e-11    7.419902813631537848e-11
3.221406076314455686e-12    3.221406076314455686e-12
1.289813102624490508e-13    1.289813102624490508e-13
4.440892098500626162e-15    4.440892098500626162e-15
-4.440892098500626162e-16   -4.440892098500626162e-16
ddot: -4.440892098500626162e-16
dgemv: -4.440892098500626162e-16
ddot == dgemv? YES
dgemv is 0.0? NO
```

Here is the binary file that contains the 16x16 matrix and the 16x1 vector (NOTE: this is a binary data file; the extension was changed to make GitHub happy): reproduction.txt

Other Notes

We have done extensive testing in other BLAS-like environments, and they all return a result close to the expected -4e-16, which passes our test. Both MATLAB (2023a) and numpy (1.26 with MKL) return a result very close to what we expect and pass our test. And, obviously, the naive matrix multiplication in the reproduction code gives the expected result as well.

The matrix in question is not overly ill-conditioned; it has a condition number of ~10.

martin-frbg commented 10 months ago

Somewhat surprising, as the three CPU generations would be using different optimized implementations of the GEMV BLAS kernel.

martin-frbg commented 10 months ago

On first glance this appears to be some FMA-related effect from letting the compiler use AVX instructions. It is possible to obtain the netlib result by building for TARGET=GENERIC, but if I reconfigure any TARGET to use the same unoptimized, plain-C GEMV kernel without changing the compiler options in Makefile.x86_64, I end up with an "intermediate" result, although there is no other BLAS function in the call graph. (Unfortunately removing -mavx breaks some code that uses intrinsics, so it will take more time to confirm that suspicion - later.)

brada4 commented 10 months ago

It is one bit of precision off, a very normal occurrence when computing in a different order.

augusew1 commented 10 months ago

I don't think that's right: this is roughly -2^-51, and machine precision should be 2^-53. Even if that is "one bit off", the result is exactly 0.0. Surely that's a bug.

If you do:

```cpp
auto* c_blas = new double[16]();
c_blas[15] = 1.0e-18;
CBlasMatrixMultiply(A, b, c_blas);
printf("%s", c_blas[15] == 1.0e-18 ? "YES" : "NO");
```

Then it will return YES. It's not even changing the value. Our initial thought was that this was some sort of memory alignment issue due to the matrix size. We couldn't fix it with manual alignment to 8 bytes though.
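
For reference, the netlib value above is exactly -2^-51, i.e. two units in the last place of 1.0. A quick standalone check (not part of the original reproduction, and independent of any BLAS library):

```cpp
// Standalone check: the netlib result -4.4408920985e-16 is exactly -2^-51,
// i.e. -2 * DBL_EPSILON, consistent with a rounding residue of a couple of
// ulps rather than a "real" nonzero value.
#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    const double netlib_result = -4.440892098500626162e-16;
    printf("-2^-51         = %.18e\n", -std::ldexp(1.0, -51));
    printf("-2*DBL_EPSILON = %.18e\n", -2.0 * DBL_EPSILON);
    printf("equal to netlib value? %s\n",
           netlib_result == -std::ldexp(1.0, -51) ? "YES" : "NO");
    return 0;
}
```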

brada4 commented 10 months ago

You abuse machine rounding precision 32 times (or 16 with FMA); discount 5 bits in your check.

It is not some magic symbolic computation soup that gives an exact polynomial result every time.

augusew1 commented 10 months ago

I do not expect an identical result, but this result is exactly the starting value of c_blas[15], every time, for every value. 32 (or 16) floating point operations and they all cancel out? Every time? For any starting value? Exactly? On multiple CPUs? This is the puzzle here, not why it's not exactly matching netlib/MKL.

brada4 commented 10 months ago

The result is rounded to output precision and stored in a register every 1 or 2 FLOPs, 50% up and 50% down, and so the lottery continues until the end of the computation. Yes, workload splitting affects the result.
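
A minimal sketch of that effect (unrelated to the actual OpenBLAS kernels): the same four terms, accumulated left-to-right versus with two partial sums as a vectorized kernel might do, differ by one unit in the last place.

```cpp
// Same terms, different association, result differs by 1 ulp.
#include <cstdio>

int main() {
    const double terms[4] = {1.0, 1e-16, 1e-16, 1e-16};

    // Sequential left-to-right accumulation, like a naive scalar loop.
    double seq = 0.0;
    for (double t : terms) seq += t;

    // Two partial accumulators combined at the end, like a SIMD kernel.
    double acc0 = terms[0] + terms[2];
    double acc1 = terms[1] + terms[3];
    double split = acc0 + acc1;

    printf("sequential: %.17g\n", seq);    // 1
    printf("split:      %.17g\n", split);  // 1.0000000000000002
    printf("equal? %s\n", seq == split ? "YES" : "NO");
    return 0;
}
```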

brada4 commented 10 months ago

Just a guess: Intel uses generic code for small inputs, then gradually jumps to vector code and adds CPU threads as the problem grows. OpenBLAS always uses vector code and switches to all CPUs at one point.

martin-frbg commented 10 months ago

> Then it will return YES. It's not even changing the value. Our initial thought was that this was some sort of memory alignment issue due to the matrix size. We couldn't fix it with manual alignment to 8 bytes though.

As far as I can tell, all terms cancel with the "right" evaluation order and y[15] evaluates to an "exact" zero within the limits of precision. As this is added to what the c_blas array initially contained (the "beta times y" part of the GEMV equation, beta being one in your case), you see no change. There appears to be a loss of precision in the AVX2 "microkernel" used for Haswell and newer, due to operand size limitations in the instruction set. Certainly not ideal, but not a catastrophic failure either (which would certainly have shown up in test suites during the almost ten years this code has been in place).
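
A sketch of the reference-style update y := alpha*A*x + beta*y (not OpenBLAS's optimized code path) makes the mechanism visible: if the accumulated row sum rounds to exactly 0.0, then with beta == 1 the caller's y[i] passes through untouched.

```cpp
// Reference-style row-major GEMV update, illustrative only.
#include <cstddef>
#include <cstdio>

void ref_dgemv(size_t m, size_t n, double alpha, const double* A, size_t lda,
               const double* x, double beta, double* y) {
    for (size_t i = 0; i < m; ++i) {
        double sum = 0.0;                       // row dot-product accumulator
        for (size_t j = 0; j < n; ++j)
            sum += A[i * lda + j] * x[j];
        y[i] = alpha * sum + beta * y[i];       // beta == 1: y[i] + alpha*sum
    }
}

int main() {
    // Toy row whose products cancel exactly in this evaluation order.
    const double A[1 * 2] = {1.1 * 1.33, -(1.1 * 1.33)};
    const double x[2] = {1.0, 1.0};
    double y[1] = {1.0e-18};                    // pre-existing y, beta = 1

    ref_dgemv(1, 2, 1.0, A, 2, x, 1.0, y);
    printf("y[0] unchanged? %s (%.18e)\n", y[0] == 1.0e-18 ? "YES" : "NO", y[0]);
    return 0;
}
```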

augusew1 commented 10 months ago

> There appears to be a loss of precision in the AVX2 "microkernel" used for Haswell and newer, due to operand size limitations in the instruction set. Certainly not ideal, but not a catastrophic failure either (which would certainly have shown up in test suites during the almost ten years this code has been in place).

I can't understand this conclusion, given that I can reproduce this with an OPENBLAS_CORETYPE setting that doesn't seem to use AVX2. Am I misunderstanding what this environment variable does? Is AVX2 always checked even when an older CPU is manually set?

Here's what I'm testing on:

lscpu

```
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  6
  On-line CPU(s) list:   0-5
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  1
    Core(s) per socket:  1
    Socket(s):           6
    Stepping:            0
    BogoMIPS:            5786.40
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:     VMware
  Virtualization type:   full
Caches (sum of all):
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    6 MiB (6 instances)
  L3:                    132 MiB (6 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-5
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Mitigation; Clear CPU buffers; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Mitigation; IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

This is on Ubuntu 22.04, with the libopenblas-pthread-dev package. I can also recreate this with OpenBLAS 0.3.24 on MSYS2, but that machine is Coffee Lake and can't use SkylakeX instructions.

I'm using OPENBLAS_VERBOSE=5 to print the core type at the top of each run:

OPENBLAS_CORETYPE=Nehalem (No AVX2)

```
Core: Nehalem
 BLAS                        MAT
1.175201193643801378e+00    1.175201193643801822e+00
1.103638323514327002e+00    1.103638323514327224e+00
3.578143506473726032e-01    3.578143506473724922e-01
7.045563366848886511e-02    7.045563366848883735e-02
9.965128148869364932e-03    9.965128148869309421e-03
1.099586127207521913e-03    1.099586127207556390e-03
9.945433911356243994e-05    9.945433911360671605e-05
7.620541308817063708e-06    7.620541308896932986e-06
5.064719744429790893e-07    5.064719744437437483e-07
2.971814136443207133e-08    2.971814140421127963e-08
1.560886614404566330e-09    1.560886564391797127e-09
7.419920233786569952e-11    7.419902813631537848e-11
3.221423128252354218e-12    3.221406076314455686e-12
1.287858708565181587e-13    1.289813102624490508e-13
4.440892098500626162e-15    4.440892098500626162e-15
0.000000000000000000e+00    -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
```

OPENBLAS_CORETYPE=SandyBridge (No AVX2)

```
Core: Sandybridge
 BLAS                        MAT
1.175201193643801378e+00    1.175201193643801822e+00
1.103638323514327002e+00    1.103638323514327224e+00
3.578143506473726032e-01    3.578143506473724922e-01
7.045563366848886511e-02    7.045563366848883735e-02
9.965128148869364932e-03    9.965128148869309421e-03
1.099586127207521913e-03    1.099586127207556390e-03
9.945433911356243994e-05    9.945433911360671605e-05
7.620541308817063708e-06    7.620541308896932986e-06
5.064719744429790893e-07    5.064719744437437483e-07
2.971814136443207133e-08    2.971814140421127963e-08
1.560886614404566330e-09    1.560886564391797127e-09
7.419920233786569952e-11    7.419902813631537848e-11
3.221423128252354218e-12    3.221406076314455686e-12
1.287858708565181587e-13    1.289813102624490508e-13
4.440892098500626162e-15    4.440892098500626162e-15
0.000000000000000000e+00    -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
```

OPENBLAS_CORETYPE=Haswell

```
Core: Haswell
 BLAS                        MAT
1.175201193643801378e+00    1.175201193643801822e+00
1.103638323514327002e+00    1.103638323514327224e+00
3.578143506473725477e-01    3.578143506473724922e-01
7.045563366848892062e-02    7.045563366848883735e-02
9.965128148869371871e-03    9.965128148869309421e-03
1.099586127207577424e-03    1.099586127207556390e-03
9.945433911373591229e-05    9.945433911360671605e-05
7.620541308983597162e-06    7.620541308896932986e-06
5.064719745540013918e-07    5.064719744437437483e-07
2.971814122565419325e-08    2.971814140421127963e-08
1.560886642160141946e-09    1.560886564391797127e-09
7.419920233786569952e-11    7.419902813631537848e-11
3.221201083647429186e-12    3.221406076314455686e-12
1.292299600663682213e-13    1.289813102624490508e-13
4.440892098500626162e-15    4.440892098500626162e-15
0.000000000000000000e+00    -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
```

OPENBLAS_CORETYPE=SkylakeX (AVX512)

```
Core: SkylakeX
 BLAS                        MAT
1.175201193643801378e+00    1.175201193643801822e+00
1.103638323514327002e+00    1.103638323514327224e+00
3.578143506473725477e-01    3.578143506473724922e-01
7.045563366848892062e-02    7.045563366848883735e-02
9.965128148869371871e-03    9.965128148869309421e-03
1.099586127207577424e-03    1.099586127207556390e-03
9.945433911373591229e-05    9.945433911360671605e-05
7.620541308983597162e-06    7.620541308896932986e-06
5.064719745540013918e-07    5.064719744437437483e-07
2.971814122565419325e-08    2.971814140421127963e-08
1.560886642160141946e-09    1.560886564391797127e-09
7.419920233786569952e-11    7.419902813631537848e-11
3.221201083647429186e-12    3.221406076314455686e-12
1.292299600663682213e-13    1.289813102624490508e-13
4.440892098500626162e-15    4.440892098500626162e-15
0.000000000000000000e+00    -4.440892098500626162e-16
ddot: -8.881784197001252323e-16
dgemv: 0.000000000000000000e+00
ddot == dgemv? NO
dgemv is 0.0? YES
```

In my testing, I'm seeing that neither AVX2 nor AVX matters, as Nehalem and SandyBridge give identical results. AVX2 on Haswell gives a slightly different result, but one that's well within machine precision. The biggest difference I see is with SkylakeX and, presumably, AVX512, which gives a different result for ddot only. This value is "correct" to us, as it is within our test tolerance and matches very closely what other BLAS libraries give.

Please help me understand how the AVX2 microkernel is the issue here. If OPENBLAS_CORETYPE does not actually control this, does OpenBLAS provide a way to disable AVX2 at runtime?

brada4 commented 10 months ago

It will fall back to older compute kernels if you do not have AVX2 in CPUID. The difference is 1-2 of the youngest bits of the significand and is expected. If you want the same result always, you have to use an unoptimized netlib build.

oscardssmith commented 1 month ago

The issue here is actually really simple: OpenBLAS's gemv isn't using FMA, which would also be faster.

For some Julia code demonstrating this:

```julia
A = [1.1 0 -1.1 0
     -1.1 0  1.1 0]
v = fill(1.33, 4)

A*v # returns [0.0, 0.0]

using MKL
A*v # returns [1.0391687510491465e-16, -1.0391687510491465e-16]

function simple_gemv_fma(A, v)
    result = zeros(eltype(A), size(A, 1))
    for i = 1:size(A, 1)
        for j = 1:size(A, 2)
            result[i] = fma(A[i,j], v[j], result[i])
        end
    end
    return result
end

function simple_gemv(A, v)
    result = zeros(eltype(A), size(A, 1))
    for i = 1:size(A, 1)
        for j = 1:size(A, 2)
            result[i] = A[i,j] * v[j] + result[i]
        end
    end
    return result
end

simple_gemv(A, v)     # returns [0.0, 0.0]
simple_gemv_fma(A, v) # returns [1.0391687510491465e-16, -1.0391687510491465e-16]
```

martin-frbg commented 1 month ago

> The issue here is actually really simple: OpenBLAS's gemv isn't using FMA, which

I don't think it is that simple, unless you meant to write "the reference BLAS isn't using FMA", which is trivially true. The OpenBLAS GEMV kernels for the CPUs mentioned here all use FMA instructions; the question is whether they could/should be rewritten to minimize the difference seen in this particular case.

oscardssmith commented 1 month ago

Hmm. If OpenBLAS GEMV is using FMA, what order is it running in to match the results of the naive loop without FMA? I saw that the results matched those of the obvious algorithm and assumed from there.
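
For reference, a C++ rendering of that contrast (a sketch, not the OpenBLAS kernel): a plain loop that rounds each product before adding cancels this row to exactly 0.0, while folding each term in with std::fma leaves the small residue that MKL and the Julia FMA variant above report.

```cpp
// Same row, same vector: the arrangement of roundings decides whether the
// result is exactly 0.0 or a ~1e-16 residue.
#include <cmath>
#include <cstdio>

int main() {
    const double row[4] = {1.1, 0.0, -1.1, 0.0};
    const double v[4]   = {1.33, 1.33, 1.33, 1.33};

    double rounded = 0.0, fused = 0.0;
    for (int j = 0; j < 4; ++j) {
        rounded += row[j] * v[j];               // product rounded, then added
        fused = std::fma(row[j], v[j], fused);  // single rounding per term
    }
    printf("rounded products: %.17g\n", rounded); // 0
    printf("fma accumulation: %.17g\n", fused);   // ~1.04e-16, as in the Julia/MKL output
    return 0;
}
```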