OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License
6.33k stars 1.49k forks source link

Julia's tests fail with OpenBLAS on Intel CPUs without AVX #4176

Closed felixonmars closed 1 year ago

felixonmars commented 1 year ago

Hello, I am getting the following test failures in Julia (either 1.8 or 1.9):

$ julia /usr/share/julia/stdlib/v1.9/LinearAlgebra/test/blas.jl
...
axp(b)y!: Test Failed at /usr/share/julia/stdlib/v1.8/LinearAlgebra/test/blas.jl:702
  Expression: BLAS.axpy!(α, a, copy(b)) ≈ α * a + b
   Evaluated: ComplexF64[1.9330187934128453 - 8.970730564994865im, -0.6963896877405945 + 0.8611170231579576im, -0.6963896877405945 + 0.8611170231579576im, -0.6963896877405945 + 0.8611170231579576im, -0.6963896877405945 + 0.8611170231579576im, -0.6963896877405945 + 0.8611170231579576im, -0.6963896877405945 + 0.8611170231579576im, -0.6963896877405945 + 0.8611170231579576im, -0.6963896877405945 + 0.8611170231579576im, -0.6963896877405945 + 0.8611170231579576im] ≈ ComplexF64[-0.4334488396252506 - 0.12206773565732487im, -0.4334488396252506 - 0.12206773565732487im, -0.4334488396252506 - 0.12206773565732487im, -0.4334488396252506 - 0.12206773565732487im, -0.4334488396252506 - 0.12206773565732487im, -0.4334488396252506 - 0.12206773565732487im, -0.4334488396252506 - 0.12206773565732487im, -0.4334488396252506 - 0.12206773565732487im, -0.4334488396252506 - 0.12206773565732487im, -0.4334488396252506 - 0.12206773565732487im]

The same error is reproducible on NixOS or Arch, but only on machines without AVX capability.

Reproducible:

Not reproducible:

I have tried to set different OPENBLAS_CORETYPE but it never changed the result. I could get SIGILL immediately when setting it to something that the CPU doesn't support, though, so it did make a difference.

The error is gone when running Julia with Netlib blas, though. So my best guess for now is a problem in OpenBLAS.

martin-frbg commented 1 year ago

which version of OpenBLAS are you using ? (But anything without AVX is likely to use assembly kernels from the original GotoBLAS of ~15 years ago in any case). A non-Julia test case would be helpful, but the simplest solution is probably to retire the zaxpy_sse.S/zaxpy_sse2.S in the fallback x86_64/KERNEL config in favor of the generic C implementation in arm/zaxpy.c

felixonmars commented 1 year ago

It is using OpenBLAS 0.3.23 with 64-bit interface. I am not very familiar with Julia either and didn't come up with a non-Julia test case in trivial ways.

I'll first try to remove zaxpy_sse/zaxpy_sse2 then.

brada4 commented 1 year ago

am not very familiar with Julia

Then should not be a problem adding breakpoint to zaxpy call and extract argument N at least?

felixonmars commented 1 year ago

am not very familiar with Julia

Then should not be a problem adding breakpoint to zaxpy call and extract argument N at least?

Thanks for the pointer! I actually didn't know which call it was. Here it is:

Thread 1 "julia" hit Breakpoint 3.2, zaxpy_ (N=0x7fffffffa0d0, ALPHA=0x7fffffffa0c0, x=0x7fffe91b9080, INCX=0x7fffffffa0e0, y=0x7fffe8917ae0, INCY=0x7fffffffa0f0) at /usr/src/debug/openblas/OpenBLAS-0.3.23/interface/zaxpy.c:51
51      void NAME(blasint *N, FLOAT *ALPHA, FLOAT *x, blasint *INCX, FLOAT *y, blasint *INCY){
(gdb) p *N
$2 = 10
(gdb) p *ALPHA
$3 = 0.46587430189458123
(gdb) p *x
$4 = -1.5355251494850719
(gdb) p *INCX
$5 = 0
(gdb) p *y
$6 = 0.45417284108880224
(gdb) p *INCY
$7 = 1
felixonmars commented 1 year ago

I'll first try to remove zaxpy_sse/zaxpy_sse2 then.

I have applied the following change and the problem is gone:

$ git diff
diff --git a/kernel/x86_64/KERNEL b/kernel/x86_64/KERNEL
index bea7036c2..0bede7680 100644
--- a/kernel/x86_64/KERNEL
+++ b/kernel/x86_64/KERNEL
@@ -83,7 +83,7 @@ CAXPYKERNEL = zaxpy_sse.S
 endif

 ifndef ZAXPYKERNEL
-ZAXPYKERNEL = zaxpy_sse2.S
+ZAXPYKERNEL = zaxpy.c
 endif

 ifndef QAXPYKERNEL
felixonmars commented 1 year ago

I have tried further to manually revert the changes introduced to zaxpy_sse2.S in 0cfd29a819a7182111b6a4d63a653d011db561de, and this indeed fixed the problem.

diff --git a/kernel/x86_64/zaxpy_sse2.S b/kernel/x86_64/zaxpy_sse2.S
index a7dd054fb..aee8a8bcd 100644
--- a/kernel/x86_64/zaxpy_sse2.S
+++ b/kernel/x86_64/zaxpy_sse2.S
@@ -1416,12 +1416,6 @@

        movq    Y, YY
        movq    M,  %rax
-//If incx==0 || incy==0, avoid unloop and jump to end.
-       cmpq    $0, INCX
-       je  .L58
-       cmpq    $0, INCY
-       je      .L58
-
        sarq    $3, %rax
        jle     .L55

@@ -1775,7 +1769,6 @@
        andq    $1, %rax
        jle     .L999

-.L58:
        MOVDDUP( 0 * SIZE, X, %xmm0)
        MOVDDUP( 1 * SIZE, X, %xmm1)

@@ -1788,9 +1781,6 @@

        movlpd  %xmm8,   0 * SIZE(YY)
        movhpd  %xmm8,   1 * SIZE(YY)
-
-       decq %rax
-       jg      .L58
        ALIGN_3

 .L999:
brada4 commented 1 year ago

Cool

Harry-Chen commented 1 year ago

@felixonmars has run some tests. A very simple case is:

N = 2
alpha = 0.5 + 0i
A = [0.5 + 0i]
incx = 0
B = [0.5 + 0i, 0.5 + 0i]
incy = 1

With sse kernel we get [0.5 + 0i, 0.5 + 0i], with sse2 we get [1.0 + 0i, 0.5 + 0i].

This should be easy enough for you to debug.

martin-frbg commented 1 year ago

Nice, thanks - I had only gotten to confirming that zaxpy.c would indeed fix it. It was almost bound to be one of those corner cases, as all our usual tests (LAPACK testsuite and BLAS-Tester which is basically the original ATLAS testsuite) would not catch it. I'll add your test case to the "utest" collection later

martin-frbg commented 1 year ago

Now the question is if the change can really be reverted, or if that will bring back the (unfortunately poorly described) problem from #7 that led to its introduction. (I notice we already have an utest for ZAXPY, only it is testing the case where both X and Y increments are zero, for which the shortcut in zaxpy_sse2.S appears to be appropriate)

brada4 commented 1 year ago

Seems julia test puts repeated same values in inputs, so outcome has to be as level as that.

martin-frbg commented 1 year ago

Sorry, I do not understand what you just wrote ?

brada4 commented 1 year ago

If you look at julia test openblas result gets compared to julia interpreted calculation

input arrays have 10x same values each that even without julia result comparison it is clear that asm kernel messed it up.

martin-frbg commented 1 year ago

I don't think there's any doubt that the changes made for #7 were only correct for the case where both incx and incy are zero, the only question is whether it is safe to revert them completely, or if the assembly must be reworked so that a loop over incy is retained (which does not seem to be trivial)