Closed felixonmars closed 1 year ago
which version of OpenBLAS are you using ? (But anything without AVX is likely to use assembly kernels from the original GotoBLAS of ~15 years ago in any case). A non-Julia test case would be helpful, but the simplest solution is probably to retire the zaxpy_sse.S/zaxpy_sse2.S in the fallback x86_64/KERNEL config in favor of the generic C implementation in arm/zaxpy.c
It is using OpenBLAS 0.3.23 with 64-bit interface. I am not very familiar with Julia either and didn't come up with a non-Julia test case in trivial ways.
I'll first try to remove zaxpy_sse
/zaxpy_sse2
then.
am not very familiar with Julia
Then should not be a problem adding breakpoint to zaxpy call and extract argument N at least?
am not very familiar with Julia
Then should not be a problem adding breakpoint to zaxpy call and extract argument N at least?
Thanks for the pointer! I actually didn't know which call it was. Here it is:
Thread 1 "julia" hit Breakpoint 3.2, zaxpy_ (N=0x7fffffffa0d0, ALPHA=0x7fffffffa0c0, x=0x7fffe91b9080, INCX=0x7fffffffa0e0, y=0x7fffe8917ae0, INCY=0x7fffffffa0f0) at /usr/src/debug/openblas/OpenBLAS-0.3.23/interface/zaxpy.c:51
51 void NAME(blasint *N, FLOAT *ALPHA, FLOAT *x, blasint *INCX, FLOAT *y, blasint *INCY){
(gdb) p *N
$2 = 10
(gdb) p *ALPHA
$3 = 0.46587430189458123
(gdb) p *x
$4 = -1.5355251494850719
(gdb) p *INCX
$5 = 0
(gdb) p *y
$6 = 0.45417284108880224
(gdb) p *INCY
$7 = 1
I'll first try to remove
zaxpy_sse
/zaxpy_sse2
then.
I have applied the following change and the problem is gone:
$ git diff
diff --git a/kernel/x86_64/KERNEL b/kernel/x86_64/KERNEL
index bea7036c2..0bede7680 100644
--- a/kernel/x86_64/KERNEL
+++ b/kernel/x86_64/KERNEL
@@ -83,7 +83,7 @@ CAXPYKERNEL = zaxpy_sse.S
endif
ifndef ZAXPYKERNEL
-ZAXPYKERNEL = zaxpy_sse2.S
+ZAXPYKERNEL = zaxpy.c
endif
ifndef QAXPYKERNEL
I have tried further to manually revert the changes introduced to zaxpy_sse2.S in 0cfd29a819a7182111b6a4d63a653d011db561de, and this indeed fixed the problem.
diff --git a/kernel/x86_64/zaxpy_sse2.S b/kernel/x86_64/zaxpy_sse2.S
index a7dd054fb..aee8a8bcd 100644
--- a/kernel/x86_64/zaxpy_sse2.S
+++ b/kernel/x86_64/zaxpy_sse2.S
@@ -1416,12 +1416,6 @@
movq Y, YY
movq M, %rax
-//If incx==0 || incy==0, avoid unloop and jump to end.
- cmpq $0, INCX
- je .L58
- cmpq $0, INCY
- je .L58
-
sarq $3, %rax
jle .L55
@@ -1775,7 +1769,6 @@
andq $1, %rax
jle .L999
-.L58:
MOVDDUP( 0 * SIZE, X, %xmm0)
MOVDDUP( 1 * SIZE, X, %xmm1)
@@ -1788,9 +1781,6 @@
movlpd %xmm8, 0 * SIZE(YY)
movhpd %xmm8, 1 * SIZE(YY)
-
- decq %rax
- jg .L58
ALIGN_3
.L999:
Cool
@felixonmars has run some tests. A very simple case is:
N = 2
alpha = 0.5 + 0i
A = [0.5 + 0i]
incx = 0
B = [0.5 + 0i, 0.5 + 0i]
incy = 1
With sse
kernel we get [0.5 + 0i, 0.5 + 0i]
, with sse2
we get [1.0 + 0i, 0.5 + 0i]
.
This should be easy enough for you to debug.
Nice, thanks - I had only gotten to confirming that zaxpy.c would indeed fix it. It was almost bound to be one of those corner cases, as all our usual tests (LAPACK testsuite and BLAS-Tester which is basically the original ATLAS testsuite) would not catch it. I'll add your test case to the "utest" collection later
Now the question is if the change can really be reverted, or if that will bring back the (unfortunately poorly described) problem from #7 that led to its introduction. (I notice we already have an utest for ZAXPY, only it is testing the case where both X and Y increments are zero, for which the shortcut in zaxpy_sse2.S appears to be appropriate)
Seems julia test puts repeated same values in inputs, so outcome has to be as level as that.
Sorry, I do not understand what you just wrote ?
If you look at julia test openblas result gets compared to julia interpreted calculation
input arrays have 10x same values each that even without julia result comparison it is clear that asm kernel messed it up.
I don't think there's any doubt that the changes made for #7 were only correct for the case where both incx and incy are zero, the only question is whether it is safe to revert them completely, or if the assembly must be reworked so that a loop over incy is retained (which does not seem to be trivial)
Hello, I am getting the following test failures in Julia (either 1.8 or 1.9):
The same error is reproducible on NixOS or Arch, but only on machines without AVX capability.
Reproducible:
Not reproducible:
I have tried to set different
OPENBLAS_CORETYPE
but it never changed the result. I could get SIGILL immediately when setting it to something that the CPU doesn't support, though, so it did make a difference.The error is gone when running Julia with Netlib blas, though. So my best guess for now is a problem in OpenBLAS.