Closed: RevySR closed this issue 1 year ago
Actually, OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat2 < ./sblat2.dat still times out.
Attach gdb to process and at least get backtrace:
$ script
$ gdb
gdb> attach <pid>
gdb> t a a bt
gdb> detach
gdb> quit
$ ^D
qemu-user still times out.
(gdb) bt
#0 sdot_k (n=n@entry=0, x=x@entry=0x3ff4967d78, inc_x=inc_x@entry=1, y=y@entry=0x3ff4963140, inc_y=inc_y@entry=1)
at /usr/lib/gcc/riscv64-linux-gnu/10/include/riscv_vector.h:954
#1 0x0000002addb94072 in ssbmv_U (n=n@entry=1, k=k@entry=0, alpha=alpha@entry=1, a=a@entry=0x3ff4967d78, lda=lda@entry=2,
x=x@entry=0x3ff4963140, incx=incx@entry=1, y=y@entry=0x3ff4963550, incy=incy@entry=1, buffer=buffer@entry=0x3f8d1c4000)
at sbmv_k.c:77
#2 0x0000002addb92fd4 in ssbmv_ (UPLO=UPLO@entry=0x3ff49626e0 "U9\226\364?", N=N@entry=0x3ff4962454, K=K@entry=0x3ff4962440,
ALPHA=<optimized out>, a=a@entry=0x3ff4967d78, LDA=LDA@entry=0x3ff4962448, x=x@entry=0x3ff4963140, INCX=<optimized out>,
BETA=0x3ff4962428, y=y@entry=0x3ff4963550, INCY=0x3ff496243c) at sbmv.c:206
#3 0x0000002addb8c81e in schk2 (sname=..., eps=1.1920929e-07, thresh=16, nout=6, ntra=-1, trace=.FALSE., rewi=.FALSE., fatal=.FALSE.,
nidim=7, idim=..., nkb=4, kb=..., nalf=3, alf=..., nbet=3, bet=..., ninc=4, inc=..., nmax=65, incmax=2, a=..., aa=..., as=...,
x=..., xx=..., xs=..., y=..., yy=..., ys=..., yt=..., g=..., _sname=_sname@entry=6) at sblat2.f:955
#4 0x0000002addb919ba in sblat2 () at sblat2.f:350
#5 0x0000002addb8790a in main (argc=<optimized out>, argv=<optimized out>) at sblat2.f:429
#6 0x0000003f8f20aad6 in __gconv_transform_internal_ascii (step=0x2addb87888 <__memset_chk@plt+8>, data=0xffffffffffffffd1,
inptrp=<optimized out>, inend=<optimized out>, outbufstart=<optimized out>, irreversible=0x26a62951766, do_flush=<optimized out>,
consume_incomplete=<optimized out>) at ../iconv/skeleton.c:501
Backtrace stopped: frame did not save the PC
Thanks. I'm seeing the hangs as well (in qemu), but I'm not convinced sdot_k itself (dot_vector.c) is at fault, as sblat1 runs without error. (Not really sure if this ever worked, but most likely I made a mistake in my adaptation of the old code. I just barely got my Lichee RV (C906-based) to work, and it appears to be even slower than cross-compiler & qemu, and less stable...)
Looks to be an optimizer thing with the Xuantie compiler at least - the BLAS2 tests run fine when everything is compiled at -O0 -g
(Same for BLAS3)
With OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat2 < ./sblat2.dat at -O0, the test case passes. Tested on Milk-V.
and -O1
also works... I'm now trying with -O1 for the LAPACK (Fortran) parts only
Yes, that works (change in Makefile.system around line 1614):
-LAPACK_FFLAGS := $(FFLAGS)
+LAPACK_FFLAGS := $(filter-out -O2,$(FFLAGS))
+LAPACK_FFLAGS += -O1
(Now the potrf tests from openblas_utest no longer hang either.) Remaining oddities: the dblat1 test reports a failure in IDAMAX (although it uses basically the same source file as ISAMAX, iamax_vector.c), and utest reports a loss of precision in DSDOT.
In Debian, with CORE=C910V and T-Head gcc:

| CFLAGS | FFLAGS | Result |
|---|---|---|
| -O2 | -O2 | ❌ Timeout |
| -O2 | -O1 | ❌ Timeout |
| -O1 | -O2 | ✅ Passed |
| -O1 | -O1 | ✅ Passed |
Judging from the table, the result depends on CFLAGS rather than FFLAGS, so I think the change added to Makefile.system should not work.
Pity, it works for me (in Ubuntu 22, cross-compiling from x86_64 and testing in qemu). I have pretty much given up trying to compile on the LicheeRV directly. It must be some difference between the gcc 10.2 from the Xuantie 2.6.1 cross toolchain and your native riscv64 build of gcc 10.4.
This issue is caused by the implementation of the sdot_k function in the kernel/riscv64/dot_vector.c file.
#if defined(DSDOT)
double CNAME(BLASLONG n, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y)
#else
FLOAT CNAME(BLASLONG n, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y)
#endif
{
    BLASLONG i=0, j=0;
    double dot = 0.0;

    if ( n < 0 ) return(dot);

    FLOAT_V_T vr, vx, vy;
    unsigned int gvl = 0;
    FLOAT_V_T_M1 v_res, v_z0;
    gvl = VSETVL_MAX;
    v_res = VFMVVF_FLOAT_M1(0, gvl);
    v_z0 = VFMVVF_FLOAT_M1(0, gvl);

    if(inc_x == 1 && inc_y == 1){
        gvl = VSETVL(n);
        vr = VFMVVF_FLOAT(0, gvl);
        for(i=0,j=0; i<n/gvl; i++){
            vx = VLEV_FLOAT(&x[j], gvl);
            vy = VLEV_FLOAT(&y[j], gvl);
            vr = VFMACCVV_FLOAT(vr, vx, vy, gvl);
            j += gvl;
        }
        ......
}
The parameter n may be zero, and when n is zero, gvl is also zero. This causes the loop for(i=0,j=0; i<n/gvl; i++) to become infinite, because executing 0/0 in qemu-user yields -1 without throwing a divide-by-zero exception. That is a bit strange, but it is the case; I don't know whether this was deliberately designed this way or if there are other reasons for it.

So I think this piece of code lacks handling for the case where n is zero, which leads to this issue.
Due to optimization, there is a difference in the code generated for the loop header between O2 and O1. The purpose of the loop header is to check whether n/gvl is less than or equal to zero, so that the loop for(i=0,j=0; i<n/gvl; i++) can be skipped entirely.
The codegen by O1 is:

    div t3,a0,a5
    ble t3,zero,.L23

The codegen by O2 is:

    div t4,a0,t1
    beq t4,zero,.L23
Since it can be inferred from the context that n is greater than or equal to 0, and gvl is also greater than or equal to 0 because it is an unsigned type, the compiler can treat ble as equivalent to beq here. So the code generation itself is not the problem; it is the code, combined with QEMU's division behavior, that causes the inconsistency between the O1 and O2 options.
Thank you very much for the detailed explanation. I must admit I had already noticed the "n < 0" conditional (which was optimized to "n < 1" in all other variants of the DOT implementation some time ago) but did not realize its implications for the loop. And I had not committed my serendipitous fix, as I am still working on fixing the loss of precision in DSDOT :(
Not aware of anything wrong with zaxpy (at least nothing in the BLAS tests; I have not had time to study the LAPACK test log). The issue here was that the tests would time out due to the infinite loop in dot_vector.c, as Cooper-Qu analyzed above, solved by last night's commit.
Tested on Milk-V.
Full log (in Debian): openblas_test_timeout.build.log