OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

TARGET=C910V zblat2 test timeout #4098

Closed RevySR closed 1 year ago

RevySR commented 1 year ago

Test on milk-v

OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat2 < ./sblat2.dat
f77 -march=rv64gcv0p7_zfh_xtheadc -mabi=lp64d -g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -frecursive  -fno-tree-vectorize -Wl,-z,relro -o cblat3 cblat3.o ../libopenblas_riscv64_genericp-r0.3.23.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/riscv64-linux-gnu/10 -L/lib/riscv64-linux-gnu -L/usr/lib/riscv64-linux-gnu  -lc
rm -f ?BLAT3.SUMM
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat3 < ./sblat3.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./dblat3 < ./dblat3.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./cblat3 < ./cblat3.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./zblat3 < ./zblat3.dat
rm -f ?BLAT3.SUMM
OPENBLAS_NUM_THREADS=2 ./sblat3 < ./sblat3.dat
OPENBLAS_NUM_THREADS=2 ./dblat3 < ./dblat3.dat
OPENBLAS_NUM_THREADS=2 ./cblat3 < ./cblat3.dat
OPENBLAS_NUM_THREADS=2 ./zblat3 < ./zblat3.dat
E: Build killed with signal TERM after 150 minutes of inactivity
--------------------------------------------------------------------------------

full log: (in Debian) openblas_test_timeout.build.log

RevySR commented 1 year ago

Actually, it is OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat2 < ./sblat2.dat that times out.

brada4 commented 1 year ago

Attach gdb to process and at least get backtrace:

$ script
$ gdb
gdb> attach <pid>
gdb> t a a bt
gdb> detach
gdb> quit
$ ^D
RevySR commented 1 year ago

It still times out under qemu-user.

RevySR commented 1 year ago

(gdb) bt
#0  sdot_k (n=n@entry=0, x=x@entry=0x3ff4967d78, inc_x=inc_x@entry=1, y=y@entry=0x3ff4963140, inc_y=inc_y@entry=1)
    at /usr/lib/gcc/riscv64-linux-gnu/10/include/riscv_vector.h:954
#1  0x0000002addb94072 in ssbmv_U (n=n@entry=1, k=k@entry=0, alpha=alpha@entry=1, a=a@entry=0x3ff4967d78, lda=lda@entry=2,
    x=x@entry=0x3ff4963140, incx=incx@entry=1, y=y@entry=0x3ff4963550, incy=incy@entry=1, buffer=buffer@entry=0x3f8d1c4000)
    at sbmv_k.c:77
#2  0x0000002addb92fd4 in ssbmv_ (UPLO=UPLO@entry=0x3ff49626e0 "U9\226\364?", N=N@entry=0x3ff4962454, K=K@entry=0x3ff4962440,
    ALPHA=<optimized out>, a=a@entry=0x3ff4967d78, LDA=LDA@entry=0x3ff4962448, x=x@entry=0x3ff4963140, INCX=<optimized out>,
    BETA=0x3ff4962428, y=y@entry=0x3ff4963550, INCY=0x3ff496243c) at sbmv.c:206
#3  0x0000002addb8c81e in schk2 (sname=..., eps=1.1920929e-07, thresh=16, nout=6, ntra=-1, trace=.FALSE., rewi=.FALSE., fatal=.FALSE.,
    nidim=7, idim=..., nkb=4, kb=..., nalf=3, alf=..., nbet=3, bet=..., ninc=4, inc=..., nmax=65, incmax=2, a=..., aa=..., as=...,
    x=..., xx=..., xs=..., y=..., yy=..., ys=..., yt=..., g=..., _sname=_sname@entry=6) at sblat2.f:955
#4  0x0000002addb919ba in sblat2 () at sblat2.f:350
#5  0x0000002addb8790a in main (argc=<optimized out>, argv=<optimized out>) at sblat2.f:429
#6  0x0000003f8f20aad6 in __gconv_transform_internal_ascii (step=0x2addb87888 <__memset_chk@plt+8>, data=0xffffffffffffffd1,
    inptrp=<optimized out>, inend=<optimized out>, outbufstart=<optimized out>, irreversible=0x26a62951766, do_flush=<optimized out>,
    consume_incomplete=<optimized out>) at ../iconv/skeleton.c:501
Backtrace stopped: frame did not save the PC
martin-frbg commented 1 year ago

Thanks. I'm seeing the hangs as well (in qemu), but I'm not convinced sdot_k itself (dot_vector.c) is at fault, as sblat1 runs without error. (I'm not really sure this ever worked, but most likely I made a mistake in my adaptation of the old code. I only barely got my Lichee RV (C906-based) to work, and it appears to be even slower than cross-compiling and running in qemu, and less stable...)

martin-frbg commented 1 year ago

Looks to be an optimizer thing with the Xuantie compiler at least - the BLAS2 tests run fine when everything is compiled at -O0 -g (Same for BLAS3)

RevySR commented 1 year ago

Confirmed on milk-v: when built with -O0, OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat2 < ./sblat2.dat passes.

martin-frbg commented 1 year ago

and -O1 also works... I'm now trying with -O1 for the LAPACK (fortran) parts only

martin-frbg commented 1 year ago

Yes, that works. The change is in Makefile.system, around line 1614:

-LAPACK_FFLAGS := $(FFLAGS)
+LAPACK_FFLAGS := $(filter-out -O2,$(FFLAGS))
+LAPACK_FFLAGS += -O1

With this, the potrf tests from openblas_utest no longer hang either. Remaining oddities: the dblat1 test reports a failure in IDAMAX (although it uses basically the same source file as ISAMAX, iamax_vector.c), and utest reports a loss of precision in DSDOT.

RevySR commented 1 year ago

In Debian CORE=C910V & thead gcc:

CFLAGS   FFLAGS   Result
-O2      -O2      ❌ Timeout
-O2      -O1      ❌ Timeout
-O1      -O2      ✅ Passed
-O1      -O1      ✅ Passed

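So I don't think the change added to Makefile.system can be what fixes it here: with FFLAGS at -O1 and CFLAGS at -O2 the test still times out, and it passes whenever CFLAGS is -O1.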

martin-frbg commented 1 year ago

Pity, it works for me (in Ubuntu 22, cross-compiling from x86_64 and testing in qemu) - I have pretty much given up trying to compile on the LicheeRV directly. Must be some difference between the gcc-10.2 from the Xuantie 2.6.1 cross toolchain and your native riscv64 build of gcc 10.4

Cooper-Qu commented 1 year ago

This issue is caused by the implementation of the sdot_k function in the kernel/riscv64/dot_vector.c file.

#if defined(DSDOT)
double CNAME(BLASLONG n, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y)
#else
FLOAT CNAME(BLASLONG n, FLOAT *x, BLASLONG inc_x, FLOAT *y, BLASLONG inc_y)
#endif
{
        BLASLONG i=0, j=0;
        double dot = 0.0 ;
        if ( n < 0 )  return(dot);

        FLOAT_V_T vr, vx, vy;
        unsigned int gvl = 0;
        FLOAT_V_T_M1 v_res, v_z0;
        gvl = VSETVL_MAX;
        v_res = VFMVVF_FLOAT_M1(0, gvl);
        v_z0 = VFMVVF_FLOAT_M1(0, gvl);

        if(inc_x == 1 && inc_y == 1){
                gvl = VSETVL(n);
                vr = VFMVVF_FLOAT(0, gvl);
                for(i=0,j=0; i<n/gvl; i++){
                        vx = VLEV_FLOAT(&x[j], gvl);
                        vy = VLEV_FLOAT(&y[j], gvl);
                        vr = VFMACCVV_FLOAT(vr, vx, vy, gvl);
                        j += gvl;
                }
......
}

The parameter n may be zero, and when n is zero, gvl is also zero, so the loop condition i<n/gvl evaluates 0/0. In qemu-user the result of 0/0 is -1, with no divide-by-zero exception raised, which is a bit strange but is what happens; I don't know whether this is deliberate or has some other reason. With that quotient, the loop for(i=0,j=0; i<n/gvl; i++) can turn into an infinite loop. So this piece of code lacks handling for the case where n is zero, and that is what leads to this issue.
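The n == 0 case described above can be reproduced without vector hardware. The following is only a minimal sketch, not OpenBLAS code: riscv_div_model, vsetvl_model and the two full_chunks_* helpers are hypothetical names introduced here. It assumes the -1 quotient reported for qemu-user follows the RISC-V rule that integer division by zero returns all ones instead of trapping, and it simplifies VSETVL(n) to min(n, vlmax).

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical model of the RISC-V `div` instruction: dividing by zero
 * does not trap, the quotient is all ones (-1). The signed-overflow case
 * returns the dividend. Plain C `a / 0` would be undefined behaviour,
 * which is why the zero case is spelled out by hand. */
static long riscv_div_model(long a, long b)
{
    if (b == 0) return -1;
    if (a == LONG_MIN && b == -1) return LONG_MIN;
    return a / b;
}

/* Simplified stand-in for VSETVL(n): the granted vector length is
 * assumed to be min(n, vlmax), so n == 0 yields gvl == 0. */
static long vsetvl_model(long n, long vlmax)
{
    assert(n >= 0 && vlmax > 0);
    return n < vlmax ? n : vlmax;
}

/* Loop bound as written in the quoted kernel: n / gvl with no n == 0 check. */
static long full_chunks_unguarded(long n, long vlmax)
{
    long gvl = vsetvl_model(n, vlmax);
    return riscv_div_model(n, gvl);   /* n == 0 gives 0/0, modelled as -1 */
}

/* Same bound with an early exit for n < 1, so gvl is never zero. */
static long full_chunks_guarded(long n, long vlmax)
{
    if (n < 1) return 0;
    return n / vsetvl_model(n, vlmax);
}
```

For n = 0 the unguarded bound comes out as -1 in this model, while the guarded version returns 0; for any n >= 1 both agree.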

Why there is no problem at -O1 while the error occurs at -O2

Because of optimization, the code generated at the loop header differs between -O2 and -O1. The purpose of the loop header is to check whether n/gvl is less than or equal to zero, so that the loop for(i=0,j=0; i<n/gvl; i++) can be skipped entirely. The code generated at -O1 is:

div     t3,a0,a5
ble     t3,zero,.L23

The code generated at -O2 is:

div     t4,a0,t1
beq     t4,zero,.L23

Since it can be inferred from the context that n is greater than or equal to 0, and gvl is also greater than or equal to 0 because it is an unsigned type, ble can be considered equivalent to beq here. So the code generation is not the problem. It is the missing n == 0 handling, combined with QEMU's division behavior, that causes the inconsistency between the -O1 and -O2 options.
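To make the difference between the two loop headers concrete, here is a small sketch of the two branch conditions in C. It is an illustration only: skips_loop_ble and skips_loop_beq are hypothetical names, and each function just restates the condition under which the quoted ble or beq instruction jumps past the loop.

```c
#include <assert.h>

/* -O1 header: `ble t3,zero,.L23` skips the loop when the quotient is <= 0. */
static int skips_loop_ble(long quotient)
{
    return quotient <= 0;
}

/* -O2 header: `beq t4,zero,.L23` skips the loop only when the quotient is 0. */
static int skips_loop_beq(long quotient)
{
    return quotient == 0;
}
```

For every quotient >= 0 the two checks give the same answer, which is why the compiler is entitled to pick either. They only disagree for a negative quotient such as the -1 produced by 0/0: the ble form still skips the loop, the beq form falls into it.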

martin-frbg commented 1 year ago

Thank you very much for the detailed explanation. I must admit I had already noticed the "n < 0" conditional (which was changed to "n < 1" in all other variations of the DOT implementation some time ago) but did not realize its implications for the loop. And I had not committed my serendipitous fix as I am still working on fixing the loss of precision in DSDOT :(
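As an illustration of the "n < 1" form of the guard, the sketch below is a scalar model of a strip-mined dot product. It is not the kernel from dot_vector.c and not the committed fix: dot_model and its vlmax parameter are hypothetical, the inner loop stands in for one vector multiply-accumulate, and the chunk length is assumed to be min(n, vlmax).

```c
#include <assert.h>

/* Scalar model of a strip-mined dot product with an early exit for n < 1.
 * vlmax plays the role of the maximum vector length (an assumption of
 * this sketch); products are accumulated in double. */
static double dot_model(long n, const float *x, const float *y, long vlmax)
{
    double dot = 0.0;
    if (n < 1) return dot;              /* covers n == 0 and n < 0 */
    assert(vlmax > 0);

    long gvl = n < vlmax ? n : vlmax;   /* at least 1 past the guard */
    long j = 0;
    for (long i = 0; i < n / gvl; i++) {
        /* one "vector" step over gvl elements */
        for (long k = 0; k < gvl; k++)
            dot += (double)x[j + k] * y[j + k];
        j += gvl;
    }
    for (; j < n; j++)                  /* remaining tail elements */
        dot += (double)x[j] * y[j];
    return dot;
}
```

Because the guard returns before gvl is computed, n / gvl is never evaluated with a zero divisor, so the loop bound no longer depends on how the platform handles division by zero.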

xianyi commented 1 year ago

Is it a bug in zaxpy (e14a025)?

martin-frbg commented 1 year ago

Not aware of anything wrong with zaxpy (at least nothing in the BLAS tests, have not had time to study the LAPACK test log). The issue here was that the tests would time out due to the infinite loop in dot_vector.c as Cooper-Qu analyzed above, solved by last night's commit.