OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Problem with ARMv7 (cblas_ddot) #1168

Closed misbullah closed 6 years ago

misbullah commented 7 years ago

Hi,

Recently, I used OpenBLAS with the Kaldi toolkit on Android. OpenBLAS works on the Intel Atom platform, and I get the desired result from Kaldi.

When I build OpenBLAS for armv7-a, the build reports no errors, but I get an error when I run the executable. The error is shown below.

    LOG (dnn_batch_static:void kaldi::IvectorExtractor::ComputeDerivedVars()():ivector/ivector-extractor.cc:183) Computing derived variables for iVector extractor
    WARNING (dnn_batch_static:void kaldi::TpMatrix::Cholesky(const kaldi::SpMatrix&) [with Real = double]():matrix/tp-matrix.cc:110) Cholesky decomposition failed. Maybe matrix is not positive definite. Throwing error
    terminate called after throwing an instance of 'std::runtime_error'
      what(): Cholesky decomposition failed.
    Aborted

I traced the matrix/tp-matrix.cc code and found that the error is raised from this template:

    template<typename Real>
    void TpMatrix<Real>::Cholesky(const SpMatrix<Real> &orig) {
      KALDI_ASSERT(orig.NumRows() == this->NumRows());
      MatrixIndexT n = this->NumRows();
      this->SetZero();
      Real *data = this->data_, *jdata = data;  // start of j'th row of matrix.
      const Real *orig_jdata = orig.Data();  // start of j'th row of matrix.
      for (MatrixIndexT j = 0; j < n; j++, jdata += j, orig_jdata += j) {
        Real *kdata = data;  // start of k'th row of matrix.
        Real d(0.0);
        for (MatrixIndexT k = 0; k < j; k++, kdata += k) {
          Real s = cblas_Xdot(k, kdata, 1, jdata, 1);
          // (*this)(j, k) = s = (orig(j, k) - s)/(*this)(k, k);
          jdata[k] = s = (orig_jdata[k] - s)/kdata[k];
          d = d + s*s;
        }
        // d = orig(j, j) - d;
        d = orig_jdata[j] - d;

        if (d >= 0.0) {
          // (*this)(j, j) = std::sqrt(d);
          jdata[j] = std::sqrt(d);
        } else {
          KALDI_WARN << "Cholesky decomposition failed. Maybe matrix "
              "is not positive definite. Throwing error";
          throw std::runtime_error("Cholesky decomposition failed.");
        }
      }
    }
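To see why a misbehaving dot product surfaces as "Cholesky decomposition failed", here is a minimal, self-contained sketch of the same row-by-row loop on a packed lower-triangular matrix. It is not Kaldi's code: `packed_cholesky`, `plain_dot` and `DotFn` are hypothetical names introduced for illustration, and the dot product is passed in as a function pointer so that a correct and a broken implementation can be compared.

```cpp
#include <cmath>
#include <vector>

// Function-pointer type with the same parameter order as cblas_ddot.
// (Hypothetical alias, introduced only for this sketch.)
using DotFn = double (*)(int, const double *, int, const double *, int);

// Plain-loop dot product used as a trusted reference.
inline double plain_dot(int n, const double *x, int incx,
                        const double *y, int incy) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i) sum += x[i * incx] * y[i * incy];
  return sum;
}

// Cholesky factorization of a symmetric matrix stored as packed
// lower-triangular rows: (0,0), (1,0), (1,1), (2,0), ...
// Returns false at the point where the quoted routine would throw.
inline bool packed_cholesky(const std::vector<double> &orig, int n,
                            DotFn dot, std::vector<double> *out) {
  out->assign(orig.size(), 0.0);
  double *data = out->data(), *jdata = data;  // start of j'th row
  const double *orig_jdata = orig.data();     // start of j'th row
  for (int j = 0; j < n; j++, jdata += j, orig_jdata += j) {
    double *kdata = data;  // start of k'th row
    double d = 0.0;
    for (int k = 0; k < j; k++, kdata += k) {
      double s = dot(k, kdata, 1, jdata, 1);
      jdata[k] = s = (orig_jdata[k] - s) / kdata[k];
      d += s * s;
    }
    d = orig_jdata[j] - d;
    if (d < 0.0) return false;  // a garbage value of s ends up here
    jdata[j] = std::sqrt(d);
  }
  return true;
}
```

With `plain_dot` the 2x2 matrix {4, 2, 3} (packed) factors without trouble. If the dot product returns an arbitrary value instead, `d` goes negative on the second row and the routine reports failure even though the input is positive definite.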

Then I checked the cblas_Xdot function, which is defined in the cblas-wrapper.h file:

    inline double cblas_Xdot(const int N, const double *const X, const int incX,
                             const double *const Y, const int incY) {
      return cblas_ddot(N, X, incX, Y, incY);
    }

I think cblas_ddot is called from the OpenBLAS library and operates on floating-point values. When I build the library for Intel Atom, the problem does not happen. It happens only on ARMv7-a.
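One way to test this suspicion directly, independent of Kaldi, is to compare the library's dot product against a plain loop. The sketch below is self-contained and takes the function under test as a pointer. `reference_dot`, `max_dot_error` and `DotFn` are hypothetical names, and passing `cblas_ddot` from OpenBLAS's cblas.h in place of the reference assumes the default 32-bit `blasint`.

```cpp
#include <cmath>
#include <vector>

// Function-pointer type with the same parameter order as cblas_ddot.
// (Hypothetical alias, introduced only for this sketch.)
using DotFn = double (*)(int, const double *, int, const double *, int);

// Reference dot product: a plain loop with no BLAS involved.
inline double reference_dot(int n, const double *x, int incx,
                            const double *y, int incy) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i) sum += x[i * incx] * y[i * incy];
  return sum;
}

// Largest absolute difference between `dot` and the reference loop
// over all vector lengths 0..max_n.
inline double max_dot_error(DotFn dot, int max_n) {
  std::vector<double> x(max_n), y(max_n);
  for (int i = 0; i < max_n; ++i) {
    x[i] = 0.5 * (i + 1);
    y[i] = 1.0 / (i + 1);
  }
  double worst = 0.0;
  for (int n = 0; n <= max_n; ++n) {
    double diff = std::fabs(dot(n, x.data(), 1, y.data(), 1) -
                            reference_dot(n, x.data(), 1, y.data(), 1));
    if (diff > worst) worst = diff;
  }
  return worst;
}
```

In a real check one would call `max_dot_error(cblas_ddot, 64)` in a small program linked against the ARMv7 build. A result far above rounding error would point at the library rather than at Kaldi.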

I use the following command to build OpenBLAS for ARMv7-a:

make TARGET=ARMV7 HOSTCC=gcc CC=arm-linux-androideabi-gcc NOFORTRAN=1 NUM_THREADS=4 ARM_SOFTFP_ABI=1 libs

I use -mfpu=neon to compile my executable in Kaldi.

How can I solve this problem?

Thanks, Alim

martin-frbg commented 7 years ago

I may be wrong, but I think for ARM_SOFTFP_ABI=1 you will need to check out the "arm_soft_fp_abi" branch, as current "develop" and all releases have hardfp code for ARMv7 only. What is your hardware, by the way? Any chance you could build for 64-bit ARMv8 instead?

misbullah commented 7 years ago

Hi @martin-frbg,

Thanks for the reply.

I am actually using OpenBLAS from the arm_soft_fp_abi branch.

I tested it on my Samsung Note 3, which has the following specs:

    Processor       : ARMv7 Processor rev 0 (v7l)
    processor       : 0
    BogoMIPS        : 38.40

    processor       : 1
    BogoMIPS        : 38.40

    processor       : 2
    BogoMIPS        : 38.40

    processor       : 3
    BogoMIPS        : 38.40

    Features        : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
    CPU implementer : 0x51
    CPU architecture: 7
    CPU variant     : 0x2
    CPU part        : 0x06f
    CPU revision    : 0

    Hardware        : Qualcomm MSM8974
    Revision        : 0008

If I build it for ARMV8, will it run on my device?

I am afraid it will not.

For your information, I use the following flags when I build my executable:

    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -O2 -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 -mthumb -fvisibility=hidden -ffunction-sections -fdata-sections -pie -fPIE -Wl,--gc-sections,--icf=safe -flto")
    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O2 -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 -mthumb -fvisibility=hidden -ffunction-sections -fdata-sections -pie -fPIE -Wl,--gc-sections,--icf=safe -flto")

I think there may be a bug in cblas_ddot, but I am not sure. I am not an expert in assembly code.

Thanks, Alim

martin-frbg commented 7 years ago

Indeed it seems to be 32bit armv7 only, and unfortunately it also looks as if xianyi has not gotten around to doing the softfp implementation of ddot on that branch yet, only sdot.

martin-frbg commented 7 years ago

Looking at the diff for that change to sdot_vfp.S, it seems the "only" change is a vmov of the result to a different register, so perhaps you could try copying the single line from the "ifdef DSDOT" case of sdot_vfp.S to the equivalent position at the end of ddot_vfp.S...

misbullah commented 7 years ago

Hi @martin-frbg,

I tried to modify ddot_vfp.S in the kernel/arm directory, using sdot_vfp.S as a reference, as follows.

    #if defined(DDDOT)

        vadd.f64        d0 , d0, d1                             // set return value
    #ifdef ARM_SOFTFP_ABI
        vmov    r0, r1, d0
    #endif

    #else

        vadd.f32        s0 , s0, s1                             // set return value
    #ifdef ARM_SOFTFP_ABI
        vmov    r0, s0
    #endif

    #endif

        sub     sp, fp, #24
        pop     {r4 - r9, fp}
        bx      lr

        EPILOGUE

I then rebuilt the OpenBLAS library for ARMv7, but I still get the same error as in my first post.

Do I only need to modify the ddot_vfp.S file, or do other files need to be modified too?

Thanks, Alim

martin-frbg commented 7 years ago

Sorry, to clarify my idea was to copy just the

#ifdef ARM_SOFTFP_ABI
vmov r0, r1, d0
#endif

from lines 335 to 337 of xianyi's sdot_vfp.S and put it after the vadd.f64 d0 , d0, d1 // set return value in line 248 of ddot_vfp.S.
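Some background on why that one vmov matters: with -mfloat-abi=softfp the caller expects a double result in the core registers r0 and r1, not in the VFP register d0 where the kernel computes it. The sketch below only illustrates how a 64-bit double maps onto two 32-bit words (low word in r0, high word in r1 on a little-endian target). `CoreRegisterPair`, `split_double` and `join_double` are hypothetical names, and IEEE 754 doubles are assumed.

```cpp
#include <cstdint>
#include <cstring>

// Two 32-bit words standing in for the core registers r0 and r1.
// (Hypothetical type, for illustration only.)
struct CoreRegisterPair {
  std::uint32_t r0;  // low 32 bits of the double
  std::uint32_t r1;  // high 32 bits of the double
};

// Roughly what "vmov r0, r1, d0" does: copy the 64-bit value into two words.
inline CoreRegisterPair split_double(double value) {
  static_assert(sizeof(double) == sizeof(std::uint64_t),
                "this sketch expects a 64-bit double");
  std::uint64_t bits = 0;
  std::memcpy(&bits, &value, sizeof bits);
  CoreRegisterPair regs;
  regs.r0 = static_cast<std::uint32_t>(bits & 0xFFFFFFFFu);
  regs.r1 = static_cast<std::uint32_t>(bits >> 32);
  return regs;
}

// What a softfp caller effectively does with r0 and r1 after the call.
inline double join_double(CoreRegisterPair regs) {
  std::uint64_t bits =
      (static_cast<std::uint64_t>(regs.r1) << 32) | regs.r0;
  double value = 0.0;
  std::memcpy(&value, &bits, sizeof value);
  return value;
}
```

Without the vmov, the sum stays in d0 while the caller reassembles whatever happens to be in r0 and r1, which is consistent with the Cholesky routine seeing a meaningless dot product.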

misbullah commented 7 years ago

Hi @martin-frbg,

After I added the line as you suggested in the ddot_vfp.S file, that error is solved.

But now I get a segmentation fault when I run the executable.

I get the segmentation fault when I build the OpenBLAS library with the following command:

make TARGET=ARMV7 HOSTCC=gcc CC=arm-linux-androideabi-gcc NOFORTRAN=1 ARM_SOFTFP_ABI=1 libs

If I use the command below instead:

make TARGET=ARMV7 HOSTCC=gcc CC=arm-linux-androideabi-gcc NOFORTRAN=1 ARM_SOFTFP_ABI=1 NUM_THREADS=1 libs

I get this message: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.

Do you have any suggestions about this issue?

Thanks, Alim

martin-frbg commented 7 years ago

Do you get any indication where in the code the segmentation fault occurs (i.e., is it in the ddot_vfp.S that we just changed, or is it unrelated) ? When building with NUM_THREADS=1 you will restrict the size of an internal structure that keeps track of the threads to just one - this is most likely not what you want. Please try building with USE_THREAD=0 if you really want a single-threaded OpenBLAS, or limit the number of threads at runtime by setting the environment variable OPENBLAS_NUM_THREADS to the desired maximum.

misbullah commented 7 years ago

Hi @martin-frbg,

I did not get any indication of where in the code the segmentation fault occurs. I think it is no longer related to ddot_vfp.S.

I tried to build with USE_THREAD=0, but I still get a segmentation fault. I also tried building with OPENBLAS_NUM_THREADS=8 (my phone has 4 cores), but I get a segmentation fault as well.

I don't know how to debug the error on a mobile device. I just used the following command to debug it:

adb shell logcat | ndk-stack -sym $PROJECT_PATH/obj/local/armeabi

which showed the following:

    ** Crash dump: **
    Build fingerprint: 'samsung/hlteztu/hlte:5.0/LRX21V/N900UZTUBOI1:user/release-keys'
    pid: 19210, tid: 19238, name: dnn_batch_stati >>> ./dnn_batch_static <<<
    signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xa78f926c
    Stack frame #00 pc 00187c10 /data/kaldi/dnn_batch_static
    Crash dump is completed

I don't think this shows any useful information.

Do you think I only need to modify the ddot_vfp.S file, or do other files need to be modified as well?

Thanks, Alim

martin-frbg commented 7 years ago

You could try building OpenBLAS with debugging symbols, perhaps this will show function names instead of raw addresses in the dump. Looking at how LAPACK dpotf2 does a Cholesky factorization, you will probably need a suitable dscal and dgemv in addition to ddot. Unfortunately it does not seem to be as simple to just spot and copy the fixes from xianyi's sscal/sgemv work - with the ddot we were just lucky that the sdot code already had an option to do the calculation with doubles.

ctgushiwei commented 7 years ago

@misbullah What version of OpenBLAS do you use, the develop version or master? I use version 0.2.19 and cannot test normally.

misbullah commented 7 years ago

@ctgushiwei What do you mean by "cannot test normally"?

I use OpenBLAS 0.2.19, the master version.

Currently, I still cannot use OpenBLAS on ARMv7 for some matrix operations such as Cholesky factorization.

Thanks, Alim

amiasato-zz commented 7 years ago

I have exactly the same problem. I had the Cholesky decomposition failure in Kaldi, and after the suggested edit in ddot_vfp.S, I am getting a segmentation fault. I am putting the gdb backtrace here, and I will try to compile the executable with debugging symbols later, since I'm not exactly well-versed in Android development.

Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 2285.2303]
0xb6ee85d0 in axpy_kernel_S10 ()
(gdb) bt
#0  0xb6ee85d0 in axpy_kernel_S10 ()
#1  0xb6ec3fac in dspmv_U ()
#2  0xafc00000 in ?? ()

Just for the record, I'm trying to compile OpenBLAS in the arm_soft_fp_abi branch.

martin-frbg commented 7 years ago

The few commits to the arm_soft_fp_abi branch are for single precision calculations only, and ddot is/was the only case where it is trivial to copy the required change (as there is special code in sdot for doing the calculation with doubles as well, and xianyi already changed it for soft fp). Everything else needs a developer with some knowledge of ARM assembly (which I am not).

ctgushiwei commented 7 years ago

@misbullah I can compile the 0.2.19 release version successfully, but when I test cblas_sgemm, I get a segmentation fault. I have found the reason: the code in openblas_0.2.19/kernel/ cannot be compiled to .o files. I do not know how to solve this problem.