OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

sgemm_kernel crash #1134

Closed gangliao closed 7 years ago

gangliao commented 7 years ago

Hi,

We use OpenBLAS as a third-party library in PaddlePaddle. Occasionally, a user gets an sgemm_kernel error when running our demo.

Any idea how to fix it?

Thanks a lot.

[I 10:03:53.531 NotebookApp] Kernel started: 355ddf25-33ed-49ed-a397-c43d10f22651
I0322 10:04:22.283298 13 Util.cpp:160] commandline: --use_gpu=False --trainer_count=1
I0322 10:04:23.307060 13 GradientMachine.cpp:86] Initing parameters..
I0322 10:04:23.307096 13 GradientMachine.cpp:93] Init parameters done.
Thread [139674572465920] Forwarding __fc_layer_0__, x,
*** Aborted at 1490177067 (unix time) try "date -d @1490177067" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGILL (@0x7f08633e4522) received by PID 13 (TID 0x7f088546a700) from PID 1665025314; stack trace: ***
@ 0x7f0884c24890 (unknown)
@ 0x7f08633e4522 sgemm_kernel
[I 10:04:29.532 NotebookApp] KernelRestarter: restarting kernel (1/5)
WARNING:root:kernel 355ddf25-33ed-49ed-a397-c43d10f22651 restarted

martin-frbg commented 7 years ago

Is this a single occurrence, or do you get similar reports from other users as well? From your issue 248 I take it this is on an i5-2450M ("SandyBridge" core) running Ubuntu 16.04. Do you supply your own copy of OpenBLAS with your software, or is this with whatever version 16.04 ships by default? (Perhaps it would even make sense to have your user check with update-alternatives that it is actually OpenBLAS they are using, rather than netlib or ATLAS.)

In my experience, SIGILL can mean either a genuine instruction that the CPU is not capable of handling (unlikely unless this is an OpenBLAS that was built specifically for TARGET=HASWELL) or stack corruption creating absurd return addresses for a function. In the latter case it would help to know whether the problem is reproducible on another system (preferably one not using the same Ubuntu 16.04), or to have a minimal self-contained example (I assume your demo does a lot more than just the problematic sgemm call).

gangliao commented 7 years ago

@martin-frbg This problem is from https://github.com/PaddlePaddle/book/issues/248.

Actually, this is a Docker image. As the user said:

I just test it on a Microsoft Azure Ubuntu 16.04 instance and it works.
It is most probably a missing instruction on my laptop.

For PaddlePaddle, we build OpenBLAS as an external CMake project: https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/external/openblas.cmake#L48

How can we build a more generic libopenblas.a?

martin-frbg commented 7 years ago

That cmake file would indeed build an OpenBLAS that is tailored to the CPU of the build system. Please add "DYNAMIC_ARCH=1" to the build flags to get a (bigger) libopenblas.a with support for a range of x86 CPUs (and built-in code to select the most appropriate one at runtime). If library size is a concern, build instead for the oldest, least sophisticated CPU you expect to encounter, e.g. TARGET=NEHALEM.
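
One way to check which kernel set a given OpenBLAS actually uses on a machine is to ask the library itself. The following is only a sketch: it assumes a shared libopenblas is installed where the dynamic loader can find it (it does not apply to a statically linked libopenblas.a such as the one Paddle builds), and it assumes the library exports openblas_get_corename, which is declared in OpenBLAS's cblas.h. The helper name openblas_corename is made up for this example.

```python
import ctypes
import ctypes.util
from typing import Optional


def openblas_corename(library: Optional[str] = None) -> Optional[str]:
    """Return the core name reported by a shared OpenBLAS, or None.

    `library` is an optional path or soname; by default the loader is
    asked to locate "openblas". None is returned when no usable shared
    library (or no openblas_get_corename symbol) is found.
    """
    path = library or ctypes.util.find_library("openblas")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        func = lib.openblas_get_corename
    except (OSError, AttributeError):
        return None
    # The C prototype is: char *openblas_get_corename(void);
    func.restype = ctypes.c_char_p
    func.argtypes = []
    name = func()
    return name.decode() if name else None


if __name__ == "__main__":
    print(openblas_corename() or "no shared OpenBLAS found")
```

With a DYNAMIC_ARCH build the reported name should be the core type picked at runtime, while a single-target build should report the target it was compiled for.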

gangliao commented 7 years ago

Thanks for your suggestion, really helpful!

brada4 commented 7 years ago

You also need to install the Ubuntu cblas wrapper and probably some libblas-dev package, so that the Paddle build system detects cblas and skips making a broken local build.

brada4 commented 7 years ago

I examined the other issue. Can you get /proc/cpuinfo (the last core is enough) from inside that particular Docker container?
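
For anyone collecting this information, the relevant part of /proc/cpuinfo is the "flags" line. A minimal sketch that reads the last core's flags and reports a few SIMD extensions is shown below. It assumes an x86 Linux system (other architectures name this line differently), and the list of flags to report is an illustrative choice, not something taken from this thread or from OpenBLAS.

```python
import os
from typing import Dict, Set

# Illustrative selection of x86 flags as spelled in /proc/cpuinfo.
FLAGS_OF_INTEREST = ("sse2", "ssse3", "sse4_1", "sse4_2", "avx", "avx2", "fma")


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the flag set of the last core listed in the given text."""
    flags: Set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags = set(value.split())  # later cores overwrite earlier ones
    return flags


def report(cpuinfo_text: str) -> Dict[str, bool]:
    """Map each flag of interest to whether the last core advertises it."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in FLAGS_OF_INTEREST}


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as fh:
        for name, present in report(fh.read()).items():
            print(f"{name:8s} {'yes' if present else 'no'}")
```

Running it both on the host and inside the container shows whether the two environments see the same instruction-set flags.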

gangliao commented 7 years ago

https://github.com/PaddlePaddle/Paddle/issues/1697 @brada4 Do we also need to build cblas?

brada4 commented 7 years ago

The invalid instruction comes from a single-architecture build, since a plain sgemm_kernel symbol is not present in a DYNAMIC_ARCH build. There you find instead:

sgemm_kernel_ATOM
sgemm_kernel_BARCELONA
sgemm_kernel_BOBCAT
sgemm_kernel_BULLDOZER
sgemm_kernel_CORE2
sgemm_kernel_DUNNINGTON
sgemm_kernel_EXCAVATOR
sgemm_kernel_HASWELL
sgemm_kernel_NANO
sgemm_kernel_NEHALEM
sgemm_kernel_OPTERON
sgemm_kernel_OPTERON_SSE3
sgemm_kernel_PENRYN
sgemm_kernel_PILEDRIVER
sgemm_kernel_PRESCOTT
sgemm_kernel_SANDYBRIDGE
sgemm_kernel_STEAMROLLER
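
The same distinction can be checked mechanically from a symbol listing. The sketch below classifies the text output of a tool such as nm: a plain sgemm_kernel symbol suggests a single-architecture build, while upper-case suffixed variants like the ones listed above suggest DYNAMIC_ARCH. It assumes Linux-style symbol names without a leading underscore, and the function name classify_build and its return labels are invented for this example.

```python
from typing import List, Tuple

PREFIX = "sgemm_kernel_"


def classify_build(nm_output: str) -> Tuple[str, List[str]]:
    """Classify an OpenBLAS build from a symbol listing.

    Returns ("dynamic-arch", [targets]), ("single-arch", []) or
    ("unknown", []) depending on which sgemm_kernel symbols appear.
    """
    generic = False
    targets = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        symbol = fields[-1]  # nm prints the symbol name last
        if symbol == "sgemm_kernel":
            generic = True
        elif symbol.startswith(PREFIX):
            suffix = symbol[len(PREFIX):]
            # Per-CPU kernels carry an upper-case target suffix.
            if suffix and suffix == suffix.upper():
                targets.add(suffix)
    if targets:
        return "dynamic-arch", sorted(targets)
    if generic:
        return "single-arch", []
    return "unknown", []
```

It could be fed, for example, the captured output of running nm on libopenblas.a; how that output is obtained is left to the caller.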

Is the point of building for fewer targets to save build time?

Since you install numpy (seen in your Dockerfile), I would suggest installing libblas-dev, libcblas? and libopenblas-dev (0.2.18 if you stay with Ubuntu 16.04 LTS) and selecting OpenBLAS as libblas.so.3 using update-alternatives. Also check in all build logs that you link only to libblas (the one that redirects, not the reference implementation) and not to any specific implementation of BLAS.

martin-frbg commented 7 years ago

@brada4 I think the original issue is clear by now. I take it you want to discourage them from building OpenBLAS themselves and have them rely on the older version provided by Ubuntu instead? @gangliao Which combination of options did you use for the Docker build that failed? And if you used the same source tree as before, did you run "make clean" first to remove potentially incompatible files from the previous build?

brada4 commented 7 years ago

@martin-frbg Indeed. They have the Ubuntu numpy, which means they will have two BLAS implementations in the same process (one self-built, the other selected via update-alternatives). I could add to the FAQ how to add the latest OpenBLAS to the Debian and Ubuntu LTS alternatives, hmm?

Another thing: the Docker container may be running inside a virtual machine, and it is hard to guess whether that is KVM or a qemu emulator. (Hyper-V nesting KVM, as on Azure, already works fine.)