clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL
Apache License 2.0

SEGV for test-correctness and test-short #236

Closed Gijom closed 8 years ago

Gijom commented 8 years ago

Hello,

I am facing FAILED tests and a SEGV when executing either test-correctness or test-short executables.

Here is the output of test-correctness:

Initialize OpenCL and clblas...
---- Advanced Micro Devices, Inc.
SetUp: about to create command queues
Kernel Cache limit: 256 MB

Test environment:

Device name: Oland
Device vendor: Advanced Micro Devices, Inc.
Platform (bit): Linux
clblas version: 2.11.0
Driver version: 1912.5 (VM)
Device version: OpenCL 1.2 AMD-APP (1912.5)
Global mem size: 784 MB
---------------------------------------------------------
.
.
the FAILED error below, reported at matrix.h for GEMM.sgemm, occurs several times
.
.
[ RUN      ] ColumnMajor_SmallRange/GEMM.sgemm/6560
             seed = 12345, queues = 1, clblasColumnMajor, clblasConjTrans, clblasConjTrans, M = 256, N = 256, K = 256, offA = 0, offB = 0, offC = 0, lda = 256, ldb = 256, ldc = 256
m : 0    n: 0
/home/chanel/src/extern/clBLAS/src/tests/include/matrix.h:327: Failure
The difference between a and b is 234671, which exceeds delta, where
a evaluates to -1,
b evaluates to -234672, and
delta evaluates to 0.
[  FAILED  ] ColumnMajor_SmallRange/GEMM.sgemm/6560, where GetParam() = (1, 2, 2, 256, 256, 256, 48-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 1) (10 ms)
[ RUN      ] ColumnMajor_SmallRange/GEMM.dgemm/0
             seed = 12345, queues = 1, clblasColumnMajor, clblasNoTrans, clblasNoTrans, M = 8, N = 8, K = 8, offA = 0, offB = 0, offC = 0, lda = 8, ldb = 8, ldc = 8
[1]    14292 segmentation fault (core dumped)  LD_LIBRARY_PATH=/opt/acml5.3.1/ifort64/lib:/opt/clBLAS/lib64 

After re-reading the output, it seems there are two problems here.

For test-short, only one of the sgemm tests FAILED:

[ RUN      ] ColumnMajor_SmallRange/GEMM.sgemm/39
             seed = 12345, queues = 1, clblasColumnMajor, clblasTrans, clblasTrans, M = 128, N = 128, K = 128, offA = 0, offB = 0, offC = 0, lda = 128, ldb = 128, ldc = 128
m : 0    n: 0
/home/chanel/src/extern/clBLAS/src/tests/include/matrix.h:327: Failure
The difference between a and b is 90788, which exceeds delta, where
a evaluates to -4,
b evaluates to -90792, and
delta evaluates to 0.
[  FAILED  ] ColumnMajor_SmallRange/GEMM.sgemm/39, where GetParam() = (1, 1, 1, 128, 128, 128, 48-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 1) (40 ms)
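
For readers unfamiliar with this message format: it is what googletest prints when a near-equality check (`EXPECT_NEAR` / `ASSERT_NEAR`) fails, that is, when |a - b| > delta. A minimal Python sketch of that check, using the values from the two sgemm failures in the logs above; the function name `exceeds_delta` is hypothetical and is not part of clBLAS or googletest:

```python
def exceeds_delta(a, b, delta):
    """Return (failed, diff) the way a near-equality check would judge a and b."""
    diff = abs(a - b)
    return diff > delta, diff

# Values copied from the two failing sgemm cases in the logs.
for a, b in [(-1, -234672), (-4, -90792)]:
    failed, diff = exceeds_delta(a, b, delta=0)
    print(f"a={a} b={b} diff={diff} failed={failed}")
```

With delta equal to 0 the two results must match exactly, and differences of this size are far beyond rounding noise, so one of the two compared results (the clBLAS output or the CPU reference) is simply wrong for that element.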

And the code also ends up in a SEGV for dgemm:

[ RUN      ] ColumnMajor_SmallRange/GEMM.dgemm/0
             seed = 12345, queues = 1, clblasColumnMajor, clblasNoTrans, clblasNoTrans, M = 63, N = 63, K = 63, offA = 0, offB = 0, offC = 0, lda = 63, ldb = 63, ldc = 63
[1]    14545 segmentation fault (core dumped)  LD_LIBRARY_PATH=/opt/acml5.3.1/ifort64/lib:/opt/clBLAS/lib64 ./test-short

Gijom commented 8 years ago

After a gdb run I realized the SIGSEGV comes from libacml.so:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff619bb9b in dmmavxalphablkb_ () from /opt/acml5.3.1/ifort64/lib/libacml.so

Possibly related to this issue: https://community.amd.com/thread/169180

tingxingdong commented 8 years ago

Hi, Gijom:

Thanks for the input. It seems that the error comes from ACML. ACML has not been updated for years, and various sources report errors with it. Instead of ACML, you can link against Netlib BLAS as the CPU reference library.

Search for "netlib" in /src/CMakeLists.txt for more details on how to switch to Netlib.

Gijom commented 8 years ago

Actually, this is what I did in the first place; I switched to ACML because I had compilation errors. The compilation error corresponds to #184.

However, I am sure that I am using Netlib BLAS:

$ repoquery -l blas-devel
/usr/lib/libblas.so
/usr/lib64/libblas.so

$ repoquery -l openblas-devel
/usr/include/openblas
/usr/include/openblas/cblas.h
/usr/include/openblas/f77blas.h
/usr/include/openblas/lapacke.h
/usr/include/openblas/lapacke_config.h
/usr/include/openblas/lapacke_mangling.h
/usr/include/openblas/lapacke_utils.h
/usr/include/openblas/openblas_config.h
/usr/lib64/libopenblas.so
/usr/lib64/libopenblas64.so
/usr/lib64/libopenblas64_.so
/usr/lib64/libopenblaso.so
/usr/lib64/libopenblaso64.so
/usr/lib64/libopenblaso64_.so
/usr/lib64/libopenblasp.so
/usr/lib64/libopenblasp64.so
/usr/lib64/libopenblasp64_.so

And with cmake: Netlib_BLAS_LIBRARY=/usr/lib64/libblas.so
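
One way to double-check which BLAS a test binary actually resolves at run time is to inspect its dynamic dependencies. The sketch below filters `ldd`-style output for BLAS-related libraries. The helper name `blas_dependencies` and the sample text are hypothetical, made up for illustration, and are not output from this system:

```python
import re

def blas_dependencies(ldd_output):
    """Return (soname, resolved path) pairs for BLAS-like libraries in ldd-style text."""
    # Lowercase match only, so libblas/libopenblas/libacml are reported
    # while libclBLAS itself is not.
    pattern = re.compile(r"^\s*(\S*(?:blas|acml)\S*)\s+=>\s+(\S+)", re.MULTILINE)
    return pattern.findall(ldd_output)

# Hypothetical sample in the format printed by `ldd`; not real output from this issue.
sample = """\
	libblas.so.3 => /usr/lib64/libblas.so.3 (0x00007f0000000000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f0000200000)
"""
print(blas_dependencies(sample))
```

On a real system the text could come from `subprocess.run(["ldd", "./test-short"], capture_output=True, text=True).stdout`. If a `libacml` entry still shows up there, the binary is still linked against ACML.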

Nevertheless, I consider this case closed and will continue on #184 if needed.

tingxingdong commented 8 years ago

See my solution to that error:

https://github.com/clMathLibraries/clBLAS/issues/238

kknox commented 8 years ago

Good job @tingxingdong