clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL
Apache License 2.0

[OSX] test-short: 1713 tests fail (10.9.3, AMD FirePro D300) #37

Closed gicmo closed 10 years ago

gicmo commented 10 years ago

The system is a Mac Pro (Late 2013) with two AMD FirePro D300 GPUs. More detailed info is below.

% ./staging/test-short --gtest_filter="-*nrm2*"
[----------] Global test environment tear-down
[==========] 9792 tests from 122 test cases ran. (90050 ms total)
[  PASSED  ] 8079 tests.
[  FAILED  ] 1713 tests, listed below:

The full list can be found at the gist here

Most tests seem to fail with error -43 aka CL_INVALID_BUILD_OPTIONS:

Calling clblas xNRM2 routine...
========================================================

AN INTERNAL KERNEL BUILD ERROR OCCURRED!
device name = AMD Radeon HD - FirePro D300 Compute Engine
error = -43

The NRM2 test actually crashes the test program, so I had to exclude it:

[  FAILED  ] SelectedSmall0_NRM2/NRM2.dnrm2/0, where GetParam() = (61, 4, 0, 1, 1) (5 ms)
[ RUN      ] SelectedSmall0_NRM2/NRM2.dnrm2/1
N = 61, offx = 0, incx = -11
offNRM2 = 1
queues = 1
number of command queues : 1

Generating input data... Done
Process 44384 stopped
* thread #1: tid = 0xcd0c5, 0x00007fff98480cc3 libBLAS.dylib`cblas_dnrm2 + 243, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x9038393a8)
    frame #0: 0x00007fff98480cc3 libBLAS.dylib`cblas_dnrm2 + 243
libBLAS.dylib`cblas_dnrm2 + 243:
-> 0x7fff98480cc3:  movsd  (%rsi), %xmm2
   0x7fff98480cc7:  andpd  %xmm1, %xmm2
   0x7fff98480ccb:  ucomisd %xmm4, %xmm2
   0x7fff98480ccf:  ja     0x7fff98480cf0            ; cblas_dnrm2 + 288
(lldb) bt
* thread #1: tid = 0xcd0c5, 0x00007fff98480cc3 libBLAS.dylib`cblas_dnrm2 + 243, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x9038393a8)
  * frame #0: 0x00007fff98480cc3 libBLAS.dylib`cblas_dnrm2 + 243
    frame #1: 0x0000000100007609 test-short`dnrm2(n=61, x=0x0000000103839400, incx=-11) + 41 at blas-lapack.c:848
    frame #2: 0x00000001003984cb test-short`blasDnrm2(N=61, X=0x0000000103839400, offx=0, incx=-11) + 59 at blas.c:4945
    frame #3: 0x000000010039e08b test-short`clMath::blas::nrm2(N=61, X=0x0000000103839400, offx=0, incx=-11) + 43 at blas-wrapper.cpp:2439
    frame #4: 0x00000001002a7381 test-short`void nrm2CorrectnessTest<double, double>(params=0x00007fff5fbfe9f0) + 1745 at corr-nrm2.cpp:124
    frame #5: 0x00000001002a43cb test-short`NRM2_dnrm2_Test::TestBody(this=0x0000000102d05670) + 43 at corr-nrm2.cpp:203
    frame #6: 0x00000001003ea0a3 test-short`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 131
    frame #7: 0x00000001003d4677 test-short`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 119
    frame #8: 0x00000001003ad9f5 test-short`testing::Test::Run() + 197
    frame #9: 0x00000001003aeccb test-short`testing::TestInfo::Run() + 219
    frame #10: 0x00000001003afbf7 test-short`testing::TestCase::Run() + 231
    frame #11: 0x00000001003bc6e8 test-short`testing::internal::UnitTestImpl::RunAllTests() + 952
    frame #12: 0x00000001003e7033 test-short`bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 131
    frame #13: 0x00000001003d6f07 test-short`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 119
    frame #14: 0x00000001003bc2a6 test-short`testing::UnitTest::Run() + 422
    frame #15: 0x00000001002f7d51 test-short`RUN_ALL_TESTS() + 17 at gtest.h:2288
    frame #16: 0x00000001002d7ab7 test-short`main(argc=1, argv=0x00007fff5fbff498) + 1015 at test-correctness.cpp:3397

More detailed system info:

% sw_vers
ProductName:    Mac OS X
ProductVersion: 10.9.3
BuildVersion:   13D65

clinfo output can be found here

gicmo commented 10 years ago

I have investigated the -43 (CL_INVALID_BUILD_OPTIONS) kernel build error by trying build options (samples taken from clBLAS) in Apple's OpenCL hello world example. The result is quite funny (NB: the last line has two whitespace characters between -g and -DPACKED):

clBuildProgram(program, 0, NULL, " -g ", NULL, NULL); -> Computed '1024/1024' correct values!
clBuildProgram(program, 0, NULL, "-g -DPACKED ", NULL, NULL); -> Computed '1024/1024' correct values!
clBuildProgram(program, 0, NULL, " -g -DPACKED", NULL, NULL); -> Computed '1024/1024' correct values!
clBuildProgram(program, 0, NULL, " -g  -DPACKED", NULL, NULL); -> Error: Failed to build program executable! -43

So it seems the two consecutive whitespace characters are causing the build error. We get them basically every time we have two options, because each individual option is appended with a leading and a trailing whitespace: strcat( buildOptStr, " -DDOUBLE_PRECISION ");

gicmo commented 10 years ago

Without the patch in PR #38 we have 1449 occurrences of the build options error (two fewer than the count in the log below, because two actual numbers in the log start with -43):

% ./staging/test-short --gtest_filter="-*nrm2*" 2>1 > test0.log
% cat test0.log | grep -- '-43' | wc -l                                                                                                  
    1451

With the patch applied, only the two numbers remain and the failed test count goes down to 325:

% cat test6.log | grep -- '-43'                                                                                                          
676:((a).s[0]) evaluates to -438436403418805,
677:((b).s[0]) evaluates to -438436190110975, and
[----------] Global test environment tear-down
[==========] 9792 tests from 122 test cases ran. (407823 ms total)
[  PASSED  ] 9467 tests.
[  FAILED  ] 325 tests, listed below:
 [ ... ]
325 FAILED TESTS

The full list of failing tests is here. The tests that still fail seem to diverge from the expected result, e.g.:

Initialize OpenCL and clblas...
---- AMD
---- AMD
SetUp: about to create command queues

Test environment:

Device name: AMD Radeon HD - FirePro D300 Compute Engine
Device vendor: AMD
Platform (bit): Apple OS X
clblas version: 2.1.0
Driver version: 1.2 (May  2 2014 23:41:16)
Device version: OpenCL 1.2
Global mem size: 2048 MB
---------------------------------------------------------

Note: Google Test filter = ColumnMajor_SmallRange/GEMM.dgemm/24
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ColumnMajor_SmallRange/GEMM
[ RUN      ] ColumnMajor_SmallRange/GEMM.dgemm/24
clblasColumnMajor, clblasTrans, clblasNoTrans
M = 63, N = 63, K = 63
offA = 0, offB = 0, offC = 0
lda = 63, ldb = 63, ldc = 63
seed = 12345
queues = 1
Generating input data... Done
Calling reference xGEMM routine... Done
Calling clblas xGEMM routine... Done
m : 2    n: 3
/Users/gicmo/Coding/clBLAS/src/tests/include/matrix.h:327: Failure
The difference between a and b is 122429370, which exceeds delta, where
a evaluates to 151881286308933,
b evaluates to 151881163879563, and
delta evaluates to 0.
[  FAILED  ] ColumnMajor_SmallRange/GEMM.dgemm/24, where GetParam() = (1, 1, 0, 63, 63, 63, 48-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 1) (8 ms)
[----------] 1 test from ColumnMajor_SmallRange/GEMM (8 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (8 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ColumnMajor_SmallRange/GEMM.dgemm/24, where GetParam() = (1, 1, 0, 63, 63, 63, 48-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 1)

 1 FAILED TEST

gicmo commented 10 years ago

I was able to reproduce the crash of the nrm2 tests on a different machine and have narrowed it down to calling the (normal, i.e. host) BLAS function with a negative incx parameter.

gicmo commented 10 years ago

After applying the guards from patch dbe7741 to fix the nrm2 tests, I got an additional 10 tests that failed on the machine with the GeForce GT 650M. I could fix them all by calling the cblas_* functions instead of the FORTRAN-style ones. With all the patches from PR #39 applied, all tests pass on this machine:

[==========] 8068 tests from 116 test cases ran. (256674 ms total)
[  PASSED  ] 8068 tests

I will have to wait for Monday to see what tests are still failing on the MacPro with the FirePro D300 with all the patches applied.

kknox commented 10 years ago

What host lapack libraries were you using on your test machines? 'Accelerate' on OSX? What about the machine with the Nvidia card in it?

Would you agree that the APIs should be able to handle negative values for incx? The netlib reference code seems to have special logic to handle N < 1 and incx < 1.

I am willing to accept the pull request, but I think we should have all of this documented. Possibly, if these are faults of the host libraries, we should file bugs in their issue trackers.

gicmo commented 10 years ago

I indeed use Accelerate on all OSX machines (the MacBook Pro with an Intel and an Nvidia card in it, and the MacPro which has two FirePro D300).

Yes, I think we should treat netlib's BLAS as the reference and expect others to behave like this too.

Since the tests seem to have been working before, maybe something has changed in the Accelerate framework so that it no longer handles the input correctly? I have filed a bug with Apple's Bug Reporter (17341378) about it anyway.

Do you want me to add some comments to the source next to the if, stating that we are doing that because of this issue?

kknox commented 10 years ago

Yes, please. Put your issue number in the comments, so that future maintainers know when it's safe to remove the #defines.

Thanks for already filing a bug in the Apple tracker.

gicmo commented 10 years ago

Done. Hope it is sufficient. I have also put the sample program I used for the Apple bug report in a little gist here.

kknox commented 10 years ago

This issue is closed by merging in Pull Requests #38 & #39.

gicmo commented 10 years ago

The crash (Apple Bug id 17341378) in nrm2 on OS X has been fixed in OS X Yosemite.