clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL
Apache License 2.0
843 stars 237 forks source link

[OSX] tests fail on the *CPU* with error -54 (CL_INVALID_WORK_GROUP_SIZE) #25

Closed gicmo closed 9 years ago

gicmo commented 11 years ago

It seems that tests consistently fail with the error -54 when executed on the CPU on OSX. Looking at the OpenCL Framework header, error code -54 stands for CL_INVALID_WORK_GROUP_SIZE (see below). I did some initial investigations on where this could come from. After some digging around, I checked the CL_DEVICE_MAX_WORK_ITEM_SIZES for the CPU and this is 1024, 1, 1 for CPU (as compared to 1024, 1024, 64 for the GPU). With the debugger I checked, and e.g. for the xGEMM test, and the local_work_size arguments to clEnqueueNDRangeKernel are 8, 8 (see below).

I reproduced this bug with a small test program and indeed any local_work_size[1] > 1 will fail with error -54. (Btw, I my opinion it should be CL_INVALID_WORK_ITEM_SIZE [-55] not CL_INVALID_WORK_GROUP_SIZE[-54] according to http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html)

I guess that the fix would be to make sure all the device constraints (e.g. CL_DEVICE_MAX_WORK_ITEM_SIZES) are fulfilled when calculating the work group sizes (I guess somewhere between getStepGranulation and subgGetDefaultDecomp but I have to dig deeper).

(lldb) print kernDesc->localThreads
(size_t [3]) $34 = {
  [0] = 8
  [1] = 8
  [2] = 808
}

====

Device name: Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
Device vendor: Intel
Platform (bit): Apple OS X
clblas version: 2.1.0
Driver version: 1.1
Device version: OpenCL 1.2
Global mem size: 16384 MB
---------------------------------------------------------

Note: Google Test filter = ColumnMajor_SmallRange/GEMM.cgemm/31
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ColumnMajor_SmallRange/GEMM
[ RUN      ] ColumnMajor_SmallRange/GEMM.cgemm/31
clblasColumnMajor, clblasTrans, clblasNoTrans
M = 128, N = 128, K = 128
offA = 0, offB = 0, offC = 0
lda = 128, ldb = 128, ldc = 128
seed = 12345
queues = 1
Generating input data... Done
Calling reference xGEMM routine... Done
Calling clblas xGEMM routine... /Users/gicmo/Coding/src/clBLAS/src/tests/correctness/corr-gemm.cpp:180: Failure
Value of: err
  Actual: -54
Expected: 0
::clMath::clblas::GEMM() failed
[  FAILED  ] ColumnMajor_SmallRange/GEMM.cgemm/31, where GetParam() = (1, 1, 0, 128, 128, 128, 48-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 1) (97 ms)
[----------] 1 test from ColumnMajor_SmallRange/GEMM (97 ms total)
gicmo commented 10 years ago

I might have some time available over the holidays to have a go on this issue if you would be interested in a fix for it.

kknox commented 10 years ago

I am interested in any fixes. If you take the time to hack code and contribute, I promise to review the pull-request and either provide feedback or merge.

We are currently investigating other clBLAS issues, so you wouldn't be duplicating any work by spending time on this.

abergeron commented 10 years ago

From what I investigated a while back, the OpenCL runtime on OS X supports a maximum local group size of 1 and you try to use 64 (I think). So it fails. I did not report it, because I was under the impression (from what I've read in the docs) that the CPU target was a very low-priority one and not much effort would be spent on it.

To fix it, you would need to test for the maximum local group size of the device and adapt the code to deal with it. This is a rather involved operation since supporting the same operations with effectively no local size is a shift in algorithmic thinking and probably very hard. The easiest way is probably to introduce looping in the kernels (if there isn't already) to simulate a 64 (or whatever) group size. But since I haven't delved in the depths of kernel generation I'm not sure how hard it would be to actually do that in clBLAS.

2013/12/22 Kent Knox notifications@github.com

I am interested in any fixes. If you take the time to hack code and contribute, I promise to review the pull-request and either provide feedback or merge.

We are currently investigating other clBLAS issues, so you wouldn't be duplicating any work by spending time on this.

— Reply to this email directly or view it on GitHubhttps://github.com/clMathLibraries/clBLAS/issues/25#issuecomment-31098309 .

kknox commented 10 years ago

Hi @gicmo I agree with @abergeron; changing the workgroup size/shape may involve serious modifications to the kernels or the generators. I didn't personally write the code, but i believe that the tune tool plays with the workgroup size/shape; it may be a good place to trace to see how it works.

pavanky commented 9 years ago

This issue is the same as the issue here: https://github.com/clMathLibraries/clBLAS/issues/60#issuecomment-72082840

This is a known problem in the OSX Intel CPU implementation. This can be resolved easily from clBLAS.

hughperkins commented 8 years ago

Do you mean to say 'this cannot be resolved easily from clBLAS'?

hughperkins commented 8 years ago

the OpenCL runtime on OS X supports a maximum local group size of 1

Oh .... wow :-O That limits a lot of things :-P Specifically any reduce type algos are going to look... different...