Closed gicmo closed 9 years ago
I might have some time available over the holidays to have a go on this issue if you would be interested in a fix for it.
I am interested in any fixes. If you take the time to hack code and contribute, I promise to review the pull-request and either provide feedback or merge.
We are currently investigating other clBLAS issues, so you wouldn't be duplicating any work by spending time on this.
From what I investigated a while back, the OpenCL runtime on OS X supports a maximum local group size of 1 and you try to use 64 (I think). So it fails. I did not report it, because I was under the impression (from what I've read in the docs) that the CPU target was a very low-priority one and not much effort would be spent on it.
To fix it, you would need to test for the maximum local group size of the device and adapt the code to deal with it. This is a rather involved operation since supporting the same operations with effectively no local size is a shift in algorithmic thinking and probably very hard. The easiest way is probably to introduce looping in the kernels (if there isn't already) to simulate a 64 (or whatever) group size. But since I haven't delved in the depths of kernel generation I'm not sure how hard it would be to actually do that in clBLAS.
2013/12/22 Kent Knox notifications@github.com
I am interested in any fixes. If you take the time to hack code and contribute, I promise to review the pull-request and either provide feedback or merge.
We are currently investigating other clBLAS issues, so you wouldn't be duplicating any work by spending time on this.
— Reply to this email directly or view it on GitHubhttps://github.com/clMathLibraries/clBLAS/issues/25#issuecomment-31098309 .
Hi @gicmo I agree with @abergeron; changing the workgroup size/shape may involve serious modifications to the kernels or the generators. I didn't personally write the code, but i believe that the tune tool plays with the workgroup size/shape; it may be a good place to trace to see how it works.
This issue is the same as the issue here: https://github.com/clMathLibraries/clBLAS/issues/60#issuecomment-72082840
This is a known problem in the OSX Intel CPU implementation. This can be resolved easily from clBLAS.
Do you mean to say 'this cannot be resolved easily from clBLAS'?
the OpenCL runtime on OS X supports a maximum local group size of 1
Oh .... wow :-O That limits a lot of things :-P Specifically any reduce type algos are going to look... different...
It seems that tests consistently fail with the error
-54
when executed on the CPU on OSX. Looking at the OpenCL Framework header, error code-54
stands forCL_INVALID_WORK_GROUP_SIZE
(see below). I did some initial investigations on where this could come from. After some digging around, I checked theCL_DEVICE_MAX_WORK_ITEM_SIZES
for the CPU and this is1024, 1, 1
for CPU (as compared to1024, 1024, 64
for the GPU). With the debugger I checked, and e.g. for the xGEMM test, and thelocal_work_size
arguments toclEnqueueNDRangeKernel
are8, 8
(see below).I reproduced this bug with a small test program and indeed any
local_work_size[1] > 1
will fail with error -54. (Btw, I my opinion it should be CL_INVALID_WORK_ITEM_SIZE [-55] not CL_INVALID_WORK_GROUP_SIZE[-54] according to http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html)I guess that the fix would be to make sure all the device constraints (e.g.
CL_DEVICE_MAX_WORK_ITEM_SIZES
) are fulfilled when calculating the work group sizes (I guess somewhere betweengetStepGranulation
andsubgGetDefaultDecomp
but I have to dig deeper).