Can I just confirm that your request was for Kernel.execute(16), so the
globalSize we pass through JNI does match your request?
It looks like the calculation we do for localSize is failing if localSize is
small (possibly less than 64).
As a workaround, specify 64 as your globalSize (kernel.execute(64)) and guard
your kernel using:
new Kernel(){
   public void run(){
      if (getGlobalId() < 16){
         // your code here
      }
   }
};
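For completeness, here is a rough sketch of how that padded call might look end
to end (assuming com.amd.aparapi.Kernel is imported; the array and the
assignment inside the guard are just placeholders, not from your code):

final int[] data = new int[16];
Kernel kernel = new Kernel(){
   public void run(){
      if (getGlobalId() < 16){
         data[getGlobalId()] = getGlobalId(); // your real work here
      }
   }
};
kernel.execute(64); // padded global size; only the first 16 ids do any work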
Apologies for this. Clearly we need some test cases for low range values.
Note that unless your kernel is doing a lot of work (computation + loops), it
is unlikely that a kernel with such a small 'range' will be very performant.
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 4:20
Nope, my call was Kernel.execute(32, 1).
It looks like the value passed through JNI is correct, but it is changed at
line 1073.
I have tried with sizes 32, 64, 128, 256 and 512, and all have the same problem:
they show "global=x/2 local=x" in the error message.
Original comment by kenneth@hexad.dk
on 15 Nov 2011 at 4:23
Oh bugger. How did that ever work ;) Let me take a closer look.
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 4:29
So this is a remnant of my attempt to push compute across multiple devices. I
thought I had backed this code out before open sourcing.
My intent was to dispatch half the compute to one device and half to the other
(your 6990 is seen as two separate GPU devices - you probably knew that
already), but this required the Kernel to be very careful and to allow the
buffers to be divided equally.
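To illustrate the arithmetic that goes wrong (this is only an illustration, not
the actual code in aparapi.cpp):

int requestedGlobal = 32;   // from Kernel.execute(32, 1)
int deviceIdc = 2;          // the card is reported as two GPU devices
int perDeviceGlobal = requestedGlobal / deviceIdc;   // 16 per device
// Unless the local size is also adjusted so that it divides perDeviceGlobal
// (and does not exceed it), clEnqueueNDRangeKernel rejects the launch, which
// matches the "global=x/2 local=x" error reported above.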
I can fix this (i.e. make it work), but I suspect that you will be disappointed,
because the fix will mean that only one half of your GPU will be used (as will
be the case for any other dual-device card - I have a 5990 here which I can test
with, and it will exhibit the same error).
Clearly I have not tested enough with this card.
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 4:36
Here is a suggested hack to get you up and running.
Around line #446 in aparapi.cpp we have:
// Get the # of devices
status = clGetDeviceIDs(platforms[i], deviceType, 0, NULL, &deviceIdc);
// now check if this platform supports the requested device type (GPU or CPU)
if (status == CL_SUCCESS && deviceIdc > 0){
   platform = platforms[i];
   ...
Add deviceIdc = 1; as the first statement in the conditional, giving us:
// Get the # of devices
status = clGetDeviceIDs(platforms[i], deviceType, 0, NULL, &deviceIdc);
// now check if this platform supports the requested device type (GPU or CPU)
if (status == CL_SUCCESS && deviceIdc > 0){
   deviceIdc = 1; // Hack here for issue #18
   platform = platforms[i];
   ...
Hopefully this will get you back up and running. I need to decide whether to
re-enable (and fix) multiple device support or whether to remove it. This will
need some more work.
Again, apologies for this, and also apologies that you are discovering all these
bugs. I do appreciate your help uncovering them.
Gary
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 4:48
It is a new project; I do not expect it to be free of bugs.
It is also a strange field, working with high-level languages and low-level
execution, so it will take some time for a project like this to mature and
attract users.
Anyway, I am a PhD student, so I actually get paid for trying stuff like this
and finding/fixing errors :)
If you want to fix it, there could be an issue with uneven workloads, say 4
devices and global/local = 5; perhaps just revert to "single unit" or something
in that case, along the lines sketched below. It is also problematic that the
data needs to be copied multiple times, and merging the results back could be a
real problem.
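Just to sketch the fallback I have in mind (illustrative Java, not Aparapi code;
the method name is made up):

// Split a global size across N devices, reverting to a single "unit" when the
// work does not divide evenly.
static int[] splitRange(int globalSize, int deviceCount){
   if (deviceCount <= 1 || globalSize % deviceCount != 0){
      return new int[]{ globalSize };   // fall back to a single device
   }
   int[] chunks = new int[deviceCount];
   java.util.Arrays.fill(chunks, globalSize / deviceCount);
   return chunks;                       // one equal chunk per device
}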
I will apply the deviceIdc = 1 fix, re-compile the library and test tomorrow.
Thanks for making the project open-source and for actually responding to these
reports :)
Original comment by kenneth@hexad.dk
on 15 Nov 2011 at 6:23
Just to add some confusion: I tested with my 5970 (I mistyped earlier when I
referenced a 5990), and it gets detected as two devices. It worked (but was much
slower) when sharing execution across devices: Mandel, for example, was 20fps,
versus 55fps when I applied the suggested hack above. NBody also slowed
considerably.
This needs a lot of thought; I agree that unbalanced workloads will be even
scarier.
Maybe we need to expose the devices, so the user can request multiple devices
if they feel it will be a benefit. I really wanted to avoid this.
I note that JOCL has a method which discovers the device with the most FLOPS.
Another idea might be to run in both modes initially (assuming I/we fix the
bug ;)) and then 'learn' which is most performant. Hmmm.
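Just to sketch that last idea (illustrative only; it assumes we had some way of
running the same kernel in each mode):

// Time one run in each mode and remember which was faster.
static boolean multiDeviceIsFaster(Runnable singleDeviceRun, Runnable multiDeviceRun){
   long t0 = System.nanoTime();
   singleDeviceRun.run();
   long singleNs = System.nanoTime() - t0;

   long t1 = System.nanoTime();
   multiDeviceRun.run();
   long multiNs = System.nanoTime() - t1;

   return multiNs < singleNs;   // subsequent executes would use the faster mode
}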
Let me know if the hack above at least works for you.
Gary
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 6:35
Revision #110 contains the above hack if you want to try it out.
I guarded the warning behind the -Dcom.amd.aparapi.enableVerboseJNI=true flag.
Will keep this open because (as indicated above) this is not a fix, just a
workaround.
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 6:46
The workaround enables Aparapi to run the sample applications, and it is pretty
fast on the AMD-based machine, but the NVidia machine is now running slower
than the Java version. The strange thing is that the JOCL version runs fast on
both machines.
Original comment by kenneth@hexad.dk
on 16 Nov 2011 at 2:04
Does the NVidia machine report its card as multiple devices? Is that why it is
being negatively impacted by this workaround?
If so, I guess we could make this 'hack' conditional, i.e. apply it only for AMD
devices, if that helps.
Can we also confirm that the NVidia driver is OpenCL 1.1?
Original comment by frost.g...@gmail.com
on 16 Nov 2011 at 5:37
Yep, the NVidia machine reports the same "two devices". It did not work before
the workaround; it gave the exact same error as the AMD machine.
Making the hack optional does not solve the issue, because then we go back to
the original problem.
Yes, it reports OpenCL 1.1.
I will have a go tomorrow at figuring out why this happens. I can compare what
JOCL does with what Aparapi does and hopefully guess where it goes wrong.
Original comment by kenneth@hexad.dk
on 16 Nov 2011 at 6:40
After running some more tests, I can see that the NVidia machine does in fact
offer a speedup.
On the AMD machine, the speedup obtained through Aparapi and JOCL is pretty
much the same, with JOCL only slightly faster (~2%).
On the NVidia machine the difference is much larger (~40%). After scaling the
problem to a suitable size, there is a clear performance gain using either
method, though, so the hack does work correctly on the NVidia machine as well.
Looking at the generated OpenCL code, there is really no difference from the
hand-written OpenCL, except that the Aparapi version uses a few local
variables. This is not really related to the original issue, though, and is
likely just some special case where the NVidia kernel is slower.
Original comment by kenneth@hexad.dk
on 25 Nov 2011 at 12:26
Kenneth, I think the recent Range-related changes should have fixed this.
Can you confirm for me?
Gary
Original comment by frost.g...@gmail.com
on 23 Feb 2012 at 8:13
Based on the final comment, and the fact that the last activity was over a year
ago, this issue can likely be closed.
Original comment by lats...@gmail.com
on 29 Mar 2013 at 11:35
Yes, I think you can close it.
I do not have access to the machines that exhibited the problem anymore, so I
cannot verify.
Original comment by kenneth@hexad.dk
on 1 Apr 2013 at 11:19
Original issue reported on code.google.com by
kenneth@hexad.dk
on 15 Nov 2011 at 4:01