Can I just confirm that your request was for Kernel.execute(16), so the
globalSize we pass through JNI does match your request?
It looks like the calculation we do for localSize is failing if localSize is
small (possibly less than 64).
As a workaround, specify 64 as your globalSize (kernel.execute(64)) and guard
your kernel using:
new Kernel(){
   public void run(){
      if (getGlobalId() < 16){
         // your code here
      }
   }
};
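For completeness, here is a rough sketch of how that padded call might look end
to end (assuming com.amd.aparapi.Kernel is imported; the array and the
assignment inside the guard are just placeholders, not from your code):

final int[] data = new int[16];
Kernel kernel = new Kernel(){
   public void run(){
      if (getGlobalId() < 16){
         data[getGlobalId()] = getGlobalId(); // your real work here
      }
   }
};
kernel.execute(64); // padded global size; only the first 16 ids do any work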
Apologies for this. Clearly we need some test cases for low range values.
Note that unless your kernel is doing a lot of work (computation + loops), it
is unlikely that a kernel with such a small 'range' will be very performant.
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 4:20
Nope, my call was Kernel.execute(32, 1).
It looks like the value passed through JNI is correct, but it is changed at
line 1073.
I have tried with sizes 32, 64, 128, 256 and 512, and all have the same problem:
they show "global=x/2 local=x" in the error message.
Original comment by kenneth@hexad.dk
on 15 Nov 2011 at 4:23
Oh bugger. How did that ever work ;) Let me take a closer look.
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 4:29
So this is a remnant of my attempt to push compute across multiple devices. I
thought I had backed this code out before open sourcing.
My intent was to dispatch half the compute to one device and half to the other
(your 6990 is seen as two separate GPU devices - you probably knew that
already), but this required the Kernel to be very careful and to allow the
buffers to be divided equally.
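To illustrate the arithmetic that goes wrong (this is only an illustration, not
the actual code in aparapi.cpp):

int requestedGlobal = 32;   // from Kernel.execute(32, 1)
int deviceIdc = 2;          // the card is reported as two GPU devices
int perDeviceGlobal = requestedGlobal / deviceIdc;   // 16 per device
// Unless the local size is also adjusted so that it divides perDeviceGlobal
// (and does not exceed it), clEnqueueNDRangeKernel rejects the launch, which
// matches the "global=x/2 local=x" error reported above.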
I can fix this (i.e. make it work), but I suspect that you will be disappointed,
because the fix will mean that only one half of your GPU will be used (as will
be the case for any other dual-device card - I have a 5990 here which I can test
with, and it will exhibit the same error).
Clearly I have not tested enough with this card.
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 4:36
Here is a suggested hack to get you up and running.
Around line #446 in aparapi.cpp we have:
// Get the # of devices
status = clGetDeviceIDs(platforms[i], deviceType, 0, NULL, &deviceIdc);
// now check if this platform supports the requested device type (GPU or CPU)
if (status == CL_SUCCESS && deviceIdc > 0){
   platform = platforms[i];
   ...
Add deviceIdc = 1; as the first statement in the conditional, giving us:
// Get the # of devices
status = clGetDeviceIDs(platforms[i], deviceType, 0, NULL, &deviceIdc);
// now check if this platform supports the requested device type (GPU or CPU)
if (status == CL_SUCCESS && deviceIdc > 0){
   deviceIdc = 1; // Hack here for issue #18
   platform = platforms[i];
   ...
Hopefully this will get you back up and running. I need to decide whether to
re-enable (and fix) multiple device support or whether to remove it. This will
need some more work.
Again, apologies for this, and also apologies that you are discovering all these
bugs. I do appreciate your help uncovering them.
Gary
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 4:48
It is a new project; I do not expect it to be free of bugs.
It is also a strange field, working with high-level languages and low-level
execution, so it will take some time for a project like this to mature and
attract users.
Anyway, I am a PhD student, so I actually get paid for trying stuff like this
and finding/fixing errors :)
If you want to fix it, there could be an issue with uneven workloads, say 4
devices and global/local = 5; perhaps just revert to "single unit" or something
in that case, along the lines sketched below. It is also problematic that the
data needs to be copied multiple times, and merging the results back could be a
real problem.
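Just to sketch the fallback I have in mind (illustrative Java, not Aparapi code;
the method name is made up):

// Split a global size across N devices, reverting to a single "unit" when the
// work does not divide evenly.
static int[] splitRange(int globalSize, int deviceCount){
   if (deviceCount <= 1 || globalSize % deviceCount != 0){
      return new int[]{ globalSize };   // fall back to a single device
   }
   int[] chunks = new int[deviceCount];
   java.util.Arrays.fill(chunks, globalSize / deviceCount);
   return chunks;                       // one equal chunk per device
}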
I will apply the deviceIdc = 1 fix, re-compile the library and test tomorrow.
Thanks for making the project open-source and for actually responding to these
reports :)
Original comment by kenneth@hexad.dk
on 15 Nov 2011 at 6:23
Just to add some confusion: I tested with my 5970 (I mistyped earlier when I
referenced a 5990), and it gets detected as two devices. It worked (but was much
slower) when sharing execution across devices: Mandel, for example, was 20fps,
versus 55fps when I applied the suggested hack above. NBody also slowed
considerably.
This needs a lot of thought; I agree that unbalanced workloads will be even
scarier.
Maybe we need to expose the devices, so the user can request multiple devices
if they feel it will be a benefit. I really wanted to avoid this.
I note that JOCL has a method which discovers the device with the most FLOPS.
Another idea might be to run in both modes initially (assuming I/we fix the
bug ;)) and then 'learn' which is most performant. Hmmm.
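Just to sketch that last idea (illustrative only; it assumes we had some way of
running the same kernel in each mode):

// Time one run in each mode and remember which was faster.
static boolean multiDeviceIsFaster(Runnable singleDeviceRun, Runnable multiDeviceRun){
   long t0 = System.nanoTime();
   singleDeviceRun.run();
   long singleNs = System.nanoTime() - t0;

   long t1 = System.nanoTime();
   multiDeviceRun.run();
   long multiNs = System.nanoTime() - t1;

   return multiNs < singleNs;   // subsequent executes would use the faster mode
}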
Let me know if the hack above at least works for you.
Gary
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 6:35
Revision #110 contains the above hack if you want to try it out.
I guarded the warning behind the -Dcom.amd.aparapi.enableVerboseJNI=true flag.
Will keep this open because (as indicated above) this is not a fix, just a
workaround.
Original comment by frost.g...@gmail.com
on 15 Nov 2011 at 6:46
The workaround enables Aparapi to run the sample applications, and it is pretty
fast on the AMD-based machine, but the NVidia machine is now running slower
than the Java version. The strange thing is that the JOCL version runs fast on
both machines.
Original comment by kenneth@hexad.dk
on 16 Nov 2011 at 2:04
Does the NVidia machine report its card as multiple devices? Is that why it is
being negatively impacted by this workaround?
If so, I guess we could make this 'hack' conditional, i.e. apply it only for AMD
devices, if that helps.
Can we also confirm that the NVidia driver is OpenCL 1.1?
Original comment by frost.g...@gmail.com
on 16 Nov 2011 at 5:37
Yep, the NVidia machine reports the same "two devices". It did not work before
the workaround; it gave the exact same error as the AMD machine.
Making the hack optional does not solve the issue, because then we go back to
the original problem.
Yes, it reports OpenCL 1.1.
I will have a go tomorrow at figuring out why this happens. I can compare what
JOCL does with what Aparapi does and hopefully guess where it goes wrong.
Original comment by kenneth@hexad.dk
on 16 Nov 2011 at 6:40
After running some more tests, I can see that the NVidia machine does in fact
offer a speedup.
On the AMD machine, the speedup obtained through Aparapi and JOCL is pretty
much the same, with JOCL only slightly faster (~2%).
On the NVidia machine the difference is much larger (~40%). After scaling the
problem to a suitable size, there is a clear performance gain using either
method, though, so the hack does work correctly on the NVidia machine as well.
Looking at the generated OpenCL code, there is really no difference from the
hand-written OpenCL, except that the Aparapi version uses a few local
variables. This is not really related to the original issue, though, and is
likely just some special case where the NVidia kernel is slower.
Original comment by kenneth@hexad.dk
on 25 Nov 2011 at 12:26
Kenneth, I think the recent Range-related changes should have fixed this.
Can you confirm for me?
Gary
Original comment by frost.g...@gmail.com
on 23 Feb 2012 at 8:13
Based on the final comment, and the fact that the last activity was over a year
ago, this issue can likely be closed.
Original comment by lats...@gmail.com
on 29 Mar 2013 at 11:35
Yes, I think you can close it.
I do not have access to the machines that exhibited the problem anymore, so I
cannot verify.
Original comment by kenneth@hexad.dk
on 1 Apr 2013 at 11:19
Original issue reported on code.google.com by
kenneth@hexad.dk
on 15 Nov 2011 at 4:01