Juergen,
Thanks for filing this, and especially for including the verboseJNI output.
I think this is a known Apple issue (actually an Aparapi issue that only shows up
on Apple OSX). The default group size we choose in Aparapi is 256, but some
Apple OpenCL implementations have a maximum of 128. The good news is we can
work around it using the new Range class.
So instead of calling
kernel.execute(n);
which creates a Range internally using the default group/local size of 256,
We can create a Range manually with 128 as the max group size.
Range range = Range.create(n, 128);
kernel.execute(range);
Or alternatively
kernel.execute(Range.create(n, 128));
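Putting it together, a minimal self-contained sketch of the workaround (the class name and kernel body are just illustrative; the only real change is passing 128 to Range.create):

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class GroupSizeWorkaround {
    public static void main(String[] args) {
        final int n = 512 * 512;          // global size, a multiple of 128
        final float[] result = new float[n];

        Kernel kernel = new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                result[i] = i * 2.0f;     // trivial body, just to exercise the device
            }
        };

        // Cap the group/local size at 128 instead of Aparapi's default of 256.
        kernel.execute(Range.create(n, 128));
        kernel.dispose();
    }
}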
Can you try this and see if it works for you? Unfortunately I did not have a
Mac OSX machine on hand when I chose the defaults; I tried to pick defaults
that would work with the Linux/Windows AMD/NVidia drivers. It looks like the
Apple drivers do not support them.
Gary
Original comment by frost.g...@gmail.com
on 20 Jun 2012 at 5:44
I've tried what you suggested in the mandelbrot demo by replacing
final Range range = Range.create(width * height);
with
final Range range = Range.create(width * height, 128);
But this made no difference. There's still the warning that CL exec seems to
have failed.
Original comment by juergen....@stayfriends.com
on 21 Jun 2012 at 5:25
That is strange. I was keying off this error message:
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
I am not sure whether the clinfo command is available with the Mac OSX OpenCL
driver. If it is, you should try executing it; it will report the maximum
group sizes/dimensions for your driver.
c:\clinfo
.....
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
....
Maybe a group size of 128 is still too big. Maybe you can try 64? 32?
final Range range = Range.create(width * height, 64);
The only rule to obey is that (width * height) % groupsize must be 0, i.e. the
global size must be a multiple of the group size.
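If you want to derive a legal group size from a device limit automatically, something like this (my own helper, not an Aparapi API) picks the largest value that divides the global size:

// Illustrative helper, not part of Aparapi: largest group size <= maxGroupSize
// that evenly divides globalSize (a group size of 1 always qualifies).
static int pickGroupSize(int globalSize, int maxGroupSize) {
    for (int candidate = maxGroupSize; candidate > 1; candidate--) {
        if (globalSize % candidate == 0) {
            return candidate;
        }
    }
    return 1;
}

// e.g. Range.create(width * height, pickGroupSize(width * height, 128));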
If you don't have clinfo available and are using the latest Aparapi trunk, you
can query the group sizes from Java using:
OpenCLDevice device = (OpenCLDevice)Device.best();
System.out.println("dim 0 "+device.getMaxWorkItemSize()[0]);
System.out.println("dim 1 "+device.getMaxWorkItemSize()[1]);
System.out.println("dim 2 "+device.getMaxWorkItemSize()[2]);
Gary
Original comment by frost.g...@gmail.com
on 21 Jun 2012 at 6:38
Strange. With the latest trunk build everything works perfectly. Nevertheless, here's the
output you requested:
dim 0 1024
dim 1 1024
dim 2 1024
So maybe we can just close this issue.
Original comment by juergen....@gmail.com
on 22 Jun 2012 at 5:34
That's scary, I am not sure we have applied any patches that directly fix this ;)
Do you have multiple GPUs? Is it possible that Aparapi (which
previously chose the first GPU, and now requests the 'best' GPU)
is now using a different GPU?
Thanks for continuing to provide feedback. As you can tell, I don't
have access to Mac OSX machines. I really do welcome this information.
I will probably keep this open until we create binary downloads that
include this 'fix'.
Original comment by frost.g...@gmail.com
on 22 Jun 2012 at 5:27
Indeed my MacBook has a second GPU: Intel HD Graphics 3000 512 MB.
Original comment by juergen....@stayfriends.com
on 22 Jun 2012 at 5:50
Maybe this helps:
for (OpenCLPlatform platform : OpenCLPlatform.getPlatforms()) {
    System.out.println("platform: " + platform);
    for (OpenCLDevice device : platform.getDevices()) {
        System.out.println(" device: " + device);
    }
}
leads to:
platform: PlatformId 2147418112
Name:Apple
Version:OpenCL 1.1 (Apr 9 2012 19:41:45)
device: Device 4294967295
type:CPU
maxComputeUnits=8
maxWorkItemDimensions=3
maxWorkItemSizes={1024, 1, 1}
maxWorkWorkGroupSize=1024
globalMemSize=8589934592
localMemSize=32768
device: Device 16915200
type:GPU
maxComputeUnits=6
maxWorkItemDimensions=3
maxWorkItemSizes={1024, 1024, 1024}
maxWorkWorkGroupSize=1024
globalMemSize=1073741824
localMemSize=32768
Original comment by juergen....@stayfriends.com
on 22 Jun 2012 at 6:00
Could you try this utility http://codykrieger.com/gfxCardStatus and report if
it makes any difference when you force OS X to use either an integrated or
discrete GPU?
Original comment by ryan.lam...@gmail.com
on 23 Jul 2012 at 11:16
Using the integrated or the discrete GPU didn't change anything.
Using integrated GPU (Intel HD 3000):
platform name 0 Apple
platform version 0 OpenCL 1.1 (Apr 9 2012 19:41:45)
platform Apple supports requested device type
device[0xffffffff]: Type: CPU
in setArgs arg 0 width type 00000108
in setArgs arg 0 width is *not* local
in setArgs arg 1 scale type 00000104
in setArgs arg 1 scale is *not* local
in setArgs arg 2 offsetx type 00000104
in setArgs arg 2 offsetx is *not* local
in setArgs arg 3 height type 00000108
in setArgs arg 3 height is *not* local
in setArgs arg 4 offsety type 00000104
in setArgs arg 4 offsety is *not* local
in setArgs arg 5 rgb type 00001688
in setArgs arg 5 rgb is *not* local
in setArgs arg 6 pallette type 00002288
in setArgs arg 6 pallette is *not* local
got type for width: 00000108
got type for scale: 00000104
got type for offsetx: 00000104
got type for height: 00000108
got type for offsety: 00000104
got type for rgb: 00001688
testing for Resync javaArray rgb: old=0x0, new=0x7fd2e363d8e8
Resync javaArray for rgb: 0x7fd2e363d8e8 0x0
NewWeakGlobalRef for rgb, set to 0x7fd2e3415010
updateNonPrimitiveReferences, args[5].sizeInBytes=2359296
got type for pallette: 00002288
testing for Resync javaArray pallette: old=0x0, new=0x7fd2e363d8f0
Resync javaArray for pallette: 0x7fd2e363d8f0 0x0
NewWeakGlobalRef for pallette, set to 0x7fd2e3415018
updateNonPrimitiveReferences, args[6].sizeInBytes=260
back from updateNonPrimitiveReferences
got type for arg 0, width, type=00000108
clSetKernelArg width: 0 0 4 0x00000300
got type for arg 1, scale, type=00000104
clSetKernelArg scale: 1 1 4 0x40400000
got type for arg 2, offsetx, type=00000104
clSetKernelArg offsetx: 2 2 4 0xbf800000
got type for arg 3, height, type=00000108
clSetKernelArg height: 3 3 4 0x00000300
got type for arg 4, offsety, type=00000104
clSetKernelArg offsety: 4 4 4 0x00000000
got type for arg 5, rgb, type=00001688
runKernel: arrayOrBuf ref 0x7fd2e3415010, oldAddr=0x0, newAddr=0x7e61fafe8,
ref.mem=0x0, isArray=1
at memory addr 0x7e61fafe8, contents: 00 00 00 00 00 00 00 00
rgb 5 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE,
size=00240000 bytes, address=e61fafe8, &status)
writing buffer 5 rgb
got type for arg 6, pallette, type=00002288
runKernel: arrayOrBuf ref 0x7fd2e3415018, oldAddr=0x0, newAddr=0x7e6731750,
ref.mem=0x0, isArray=1
at memory addr 0x7e6731750, contents: 00 00 ff ff 00 18 ff ff
pallette 6 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY,
size=00000104 bytes, address=e6731750, &status)
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize_0=589824 localSize_0=32
Jul 25, 2012 10:18:38 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
Execution mode=JTP
Using discrete (AMD 6750M):
platform name 0 Apple
platform version 0 OpenCL 1.1 (Apr 9 2012 19:41:45)
platform Apple supports requested device type
device[0xffffffff]: Type: CPU
in setArgs arg 0 width type 00000108
in setArgs arg 0 width is *not* local
in setArgs arg 1 scale type 00000104
in setArgs arg 1 scale is *not* local
in setArgs arg 2 offsetx type 00000104
in setArgs arg 2 offsetx is *not* local
in setArgs arg 3 height type 00000108
in setArgs arg 3 height is *not* local
in setArgs arg 4 offsety type 00000104
in setArgs arg 4 offsety is *not* local
in setArgs arg 5 rgb type 00001688
in setArgs arg 5 rgb is *not* local
in setArgs arg 6 pallette type 00002288
in setArgs arg 6 pallette is *not* local
got type for width: 00000108
got type for scale: 00000104
got type for offsetx: 00000104
got type for height: 00000108
got type for offsety: 00000104
got type for rgb: 00001688
testing for Resync javaArray rgb: old=0x0, new=0x7fc88040be28
Resync javaArray for rgb: 0x7fc88040be28 0x0
NewWeakGlobalRef for rgb, set to 0x7fc8823026e0
updateNonPrimitiveReferences, args[5].sizeInBytes=2359296
got type for pallette: 00002288
testing for Resync javaArray pallette: old=0x0, new=0x7fc88040be30
Resync javaArray for pallette: 0x7fc88040be30 0x0
NewWeakGlobalRef for pallette, set to 0x7fc8823026e8
updateNonPrimitiveReferences, args[6].sizeInBytes=260
back from updateNonPrimitiveReferences
got type for arg 0, width, type=00000108
clSetKernelArg width: 0 0 4 0x00000300
got type for arg 1, scale, type=00000104
clSetKernelArg scale: 1 1 4 0x40400000
got type for arg 2, offsetx, type=00000104
clSetKernelArg offsetx: 2 2 4 0xbf800000
got type for arg 3, height, type=00000108
clSetKernelArg height: 3 3 4 0x00000300
got type for arg 4, offsety, type=00000104
clSetKernelArg offsety: 4 4 4 0x00000000
got type for arg 5, rgb, type=00001688
runKernel: arrayOrBuf ref 0x7fc8823026e0, oldAddr=0x0, newAddr=0x7e5ffedc0,
ref.mem=0x0, isArray=1
at memory addr 0x7e5ffedc0, contents: 00 00 00 00 00 00 00 00
rgb 5 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE,
size=00240000 bytes, address=e5ffedc0, &status)
writing buffer 5 rgb
got type for arg 6, pallette, type=00002288
runKernel: arrayOrBuf ref 0x7fc8823026e8, oldAddr=0x0, newAddr=0x7e6532d20,
ref.mem=0x0, isArray=1
at memory addr 0x7e6532d20, contents: 00 00 ff ff 00 18 ff ff
pallette 6 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY,
size=00000104 bytes, address=e6532d20, &status)
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize_0=589824 localSize_0=32
Jul 25, 2012 10:19:49 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
Execution mode=JTP
Original comment by juergen....@gmail.com
on 25 Jul 2012 at 8:21
So at this point I think we need to drop down to OpenCL and see if a simple
OpenCL example will run.
The error 'invalid work group size' looks to be bogus. You are asking for
589824 items in groups of 32 (589824 % 32 == 0, which is good). I cannot see why
this should be a problem.
I would take a simple OpenCL example such as this one from Apple:
http://developer.apple.com/library/mac/#samplecode/OpenCL_Hello_World_Example/Introduction/Intro.html
and see if you can get it to run.
Then I would increase the test code to match your global/local size (589824, 32)
to see if it will run.
This looks like a driver issue to me. Sorry I can't be more help.
Original comment by frost.g...@gmail.com
on 25 Jul 2012 at 10:45
I've managed to get the sample code running. With the default settings the output
is (I added a debug message to print the local work group size):
local work group size: 256
Computed '1024/1024' correct values!
When I increase the global work group size to 589824 (instead of 1024) the
output is:
local work group size: 256
Computed '589824/589824' correct values!
Setting the local work group size to 32:
local work group size: 32
Computed '589824/589824' correct values!
Original comment by juergen....@gmail.com
on 26 Jul 2012 at 6:48
To be clear, by sample code you mean the sample C & OpenCL demo from
http://developer.apple.com/library/mac/#samplecode/OpenCL_Hello_World_Example/Introduction/Intro.html ?
Hmm. So this clearly is an Aparapi issue. At this point I would normally be
trying to single-step through the JNI code to see if I can see what is
happening. But alas, I do not have access to an OSX machine.
Juergen, are you building your Aparapi from the source/trunk?
If so, would you be interested/prepared to hack some JNI code with me to track
this down?
Original comment by frost.g...@gmail.com
on 26 Jul 2012 at 8:26
Yes, I meant the sample code you posted. I used the prebuilt library from the
downloads page; using the latest source/trunk everything runs perfectly. So just let me
know how I can assist you. I'm interested and prepared to hack some JNI code :-)
Original comment by juergen....@gmail.com
on 26 Jul 2012 at 8:31
I definitely want to know how this ends -- I have the same machine, and want to
get some Aparapi code working.
Original comment by keybou...@gmail.com
on 11 Oct 2012 at 1:45
Please try the current trunk version (rev. 864) - I've committed a change that
queries the underlying OpenCL driver for the maximum work group sizes and limits
them according to the implementation's capabilities.
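For anyone following along, the change boils down to clamping the requested local size to what the device reports; a rough sketch of the idea (not the actual rev. 864 code, names are illustrative):

// Sketch of the clamping logic only, not the actual trunk implementation.
static int clampLocalSize(int requestedLocalSize, int globalSize, int deviceMaxGroupSize) {
    int localSize = Math.min(requestedLocalSize, deviceMaxGroupSize);
    // step down until the global size is an exact multiple of the local size
    while (localSize > 1 && globalSize % localSize != 0) {
        localSize--;
    }
    return localSize;
}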
Original comment by wb...@jitsolutions.pl
on 14 Nov 2012 at 10:29
Works like a charm now! I think we can close this issue.
Original comment by juergen....@gmail.com
on 15 Nov 2012 at 6:45
Original comment by ryan.lam...@gmail.com
on 20 Apr 2013 at 12:21
Original issue reported on code.google.com by
juergen....@gmail.com
on 20 Jun 2012 at 5:49