PiRSquared17 / aparapi

Automatically exported from code.google.com/p/aparapi

Latest release aparapi-2012-05-06 runs samples only in CPU mode on Mac OS X #55

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Download latest release aparapi-2012-05-06.zip
2. Run for example the mandelbrot sample

What is the expected output? What do you see instead?
Expected output: the samples should utilize the GPU. Instead a warning is 
emitted that execution reverted to CPU mode:

platform name    0 Apple
platform version 0 OpenCL 1.1 (Apr  9 2012 19:41:45)
platform Apple supports requested device type
device[0xffffffff]: Type: CPU 
in setArgs arg 0 width type 00000108
in setArgs arg 0 width is *not* local
in setArgs arg 1 scale type 00000104
in setArgs arg 1 scale is *not* local
in setArgs arg 2 offsetx type 00000104
in setArgs arg 2 offsetx is *not* local
in setArgs arg 3 height type 00000108
in setArgs arg 3 height is *not* local
in setArgs arg 4 offsety type 00000104
in setArgs arg 4 offsety is *not* local
in setArgs arg 5 rgb type 00001688
in setArgs arg 5 rgb is *not* local
in setArgs arg 6 pallette type 00002288
in setArgs arg 6 pallette is *not* local
got type for width: 00000108
got type for scale: 00000104
got type for offsetx: 00000104
got type for height: 00000108
got type for offsety: 00000104
got type for rgb: 00001688
testing for Resync javaArray rgb: old=0x0, new=0x7f9a21c3df18
Resync javaArray for rgb: 0x7f9a21c3df18  0x0
NewWeakGlobalRef for rgb, set to 0x7f9a21c3e7a8
updateNonPrimitiveReferences, args[5].sizeInBytes=2359296
got type for pallette: 00002288
testing for Resync javaArray pallette: old=0x0, new=0x7f9a21c3df20
Resync javaArray for pallette: 0x7f9a21c3df20  0x0
NewWeakGlobalRef for pallette, set to 0x7f9a21c3e7b0
updateNonPrimitiveReferences, args[6].sizeInBytes=260
back from updateNonPrimitiveReferences
got type for arg 0, width, type=00000108
clSetKernelArg width: 0 0 4 0x00000300
got type for arg 1, scale, type=00000104
clSetKernelArg scale: 1 1 4 0x40400000
got type for arg 2, offsetx, type=00000104
clSetKernelArg offsetx: 2 2 4 0xbf800000
got type for arg 3, height, type=00000108
clSetKernelArg height: 3 3 4 0x00000300
got type for arg 4, offsety, type=00000104
clSetKernelArg offsety: 4 4 4 0x00000000
got type for arg 5, rgb, type=00001688
runKernel: arrayOrBuf ref 0x7f9a21c3e7a8, oldAddr=0x0, newAddr=0x7e6b41ac0, 
ref.mem=0x0, isArray=1
at memory addr 0x7e6b41ac0, contents: 00 00 00 00 00 00 00 00 
rgb 5 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE, 
size=00240000 bytes, address=e6b41ac0, &status)
 writing buffer 5 rgb
got type for arg 6, pallette, type=00002288
runKernel: arrayOrBuf ref 0x7f9a21c3e7b0, oldAddr=0x0, newAddr=0x7e7080460, 
ref.mem=0x0, isArray=1
at memory addr 0x7e7080460, contents: 00 00 ff ff 00 18 ff ff 
pallette 6 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, 
size=00000104 bytes, address=e7080460, &status)
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize_0=589824 localSize_0=128
Jun 20, 2012 7:44:19 AM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
Execution mode=JTP

What version of the product are you using? On what operating system?
* aparapi-2012-05-06
* Java: 1.7.0_06-ea-b14
* Mac OS X 10.7.4
* Intel Core i7 2.2 GHz, 8 GB Ram, AMD Radeon HD 6750M 1024 MB

Please provide any additional information below.

Original issue reported on code.google.com by juergen....@gmail.com on 20 Jun 2012 at 5:49

GoogleCodeExporter commented 9 years ago
Juergen,

Thanks for filing this, especially for including the verboseJNI output. 

I think this is a known Apple issue (actually an Aparapi issue that shows up 
only on Apple OS X).  The default group size we choose for Aparapi is 256, but 
some OpenCL/Apple implementations have a maximum of 128.  The good news is we 
can work around it using the new Range class. 

So instead of calling

kernel.execute(n);

This creates a default Range internally with a group/local size of 256. 

Instead, we can create a Range manually with 128 as the max group size: 

Range range = Range.create(n, 128);
kernel.execute(range);

Or alternatively 

kernel.execute(Range.create(n, 128));

Can you try this and see if it works for you? Unfortunately I did not have a 
Mac OS X machine on hand when I chose the defaults.  I did try to pick defaults 
that would work for Linux/Windows AMD/NVIDIA drivers.  Looks like the Apple 
drivers do not support our defaults. 

Gary

Original comment by frost.g...@gmail.com on 20 Jun 2012 at 5:44

GoogleCodeExporter commented 9 years ago
I've tried what you suggested in the mandelbrot demo by replacing

final Range range = Range.create(width * height);

with

final Range range = Range.create(width * height, 128);

But this made no difference. There's still the warning that CL exec seems to 
have failed.

Original comment by juergen....@stayfriends.com on 21 Jun 2012 at 5:25

GoogleCodeExporter commented 9 years ago
That is strange.  So I was keying off this error message: 

!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size

I am not sure if the clinfo command is available with the Mac OS X OpenCL 
driver.  If it is, you should try to execute it; it will report the maximum 
group sizes/dimensions for your driver. 

c:\clinfo
.....
  Max work items dimensions:                     3
    Max work items[0]:                           256
    Max work items[1]:                           256
    Max work items[2]:                           256
  Max work group size:                           256
....

Maybe a group size of 128 is still too big.  Maybe you can try 64? 32?

final Range range = Range.create(width * height, 64);

The only rule to obey is that (width*height) % groupSize must be 0, i.e. the 
global size must be a multiple of the group size.
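The rule above can be sketched as plain Java (a hypothetical standalone helper for illustration, not part of the Aparapi API): verify that a global/group pair is valid, or round the global size up to the next multiple of the group size.

```java
// Hypothetical helper illustrating the rule above: the global size
// must be an exact multiple of the group (local) size, so either
// verify the pair or round the global size up to the next multiple.
public class GroupSizeCheck {

    static boolean isValidPair(int globalSize, int groupSize) {
        return groupSize > 0 && globalSize % groupSize == 0;
    }

    // Round globalSize up so that it becomes a multiple of groupSize.
    static int padGlobalSize(int globalSize, int groupSize) {
        int rem = globalSize % groupSize;
        return rem == 0 ? globalSize : globalSize + (groupSize - rem);
    }

    public static void main(String[] args) {
        int width = 768, height = 768;  // mandelbrot sample: 589824 items
        System.out.println(isValidPair(width * height, 128)); // 589824 % 128 == 0
        System.out.println(padGlobalSize(1000, 128));         // padded to 1024
    }
}
```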

If you don't have clinfo available and are using the latest Aparapi trunk, you 
can query the group size from Java using: 

OpenCLDevice device = (OpenCLDevice) Device.best();
System.out.println("dim 0 " + device.getMaxWorkItemSize()[0]);
System.out.println("dim 1 " + device.getMaxWorkItemSize()[1]);
System.out.println("dim 2 " + device.getMaxWorkItemSize()[2]);

Gary

Original comment by frost.g...@gmail.com on 21 Jun 2012 at 6:38

GoogleCodeExporter commented 9 years ago
Strange. With the latest trunk build everything works perfectly. Nevertheless, 
here's the output you requested:

dim 0 1024
dim 1 1024
dim 2 1024

So maybe we can just close this issue.

Original comment by juergen....@gmail.com on 22 Jun 2012 at 5:34

GoogleCodeExporter commented 9 years ago
That's scary, I am not sure we have applied any patches to directly fix this ;)

Do you have multiple GPUs ?  Is it possible that Aparapi (which
previously chose the first GPU, and is now requesting the 'best' GPU)
is now using a different GPU?

Thanks for continuing to provide feedback.  As you can tell, I don't
have access to Mac OSX machines. I really do welcome this information.

I will probably keep this open until we create binary downloads that
include this 'fix'.

Original comment by frost.g...@gmail.com on 22 Jun 2012 at 5:27

GoogleCodeExporter commented 9 years ago
Indeed my MacBook has a second GPU: Intel HD Graphics 3000 512 MB.

Original comment by juergen....@stayfriends.com on 22 Jun 2012 at 5:50

GoogleCodeExporter commented 9 years ago
Maybe this helps:

for (OpenCLPlatform platform : OpenCLPlatform.getPlatforms()) {
    System.out.println("platform: " + platform);
    for (OpenCLDevice device : platform.getDevices()) {
        System.out.println("  device: " + device);
    }
}

leads to:

platform: PlatformId 2147418112
Name:Apple
Version:OpenCL 1.1 (Apr  9 2012 19:41:45)
  device: Device 4294967295
  type:CPU
  maxComputeUnits=8
  maxWorkItemDimensions=3
  maxWorkItemSizes={1024, 1, 1}
  maxWorkWorkGroupSize=1024
  globalMemSize=8589934592
  localMemSize=32768
  device: Device 16915200
  type:GPU
  maxComputeUnits=6
  maxWorkItemDimensions=3
  maxWorkItemSizes={1024, 1024, 1024}
  maxWorkWorkGroupSize=1024
  globalMemSize=1073741824
  localMemSize=32768

Original comment by juergen....@stayfriends.com on 22 Jun 2012 at 6:00

GoogleCodeExporter commented 9 years ago
Could you try this utility http://codykrieger.com/gfxCardStatus and report if 
it makes any difference when you force OS X to use either an integrated or 
discrete GPU?

Original comment by ryan.lam...@gmail.com on 23 Jul 2012 at 11:16

GoogleCodeExporter commented 9 years ago
Forcing either the integrated or the discrete GPU didn't change anything.

Using the integrated GPU (Intel HD 3000):

platform name    0 Apple
platform version 0 OpenCL 1.1 (Apr  9 2012 19:41:45)
platform Apple supports requested device type
device[0xffffffff]: Type: CPU 
in setArgs arg 0 width type 00000108
in setArgs arg 0 width is *not* local
in setArgs arg 1 scale type 00000104
in setArgs arg 1 scale is *not* local
in setArgs arg 2 offsetx type 00000104
in setArgs arg 2 offsetx is *not* local
in setArgs arg 3 height type 00000108
in setArgs arg 3 height is *not* local
in setArgs arg 4 offsety type 00000104
in setArgs arg 4 offsety is *not* local
in setArgs arg 5 rgb type 00001688
in setArgs arg 5 rgb is *not* local
in setArgs arg 6 pallette type 00002288
in setArgs arg 6 pallette is *not* local
got type for width: 00000108
got type for scale: 00000104
got type for offsetx: 00000104
got type for height: 00000108
got type for offsety: 00000104
got type for rgb: 00001688
testing for Resync javaArray rgb: old=0x0, new=0x7fd2e363d8e8
Resync javaArray for rgb: 0x7fd2e363d8e8  0x0
NewWeakGlobalRef for rgb, set to 0x7fd2e3415010
updateNonPrimitiveReferences, args[5].sizeInBytes=2359296
got type for pallette: 00002288
testing for Resync javaArray pallette: old=0x0, new=0x7fd2e363d8f0
Resync javaArray for pallette: 0x7fd2e363d8f0  0x0
NewWeakGlobalRef for pallette, set to 0x7fd2e3415018
updateNonPrimitiveReferences, args[6].sizeInBytes=260
back from updateNonPrimitiveReferences
got type for arg 0, width, type=00000108
clSetKernelArg width: 0 0 4 0x00000300
got type for arg 1, scale, type=00000104
clSetKernelArg scale: 1 1 4 0x40400000
got type for arg 2, offsetx, type=00000104
clSetKernelArg offsetx: 2 2 4 0xbf800000
got type for arg 3, height, type=00000108
clSetKernelArg height: 3 3 4 0x00000300
got type for arg 4, offsety, type=00000104
clSetKernelArg offsety: 4 4 4 0x00000000
got type for arg 5, rgb, type=00001688
runKernel: arrayOrBuf ref 0x7fd2e3415010, oldAddr=0x0, newAddr=0x7e61fafe8, 
ref.mem=0x0, isArray=1
at memory addr 0x7e61fafe8, contents: 00 00 00 00 00 00 00 00 
rgb 5 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE, 
size=00240000 bytes, address=e61fafe8, &status)
 writing buffer 5 rgb
got type for arg 6, pallette, type=00002288
runKernel: arrayOrBuf ref 0x7fd2e3415018, oldAddr=0x0, newAddr=0x7e6731750, 
ref.mem=0x0, isArray=1
at memory addr 0x7e6731750, contents: 00 00 ff ff 00 18 ff ff 
pallette 6 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, 
size=00000104 bytes, address=e6731750, &status)
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize_0=589824 localSize_0=32
Jul 25, 2012 10:18:38 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
Execution mode=JTP

Using the discrete GPU (AMD 6750M):

platform name    0 Apple
platform version 0 OpenCL 1.1 (Apr  9 2012 19:41:45)
platform Apple supports requested device type
device[0xffffffff]: Type: CPU 
in setArgs arg 0 width type 00000108
in setArgs arg 0 width is *not* local
in setArgs arg 1 scale type 00000104
in setArgs arg 1 scale is *not* local
in setArgs arg 2 offsetx type 00000104
in setArgs arg 2 offsetx is *not* local
in setArgs arg 3 height type 00000108
in setArgs arg 3 height is *not* local
in setArgs arg 4 offsety type 00000104
in setArgs arg 4 offsety is *not* local
in setArgs arg 5 rgb type 00001688
in setArgs arg 5 rgb is *not* local
in setArgs arg 6 pallette type 00002288
in setArgs arg 6 pallette is *not* local
got type for width: 00000108
got type for scale: 00000104
got type for offsetx: 00000104
got type for height: 00000108
got type for offsety: 00000104
got type for rgb: 00001688
testing for Resync javaArray rgb: old=0x0, new=0x7fc88040be28
Resync javaArray for rgb: 0x7fc88040be28  0x0
NewWeakGlobalRef for rgb, set to 0x7fc8823026e0
updateNonPrimitiveReferences, args[5].sizeInBytes=2359296
got type for pallette: 00002288
testing for Resync javaArray pallette: old=0x0, new=0x7fc88040be30
Resync javaArray for pallette: 0x7fc88040be30  0x0
NewWeakGlobalRef for pallette, set to 0x7fc8823026e8
updateNonPrimitiveReferences, args[6].sizeInBytes=260
back from updateNonPrimitiveReferences
got type for arg 0, width, type=00000108
clSetKernelArg width: 0 0 4 0x00000300
got type for arg 1, scale, type=00000104
clSetKernelArg scale: 1 1 4 0x40400000
got type for arg 2, offsetx, type=00000104
clSetKernelArg offsetx: 2 2 4 0xbf800000
got type for arg 3, height, type=00000108
clSetKernelArg height: 3 3 4 0x00000300
got type for arg 4, offsety, type=00000104
clSetKernelArg offsety: 4 4 4 0x00000000
got type for arg 5, rgb, type=00001688
runKernel: arrayOrBuf ref 0x7fc8823026e0, oldAddr=0x0, newAddr=0x7e5ffedc0, 
ref.mem=0x0, isArray=1
at memory addr 0x7e5ffedc0, contents: 00 00 00 00 00 00 00 00 
rgb 5 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE, 
size=00240000 bytes, address=e5ffedc0, &status)
 writing buffer 5 rgb
got type for arg 6, pallette, type=00002288
runKernel: arrayOrBuf ref 0x7fc8823026e8, oldAddr=0x0, newAddr=0x7e6532d20, 
ref.mem=0x0, isArray=1
at memory addr 0x7e6532d20, contents: 00 00 ff ff 00 18 ff ff 
pallette 6 clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_READ_ONLY, 
size=00000104 bytes, address=e6532d20, &status)
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize_0=589824 localSize_0=32
Jul 25, 2012 10:19:49 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
Execution mode=JTP

Original comment by juergen....@gmail.com on 25 Jul 2012 at 8:21

GoogleCodeExporter commented 9 years ago
So at this point I think we need to drop to OpenCL and see if a simple OpenCL 
example will run. 

The error 'invalid work group size' looks to be bogus.  You are asking for 
589824 items in groups of 32 (589824 % 32 == 0, which is good).  I cannot see 
why this should be a problem.   

I would take a simple OpenCL example such as this one from Apple

http://developer.apple.com/library/mac/#samplecode/OpenCL_Hello_World_Example/Introduction/Intro.html

and see if you can get it to run. 

Then I would increase the test code to match your global/local size (589824, 32) 
to see if it will run.

This looks like a driver issue to me. Sorry I can't be of more help. 

Original comment by frost.g...@gmail.com on 25 Jul 2012 at 10:45

GoogleCodeExporter commented 9 years ago
I've managed to get the sample code running. With the default settings the 
output is (I added a debug message to print the local work group size):

local work group size: 256
Computed '1024/1024' correct values!

When I increase the global work group size to 589824 (instead of 1024) the 
output is:

local work group size: 256
Computed '589824/589824' correct values!

Setting the local work group size to 32:

local work group size: 32
Computed '589824/589824' correct values!

Original comment by juergen....@gmail.com on 26 Jul 2012 at 6:48

GoogleCodeExporter commented 9 years ago
To be clear, by sample code you mean the sample C & OpenCL demo from 

http://developer.apple.com/library/mac/#samplecode/OpenCL_Hello_World_Example/Introduction/Intro.html

Hmm.  So this clearly is an Aparapi issue. At this point I would normally be 
single-stepping through the JNI code to see what is happening.  But alas, I do 
not have access to an OS X machine.

Juergen are you building your Aparapi from the source/trunk? 

If so would you be interested/prepared to hack some JNI code with me to track 
this down?

Original comment by frost.g...@gmail.com on 26 Jul 2012 at 8:26

GoogleCodeExporter commented 9 years ago
Yes, I meant the sample code you posted. I used the prebuilt library from the 
downloads page. Using the latest source/trunk, everything runs perfectly. So 
just let me know how I can assist you. I'm interested and prepared to hack some 
JNI code :-)

Original comment by juergen....@gmail.com on 26 Jul 2012 at 8:31

GoogleCodeExporter commented 9 years ago
I definitely want to know how this ends -- I have the same machine, and want to 
get some aparapi code working.

Original comment by keybou...@gmail.com on 11 Oct 2012 at 1:45

GoogleCodeExporter commented 9 years ago
Please try the current trunk version (rev. 864) - I've entered a change that 
queries the underlying OpenCL driver for maximum work group sizes and limits 
them according to the implementation's capabilities. 
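The strategy described can be sketched in plain Java (a hypothetical helper for illustration, not the actual rev. 864 patch): cap the requested group size at the device's reported maximum, then step down to the nearest value that still divides the global size evenly.

```java
// Hypothetical sketch of the clamping strategy described above
// (not the actual Aparapi rev. 864 patch): take the group size the
// caller asked for, cap it at the device's reported maximum, then
// step down until it evenly divides the global size.
public class GroupSizeClamp {

    static int clampGroupSize(int globalSize, int requested, int deviceMax) {
        int size = Math.min(requested, deviceMax);
        while (size > 1 && globalSize % size != 0) {
            size--;
        }
        return size;
    }

    public static void main(String[] args) {
        // Aparapi's old default of 256 on a driver that only allows 128:
        System.out.println(clampGroupSize(589824, 256, 128)); // 128
    }
}
```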

Original comment by wb...@jitsolutions.pl on 14 Nov 2012 at 10:29

GoogleCodeExporter commented 9 years ago
Works like a charm now! I think we can close this issue.

Original comment by juergen....@gmail.com on 15 Nov 2012 at 6:45

GoogleCodeExporter commented 9 years ago

Original comment by ryan.lam...@gmail.com on 20 Apr 2013 at 12:21