Open freemo opened 7 years ago
IT should be noted this is not simply an issue with rolling over. if I change the calculate line to return (long) tc + 212600881053l;
instead it should produce the same result mathematically using addition rather than multiplication. Despite this the program actually runs successfully with this new edit producing the following result:
expected: 214748364700 result: 214748364700
Hi, thanks for opening this issue to here.
I think there is something wrong in OpenCl. I am sending kernel and host code in C language. It STILL gives wrong results on GPU.
What opencl implementation are you using and what cpu/gpu does the macbook have?
It's default implementation in macOS. I'm using 10.12 (16A320) MacBook Pro (Retina, 13-inch, Late 2013). It has i5 4258U CPU and Iris 5100 GPU
Are you using the amd app sdk?
Also in the calculate can you try it wi brackets around the cast, putting the multiplication outside of the brackets?
@CC007 The issue seems to occur on linux as well and on AMD App SDK. It appears this bug is not platform specific. IF you runt he code in the original post on your local computer will probably see the bug as well. Did you try? the bug behaves very oddly for me. for example even though id is always 0 if you remove the "+ id" part int he calculate call it wont break anymore.
Anyone check the c code i posted?
@savaskoc Not yet but i will give it a go this evening. If your saying it produces the same incorrect results however then I expect you to be correct that it is an opencl issue directly.
I think this is an OpenCl issue because GPU produces incorrect results independently from platform and/or language (C, Java, Python etc.).
@savaskoc You realize the bug is producing the result but clipping it to 32 bits? To use the test I posted above as the example (the same is true for the numbers you posted) here is the breakdown.
Expected result is 214748364700 which in binary would be:
0011 0001 1111 1111 1111 1111 1111 1111 1001 1100
actual result we get is 4294967196 which in binary is:
0000 0000 1111 1111 1111 1111 1111 1111 1001 1100
Basically just drops all but the last 32 bits.
So while this is a legitimate bug, it is definitely occurring due to mishandling of 64bit variables.
I'm aware of that. I tried another types than long but still gives incorrect results. Maybe some GPU's can't process 64 bit data types?
@savaskoc I think there is more to it than that. I think when i was testing i tried some variants that produced some very odd results. I need to test this again to make sure I'm remembering correctly but when I removed the id variable (which in this test is always 0 anyway so shouldnt make a difference) it actually caused the correct results to be produced. Once I saw that behavior it was apparent that we were talking about a legitimate bug, I just cant confirm yet if the bug is isolated to aparapi or an OpenCL issue in general yet (I need to do more testing).
@savaskoc also as I stated in the OP if you change the operation to addition but change the value of the operand to make it mathematically equivalent, it magically works. So the problem seems to occur only on multiplication but not addition. this leads me to believe it is a genuine error rather than a hardware compatibility issue or something.
Does opencl provide software 64bit support or hardware 64bit support (or both)?
My guess is that savaskoc is correct. I think that the GPU OpenCL runtime does not support long.
Aparapi will detect this for doubles, surprised it does not detect this for long.
@grfrost If I am correct, why (long) tc + 212600881053l;
line works but (long) tc * 100;
not?
Actually, I was about to retract ;) from the 1.0 spec https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/scalarDataTypes.html it does not look like cl_long is optional. So now I think I will blame Aparapi. My guess is that as the AST is built we end up with
(long)
|
*
tc 100
Instead of
* / \ (long) 100 | tc
Can you try
return 100 * (long) tc ;
and/or
return (long)(tc+0L) * 100;
(hope my diagrams make it unscathed)
I doubt. even original OpenCl does not produce correct results. Check my kernel above.
I ran the code on my computer (the code from freemo) and it runs as it should.
Using: CPU: 4670k GPU: 1070GTX Nvidia and intel openCL implementation
@CC007 You're saying that you get same results both GPU and CPU mode right? Can you try my kernel in C?
I will try that one next. I also tested freemo's code by setting it up to use the cpu opencl device
It generated within build phase. Did you add kernel.cl to compile sources?
@savaskoc The problem seems to be that I don't have any opencl sdk, only the driver and implementation afaik, as I am not using the AMD app SDK
I have a pc :)
I think you misunderstand, It wasn't the kernel.cl.h that caused a problem. I dont have the OpenCL/opencl.h, because I don't have an opencl SDK. I'm installing one now (Intel openCL sdk)
Ok, now that that is installed, it seems that there are types used that don't come from openCL
On my macbook pro the following yields the same error.
clang++ -framework OpenCL longtst.cpp -o longtst
#include <iostream>
#ifdef __APPLE__
#include <opencl/opencl.h>
#else
#include <CL/opencl.h>
#endif
#define DATA_SIZE 1
#define LONG_DATA_SIZE DATA_SIZE*sizeof(cl_long)
int main(int argc, char **argv){
long out[DATA_SIZE];
out[0]=0L;
// How many platforms are there ?
cl_uint platformc = 0;
clGetPlatformIDs(0, NULL, &platformc);
if (platformc >0){
// Extract a list of available platforms
cl_platform_id *platforms = new cl_platform_id[platformc];
clGetPlatformIDs(platformc, platforms, NULL);
cl_device_id device_id=0;
// loop through platforms until we have a valid GPU device
for (unsigned int i = 0; !device_id && i < platformc; ++i) {
clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 1, &device_id, nullptr);
}
delete[] platforms;
// only device_id context and command queue needed below
if (device_id){
cl_int err;
// Create a context
cl_context context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
// Create command queue for this context
cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &err);
// Here is our OpenCL kernel source for
const char *source =
"__constant int start = 2147483647;\n"
"static long calculate(int tc){\n"
" return (long)tc * 100;\n"
"}\n"
"__kernel void longtst(__global long *result){\n"
" int id = get_global_id(0);\n"
" result[id] = calculate(start + id);\n"
"}\n";
// Compile source
cl_program program = clCreateProgramWithSource(context, 1, (const char **) &source, NULL, &err);
err = clBuildProgram(program, 1, &(device_id), NULL, NULL, NULL);
// Extract and show any compile errors or warnings
if (err != CL_SUCCESS){
size_t len;
err = clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
if (len >0){
len++; // for '\0'
char *compile_log = (char *) malloc(len);
clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, len, (void *)compile_log, NULL);
std::cerr <<"log{"<<std::endl<< compile_log << std::endl<<"}"<<std::endl;
free (compile_log);
}
}
// A program can have more than one kernel, select the kernel we want to call
cl_kernel kernel = clCreateKernel(program, "longtst", &err);
// Create buffers which 'wrap' the host data
cl_mem outBuf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_WRITE_ONLY, LONG_DATA_SIZE, (void*)out, &err);
// Set any kernel args
err = clSetKernelArg(kernel, 0 , sizeof(cl_mem), &outBuf);
// An event list helps us dispatch efficiently
#define EVENTS 2
cl_event *events = new cl_event[EVENTS];
// Decide how to partition the execution (we choose 1 group 1 threads)
size_t globalRange = DATA_SIZE;
size_t localRange = DATA_SIZE;
// Enqueue the execution
err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalRange, &localRange, 0, NULL, &events[1]);
// Enqueue a read of 'out' data to the command queue
err = clEnqueueReadBuffer(command_queue, outBuf, CL_TRUE, 0, LONG_DATA_SIZE, out, 1, &events[1], &events[0]);
// Wait for all the dispatches to complete.
err = clWaitForEvents(EVENTS, events);
// Release and delete the events
for (int i=0; i<EVENTS; i++){
err = clReleaseEvent(events[i]);
}
delete[] events;
// Release mem objects
clReleaseMemObject(outBuf);
// Release Kernel
clReleaseKernel(kernel);
// Release Program
clReleaseProgram(program);
// Release Context
clReleaseContext(context);
// Release Command Queue
clReleaseCommandQueue(command_queue);
// Note that we don't releaese any type ending in _id
std::cout << out[0] << std::endl;
}
}
}
BTW If I switch to use CPU device the above code yields the correct result. So yes I think there is an OpenCL runtime GPU issue here
Retina 15 inch 2.8 Ghz Intel I7, Intel Iris Pro Graphics.
I tried to compile your example using: g++ -lOpenCL -o longtst main.cpp -I"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\include" -std=c++11 -L"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\lib\x64"
and got the following errors:
||=== Build: Debug in OpenCLDemo (compiler: GNU GCC Compiler) ===|
main.cpp||In function 'int main(int, char**)':|
main.cpp|40|warning: '_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)' is deprecated (declared at P:\Program Files (x86)\Intel\OpenCL SDK\6.3\include/CL/cl.h:1428) [-Wdeprecated-declarations]|
main.cpp|40|warning: '_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)' is deprecated (declared at P:\Program Files (x86)\Intel\OpenCL SDK\6.3\include/CL/cl.h:1428) [-Wdeprecated-declarations]|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetPlatformIDs@12'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetPlatformIDs@12'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetDeviceIDs@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateContext@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateCommandQueue@20'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateProgramWithSource@20'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clBuildProgram@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetProgramBuildInfo@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetProgramBuildInfo@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateKernel@12'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateBuffer@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clSetKernelArg@16'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clEnqueueNDRangeKernel@36'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clEnqueueReadBuffer@36'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clWaitForEvents@8'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseEvent@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseMemObject@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseKernel@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseProgram@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseContext@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseCommandQueue@4'|
||=== Build failed: 21 error(s), 2 warning(s) (0 minute(s), 0 second(s)) ===|
You are failing to link.
Check the dir where you 'think' your opencl.so is.
Also could it be that you are pointing to 64 bit lib and creating a 32 bit app?
Gary
Also I think the -l option has to appear after the L
I tried linking to the 32bit library as well and tried with the -l after the -L. After installing mingw-w64 or using cygwin g++ it had the same problems
Is there a suitable opencl.dll lib in "P:\Program Files (x86)\Intel\OpenCL SDK\6.3\lib\x64" ?
You must have an opencl dll on your machine if you have ever used Aparapi ;)
What is your command line now?
I do get an executable now, but it gives the following error: the application was unable to start correctly 0xc00007b
I tried a couple of the fixes that I found online, but they don't seem to fix the issue. I am using C::B IDE v16.01, running the mingw-w64 compiler v6.2.0 from Mingw-builds, Intel OpenCL SDK v6.3, trying to compile grfrost's code with the commands:
x86_64-w64-mingw32-g++.exe -Wall -g -std=c++11 -c -I"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\include" -L"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\lib\x64" -o obj\Debug\main.o main.cpp -lOpenCL
x86_64-w64-mingw32-g++.exe -L"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\lib\x64" -std=c++11 -o bin\Debug\OpenCLDemo.exe obj\Debug\main.o -lOpenCL
I tried netbean's c/c++ plugin as well but that was a disaster. Unless someone else knows the solution, I won't be trying to run the code to see if the overflow problem occurs.
"This error is most likely a result of a 32-bit (x86) executable trying to load a 64-bit (x64) DLL. You might have to adjust your PATH or copy DLLs to avoid this. For example, my PATH is set up to find the x64 version of d3dcompiler_46.dll (for DX11.1)."
Maybe your path has 32 dll's in it.
I don't really use Windows these days, there is a great tool for debugging this sort of thing called http://www.dependencywalker.com/.
If you install it and then point it at your executable, it will show you the dll's it is trying to load and thus the error.
Ok so few things.
@grfrost The DLL for Aparapi itself no longer needs to be installed. It is loaded into the path dynamically by the aparapi-jni project which is a dependency on aparapi. So if this particular DLL were the problem (loading the 32bit version on a 64 bit machine) then the problem would still be with aparapi rather than the user. However in the past on windows we have seen an error where if you try to load the 32 bit dll on a 64 bit system it refuses to load and throws an exception. According to @CC007 however the error discussed in this bug doesnt occur on windows but it has been observed on both Mac and Linux. The Mac dylib is 64bit only.
Note that OpenCL still needs to be installed manually of course. It is just the aparapi dll that no longer needs to be installed manually.
So based on these details I suspect it isnt the aparapi native library that is the issue. It doesnt rule out that a 32bit of opencl was installed or something of course. But the fact that all of us were able to get the error on mac and linux seems suspicious to this point (unless we all made the same mistake when setting up our environment?).
Another point I'm going to explore in a few minutes is to confirm @CC007 comment that it works on windows. This seems like an important detail that might help us to debug.
Another point to consider if this is a 32bit dll issue is how the dll to be loaded is chosen by aparapi-jni. It uses the arch property to determine if the system is 32bit or 64bit. However one caveat I have not been able to test yet is a 32bit JVM on a 64bit system. In this case the arch would be reported as 32bit on windows. Ergo this may be part of the problem. However since old aparapi (before aparapi-jni was written) still had this problem I am not convinced this is the issue.
Anyway, just all things to consider. I am investigating this as we speak to see if i can find any more clues.
So I just spent over an hour trying to tinker with this problem again and my results this time are more perplexing than the last.
First I went over to my windows box and ran the above code and it did not produce the error we had been seeing. Excited this might be a clue I wanted to double check my work and headed back over to linux and ran the original code I pasted in the first post of this issue. To my disbelief it no longer produced the erroneous results either, it appears to have magically started working. This was odd as the bug was consistently produced when I tried running this code yesterday.
Thinking this may have been an a problem with my use case; perhaps I made a mistake when i copy and pasted it to the issue. So I loaded the original code supplied by @savaskoc again to re-rerun it. Again to my amazement it no longer produced the bug on either linux or windows for me. The following is the code I just ran on linux that is now working but previously was not:
public class Main {
public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {
int num = 406816900 - 406816880;
TCKernel kernel = new TCKernel(406816880, num);
kernel.execute(num);
PrintWriter writer = new PrintWriter("numbers.txt");
kernel.saveResults(writer);
writer.flush();
}
public static class TCKernel extends Kernel {
long[] result;
int start;
public TCKernel(int start, int num) {
this.result = new long[num];
this.start = start;
}
@Override
public void run() {
result[getGlobalId()] = calculate(start + getGlobalId());
}
public long calculate(int tc) {
int num = tc;
int n9 = num % 10;
num /= 10;
int n8 = num % 10;
num /= 10;
int n7 = num % 10;
num /= 10;
int n6 = num % 10;
num /= 10;
int n5 = num % 10;
num /= 10;
int n4 = num % 10;
num /= 10;
int n3 = num % 10;
num /= 10;
int n2 = num % 10;
num /= 10;
int n1 = num % 10;
int odds = n1 + n3 + n5 + n7 + n9;
int evens = n2 + n4 + n6 + n8;
int n10 = (odds * 7 - evens) % 10;
int n11 = (odds + evens + n10) % 10;
return (long) tc * 100 + (n10 * 10 + n11);
}
public void saveResults(PrintWriter writer) {
writer.println("Result\t\tNum\t\tExpected");
for (int i = 0; i < result.length; i++) {
int tc = start + i;
writer.printf("%d\t%d\t%d%s", result[i], tc, calculate(tc), System.lineSeparator());
}
}
}
}
Result Num Expected
40681688012 406816880 40681688012
40681688180 406816881 40681688180
40681688258 406816882 40681688258
40681688326 406816883 40681688326
40681688494 406816884 40681688494
40681688562 406816885 40681688562
40681688630 406816886 40681688630
40681688708 406816887 40681688708
40681688876 406816888 40681688876
40681688944 406816889 40681688944
40681689002 406816890 40681689002
40681689170 406816891 40681689170
40681689248 406816892 40681689248
40681689316 406816893 40681689316
40681689484 406816894 40681689484
40681689552 406816895 40681689552
40681689620 406816896 40681689620
40681689798 406816897 40681689798
40681689866 406816898 40681689866
40681689934 406816899 40681689934
So I'm not sure what to do now, the bug appears to be intermittent somehow. The code yesterday was consistently giving me a bad result and now it is consistently giving me the correct result. Since I can no longer reproduce the error I was unable to debug the problem much to arrive at a solution.
Needless to say this has become a very frustrating bug for me now. @grfrost Can you think of any reason in aparapi that might produce intermittent results like this?
Have either of you ever witnessed it spontaneously start working during one execution or more?
UPDATE: After some reflection I think I was originally testing this on Mac and not linux after all. So my conclusion is that it isnt as weird as i first though. It simply is, and always has been, a mac only issue.
@CC007 Did you have an ubuntu box going somewhere you could run it on? If possible I'd be curious to see what sort of results you get when you run it on that box?
Ok. So here is my personal conclusion, tell me if you guys agree.
I have tested locally, it works on windows and linux but the bug occurs on mac. Both @grfrost and @savaskoc saw the error but only on Mac. Therefore unless anyone has experienced this bug on non-mac systems, I am going to conclude this is a problem that only occurs on mac.
Since this is a mac only issue it seems most likely the bug is in the OSX implementation of OpenCL and not in aparapi itself.
Any reason for anyone to suspect this isnt the case?
I think so that's a bug about osx's OpenCl implementation.
@savaskoc ok that means we need to see if we can find if a bug was already filed or not. I fit was we can reference it here until it is fixed. If not we should file one. Did you file a bug report with them yet?
I did but it would be good if you file another one
@savaskoc It isnt usually good practice to file the same bug twice. But I might be able to add useful comments to the bug you filed. Do you have a link to your bug?
Bug reporter doesn't provide a link. I can attach your comments to bug file, or you can file them
@savaskoc well link me to the site where you reported it so I can poke around at least.
@freemo https://bugreport.apple.com/
@grfrost It seems that the executable does use 64bit opencl, msvcrt and kernel libraries, but the libstdC++ library is 32bit. I think that I read that this was an issue with Code::Blocks in combination with mingw, but using any of the other compilers gives the errors I posted previously. Thanks for the help though
@grfrost This is what made it run without error http://stackoverflow.com/a/6405064, so probably I don't have a 64bit libstdC++ dynamic library installed.
Also should that code only output -100?
The following code produces different results when run on the GPU vs the CPU.
The output from the above code snippet is:
expected: 214748364700 result: 4294967196
I tested this on my Macbook pro but others noticed the problem as well on other unspecified platforms. Also changin the calculate function such that 100 is a long rather than an integer with
return (long) tc * 100l;
(notice the letter l at the end of the 100) will produce the exact same incorrect results as above.