[Bounty $50] Inconsistent results between GPU and CPU when integers overflow.

freemo commented 7 years ago

The following code produces different results when run on the GPU vs the CPU.

import com.aparapi.*;

public class Main {
    public static void main(String[] args) {
        int num = 1;

        final long[] result = new long[num];
        final int start = Integer.MAX_VALUE;

        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                final int id = getGlobalId();
                result[id] = calculate(start + id);
            }
        };
        kernel.execute(num);

        System.out.println( "expected: " +  calculate(start) + " result: " + result[0]);
    }

    public static long calculate(int tc) {
        return (long) tc * 100;
    }
}

The output from the above code snippet is:

expected: 214748364700 result: 4294967196

I tested this on my Macbook pro but others noticed the problem as well on other unspecified platforms. Also changin the calculate function such that 100 is a long rather than an integer with return (long) tc * 100l; (notice the letter l at the end of the 100) will produce the exact same incorrect results as above.

freemo commented 7 years ago

IT should be noted this is not simply an issue with rolling over. if I change the calculate line to return (long) tc + 212600881053l; instead it should produce the same result mathematically using addition rather than multiplication. Despite this the program actually runs successfully with this new edit producing the following result:

expected: 214748364700 result: 214748364700

savaskoc commented 7 years ago

Hi, thanks for opening this issue to here.

I think there is something wrong in OpenCl. I am sending kernel and host code in C language. It STILL gives wrong results on GPU.

kernel.zip

CC007 commented 7 years ago

What opencl implementation are you using and what cpu/gpu does the macbook have?

savaskoc commented 7 years ago

It's default implementation in macOS. I'm using 10.12 (16A320) MacBook Pro (Retina, 13-inch, Late 2013). It has i5 4258U CPU and Iris 5100 GPU

CC007 commented 7 years ago

Are you using the amd app sdk?

CC007 commented 7 years ago

Also in the calculate can you try it wi brackets around the cast, putting the multiplication outside of the brackets?

freemo commented 7 years ago

@CC007 The issue seems to occur on linux as well and on AMD App SDK. It appears this bug is not platform specific. IF you runt he code in the original post on your local computer will probably see the bug as well. Did you try? the bug behaves very oddly for me. for example even though id is always 0 if you remove the "+ id" part int he calculate call it wont break anymore.

savaskoc commented 7 years ago

Anyone check the c code i posted?

freemo commented 7 years ago

@savaskoc Not yet but i will give it a go this evening. If your saying it produces the same incorrect results however then I expect you to be correct that it is an opencl issue directly.

savaskoc commented 7 years ago

I think this is an OpenCl issue because GPU produces incorrect results independently from platform and/or language (C, Java, Python etc.).

freemo commented 7 years ago

@savaskoc You realize the bug is producing the result but clipping it to 32 bits? To use the test I posted above as the example (the same is true for the numbers you posted) here is the breakdown.

Expected result is 214748364700 which in binary would be:

0011 0001 1111 1111 1111 1111 1111 1111 1001 1100

actual result we get is 4294967196 which in binary is:

0000 0000 1111 1111 1111 1111 1111 1111 1001 1100

Basically just drops all but the last 32 bits.

So while this is a legitimate bug, it is definitely occurring due to mishandling of 64bit variables.

savaskoc commented 7 years ago

I'm aware of that. I tried another types than long but still gives incorrect results. Maybe some GPU's can't process 64 bit data types?

freemo commented 7 years ago

@savaskoc I think there is more to it than that. I think when i was testing i tried some variants that produced some very odd results. I need to test this again to make sure I'm remembering correctly but when I removed the id variable (which in this test is always 0 anyway so shouldnt make a difference) it actually caused the correct results to be produced. Once I saw that behavior it was apparent that we were talking about a legitimate bug, I just cant confirm yet if the bug is isolated to aparapi or an OpenCL issue in general yet (I need to do more testing).

freemo commented 7 years ago

@savaskoc also as I stated in the OP if you change the operation to addition but change the value of the operand to make it mathematically equivalent, it magically works. So the problem seems to occur only on multiplication but not addition. this leads me to believe it is a genuine error rather than a hardware compatibility issue or something.

CC007 commented 7 years ago

Does opencl provide software 64bit support or hardware 64bit support (or both)?

grfrost commented 7 years ago

My guess is that savaskoc is correct. I think that the GPU OpenCL runtime does not support long.

Aparapi will detect this for doubles, surprised it does not detect this for long.

savaskoc commented 7 years ago

@grfrost If I am correct, why (long) tc + 212600881053l; line works but (long) tc * 100; not?

grfrost commented 7 years ago

Actually, I was about to retract ;) from the 1.0 spec https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/scalarDataTypes.html it does not look like cl_long is optional. So now I think I will blame Aparapi. My guess is that as the AST is built we end up with

     (long)
        | 
        *
 tc         100

Instead of

             *
         /        \
    (long)         100
        |
       tc

Can you try

return 100 * (long) tc ;

and/or

return (long)(tc+0L) * 100;

(hope my diagrams make it unscathed)

savaskoc commented 7 years ago

I doubt. even original OpenCl does not produce correct results. Check my kernel above.

CC007 commented 7 years ago

I ran the code on my computer (the code from freemo) and it runs as it should.

Using: CPU: 4670k GPU: 1070GTX Nvidia and intel openCL implementation

savaskoc commented 7 years ago

@CC007 You're saying that you get same results both GPU and CPU mode right? Can you try my kernel in C?

CC007 commented 7 years ago

I will try that one next. I also tested freemo's code by setting it up to use the cpu opencl device

savaskoc commented 7 years ago

It generated within build phase. Did you add kernel.cl to compile sources?

CC007 commented 7 years ago

@savaskoc The problem seems to be that I don't have any opencl sdk, only the driver and implementation afaik, as I am not using the AMD app SDK

CC007 commented 7 years ago

I have a pc :)

CC007 commented 7 years ago

I think you misunderstand, It wasn't the kernel.cl.h that caused a problem. I dont have the OpenCL/opencl.h, because I don't have an opencl SDK. I'm installing one now (Intel openCL sdk)

CC007 commented 7 years ago

Ok, now that that is installed, it seems that there are types used that don't come from openCL

grfrost commented 7 years ago

On my macbook pro the following yields the same error.

clang++ -framework OpenCL longtst.cpp -o longtst

#include <iostream>
#ifdef __APPLE__
#include <opencl/opencl.h>
#else
#include <CL/opencl.h>
#endif

#define DATA_SIZE 1
#define LONG_DATA_SIZE DATA_SIZE*sizeof(cl_long)
int main(int argc, char **argv){
   long out[DATA_SIZE];
   out[0]=0L;

   // How many platforms are there ?
   cl_uint platformc = 0;
   clGetPlatformIDs(0, NULL, &platformc);

   if (platformc >0){
      // Extract a list of available platforms
      cl_platform_id *platforms = new cl_platform_id[platformc];
      clGetPlatformIDs(platformc, platforms, NULL);

      cl_device_id device_id=0;
      // loop through platforms until we have a valid GPU device 
      for (unsigned int i = 0; !device_id && i < platformc; ++i) {
         clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 1, &device_id, nullptr);
      }
      delete[] platforms;

      // only device_id context and command queue needed below

      if (device_id){
         cl_int err;

         // Create a context
         cl_context context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

         // Create command queue for this context
         cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &err);

         // Here is our OpenCL kernel source for
         const char *source = 
            "__constant int start = 2147483647;\n"
            "static long calculate(int tc){\n"
            "    return (long)tc * 100;\n"
            "}\n"
            "__kernel void longtst(__global long *result){\n"
            "   int id = get_global_id(0);\n"
            "   result[id] = calculate(start + id);\n"
            "}\n";

         // Compile source 
         cl_program program = clCreateProgramWithSource(context, 1, (const char **) &source, NULL, &err);
         err = clBuildProgram(program, 1, &(device_id), NULL, NULL, NULL);

         // Extract and show any compile errors or warnings
         if (err != CL_SUCCESS){
            size_t len;
            err = clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
            if (len >0){
              len++; // for '\0'
              char *compile_log = (char *) malloc(len);
              clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, len, (void *)compile_log, NULL);
              std::cerr <<"log{"<<std::endl<< compile_log << std::endl<<"}"<<std::endl;
              free (compile_log);
            }
         }

         // A program can have more than one kernel, select the kernel we want to call
         cl_kernel kernel = clCreateKernel(program, "longtst", &err);

         // Create buffers which 'wrap' the host data
         cl_mem outBuf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_WRITE_ONLY, LONG_DATA_SIZE, (void*)out, &err);

         // Set any kernel args
         err = clSetKernelArg(kernel, 0 , sizeof(cl_mem), &outBuf);

         // An event list helps us dispatch efficiently
#define EVENTS 2
         cl_event *events = new cl_event[EVENTS];

         // Decide how to partition the execution (we choose 1 group 1 threads)
         size_t globalRange = DATA_SIZE;
         size_t localRange = DATA_SIZE;

         // Enqueue the execution
         err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalRange, &localRange, 0, NULL, &events[1]);

         // Enqueue a read of 'out' data to the command queue
         err = clEnqueueReadBuffer(command_queue, outBuf, CL_TRUE, 0, LONG_DATA_SIZE, out, 1, &events[1], &events[0]);  

         // Wait for all the dispatches to complete.
         err = clWaitForEvents(EVENTS, events);

         // Release and delete the events 
         for (int i=0; i<EVENTS; i++){
            err = clReleaseEvent(events[i]);
         }
         delete[] events;

         // Release mem objects
         clReleaseMemObject(outBuf);

         // Release Kernel
         clReleaseKernel(kernel);

         // Release Program
         clReleaseProgram(program);

         // Release Context
         clReleaseContext(context);

         // Release Command Queue
         clReleaseCommandQueue(command_queue);

         // Note that we don't releaese any type ending in _id

         std::cout << out[0] << std::endl;
      }
   }
}

grfrost commented 7 years ago

BTW If I switch to use CPU device the above code yields the correct result. So yes I think there is an OpenCL runtime GPU issue here

Retina 15 inch 2.8 Ghz Intel I7, Intel Iris Pro Graphics.

CC007 commented 7 years ago

I tried to compile your example using: g++ -lOpenCL -o longtst main.cpp -I"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\include" -std=c++11 -L"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\lib\x64" and got the following errors:

||=== Build: Debug in OpenCLDemo (compiler: GNU GCC Compiler) ===|
main.cpp||In function 'int main(int, char**)':|
main.cpp|40|warning: '_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)' is deprecated (declared at P:\Program Files (x86)\Intel\OpenCL SDK\6.3\include/CL/cl.h:1428) [-Wdeprecated-declarations]|
main.cpp|40|warning: '_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)' is deprecated (declared at P:\Program Files (x86)\Intel\OpenCL SDK\6.3\include/CL/cl.h:1428) [-Wdeprecated-declarations]|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetPlatformIDs@12'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetPlatformIDs@12'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetDeviceIDs@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateContext@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateCommandQueue@20'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateProgramWithSource@20'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clBuildProgram@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetProgramBuildInfo@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetProgramBuildInfo@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateKernel@12'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateBuffer@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clSetKernelArg@16'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clEnqueueNDRangeKernel@36'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clEnqueueReadBuffer@36'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clWaitForEvents@8'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseEvent@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseMemObject@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseKernel@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseProgram@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseContext@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseCommandQueue@4'|
||=== Build failed: 21 error(s), 2 warning(s) (0 minute(s), 0 second(s)) ===|

grfrost commented 7 years ago

You are failing to link.

Check the dir where you 'think' your opencl.so is.

Also could it be that you are pointing to 64 bit lib and creating a 32 bit app?

Gary

grfrost commented 7 years ago

Also I think the -l option has to appear after the L

CC007 commented 7 years ago

I tried linking to the 32bit library as well and tried with the -l after the -L. After installing mingw-w64 or using cygwin g++ it had the same problems

grfrost commented 7 years ago

Is there a suitable opencl.dll lib in "P:\Program Files (x86)\Intel\OpenCL SDK\6.3\lib\x64" ?

You must have an opencl dll on your machine if you have ever used Aparapi ;)

grfrost commented 7 years ago

What is your command line now?

CC007 commented 7 years ago

I do get an executable now, but it gives the following error: the application was unable to start correctly 0xc00007b

I tried a couple of the fixes that I found online, but they don't seem to fix the issue. I am using C::B IDE v16.01, running the mingw-w64 compiler v6.2.0 from Mingw-builds, Intel OpenCL SDK v6.3, trying to compile grfrost's code with the commands:

x86_64-w64-mingw32-g++.exe -Wall -g -std=c++11 -c -I"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\include" -L"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\lib\x64" -o obj\Debug\main.o main.cpp -lOpenCL

x86_64-w64-mingw32-g++.exe -L"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\lib\x64" -std=c++11 -o bin\Debug\OpenCLDemo.exe obj\Debug\main.o -lOpenCL

I tried netbean's c/c++ plugin as well but that was a disaster. Unless someone else knows the solution, I won't be trying to run the code to see if the overflow problem occurs.

grfrost commented 7 years ago

From this page - https://answers.microsoft.com/en-us/windows/forum/windows_7-performance/error-the-application-was-unable-to-start/05a2b904-3f61-4d08-94d6-e2ff92161111?auth=1

"This error is most likely a result of a 32-bit (x86) executable trying to load a 64-bit (x64) DLL. You might have to adjust your PATH or copy DLLs to avoid this. For example, my PATH is set up to find the x64 version of d3dcompiler_46.dll (for DX11.1)."

Maybe your path has 32 dll's in it.

I don't really use Windows these days, there is a great tool for debugging this sort of thing called http://www.dependencywalker.com/.

If you install it and then point it at your executable, it will show you the dll's it is trying to load and thus the error.

freemo commented 7 years ago

Ok so few things.

@grfrost The DLL for Aparapi itself no longer needs to be installed. It is loaded into the path dynamically by the aparapi-jni project which is a dependency on aparapi. So if this particular DLL were the problem (loading the 32bit version on a 64 bit machine) then the problem would still be with aparapi rather than the user. However in the past on windows we have seen an error where if you try to load the 32 bit dll on a 64 bit system it refuses to load and throws an exception. According to @CC007 however the error discussed in this bug doesnt occur on windows but it has been observed on both Mac and Linux. The Mac dylib is 64bit only.

Note that OpenCL still needs to be installed manually of course. It is just the aparapi dll that no longer needs to be installed manually.

So based on these details I suspect it isnt the aparapi native library that is the issue. It doesnt rule out that a 32bit of opencl was installed or something of course. But the fact that all of us were able to get the error on mac and linux seems suspicious to this point (unless we all made the same mistake when setting up our environment?).

Another point I'm going to explore in a few minutes is to confirm @CC007 comment that it works on windows. This seems like an important detail that might help us to debug.

Another point to consider if this is a 32bit dll issue is how the dll to be loaded is chosen by aparapi-jni. It uses the arch property to determine if the system is 32bit or 64bit. However one caveat I have not been able to test yet is a 32bit JVM on a 64bit system. In this case the arch would be reported as 32bit on windows. Ergo this may be part of the problem. However since old aparapi (before aparapi-jni was written) still had this problem I am not convinced this is the issue.

Anyway, just all things to consider. I am investigating this as we speak to see if i can find any more clues.

freemo commented 7 years ago

So I just spent over an hour trying to tinker with this problem again and my results this time are more perplexing than the last.

First I went over to my windows box and ran the above code and it did not produce the error we had been seeing. Excited this might be a clue I wanted to double check my work and headed back over to linux and ran the original code I pasted in the first post of this issue. To my disbelief it no longer produced the erroneous results either, it appears to have magically started working. This was odd as the bug was consistently produced when I tried running this code yesterday.

Thinking this may have been an a problem with my use case; perhaps I made a mistake when i copy and pasted it to the issue. So I loaded the original code supplied by @savaskoc again to re-rerun it. Again to my amazement it no longer produced the bug on either linux or windows for me. The following is the code I just ran on linux that is now working but previously was not:

public class Main {
    public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {
        int num = 406816900 - 406816880;

        TCKernel kernel = new TCKernel(406816880, num);
        kernel.execute(num);

        PrintWriter writer = new PrintWriter("numbers.txt");
        kernel.saveResults(writer);
        writer.flush();
    }

    public static class TCKernel extends Kernel {
        long[] result;
        int start;

        public TCKernel(int start, int num) {
            this.result = new long[num];
            this.start = start;
        }

        @Override
        public void run() {
            result[getGlobalId()] = calculate(start + getGlobalId());
        }

        public long calculate(int tc) {
            int num = tc;
            int n9 = num % 10;
            num /= 10;
            int n8 = num % 10;
            num /= 10;
            int n7 = num % 10;
            num /= 10;
            int n6 = num % 10;
            num /= 10;
            int n5 = num % 10;
            num /= 10;
            int n4 = num % 10;
            num /= 10;
            int n3 = num % 10;
            num /= 10;
            int n2 = num % 10;
            num /= 10;
            int n1 = num % 10;

            int odds = n1 + n3 + n5 + n7 + n9;
            int evens = n2 + n4 + n6 + n8;

            int n10 = (odds * 7 - evens) % 10;
            int n11 = (odds + evens + n10) % 10;

            return (long) tc * 100 + (n10 * 10 + n11);
        }

        public void saveResults(PrintWriter writer) {
            writer.println("Result\t\tNum\t\tExpected");
            for (int i = 0; i < result.length; i++) {
                int tc = start + i;
                writer.printf("%d\t%d\t%d%s", result[i], tc, calculate(tc), System.lineSeparator());
            }
        }
    }
}

Result      Num     Expected
40681688012 406816880   40681688012
40681688180 406816881   40681688180
40681688258 406816882   40681688258
40681688326 406816883   40681688326
40681688494 406816884   40681688494
40681688562 406816885   40681688562
40681688630 406816886   40681688630
40681688708 406816887   40681688708
40681688876 406816888   40681688876
40681688944 406816889   40681688944
40681689002 406816890   40681689002
40681689170 406816891   40681689170
40681689248 406816892   40681689248
40681689316 406816893   40681689316
40681689484 406816894   40681689484
40681689552 406816895   40681689552
40681689620 406816896   40681689620
40681689798 406816897   40681689798
40681689866 406816898   40681689866
40681689934 406816899   40681689934

So I'm not sure what to do now, the bug appears to be intermittent somehow. The code yesterday was consistently giving me a bad result and now it is consistently giving me the correct result. Since I can no longer reproduce the error I was unable to debug the problem much to arrive at a solution.

Needless to say this has become a very frustrating bug for me now. @grfrost Can you think of any reason in aparapi that might produce intermittent results like this?

Have either of you ever witnessed it spontaneously start working during one execution or more?

UPDATE: After some reflection I think I was originally testing this on Mac and not linux after all. So my conclusion is that it isnt as weird as i first though. It simply is, and always has been, a mac only issue.

freemo commented 7 years ago

@CC007 Did you have an ubuntu box going somewhere you could run it on? If possible I'd be curious to see what sort of results you get when you run it on that box?

freemo commented 7 years ago

Ok. So here is my personal conclusion, tell me if you guys agree.

I have tested locally, it works on windows and linux but the bug occurs on mac. Both @grfrost and @savaskoc saw the error but only on Mac. Therefore unless anyone has experienced this bug on non-mac systems, I am going to conclude this is a problem that only occurs on mac.

Since this is a mac only issue it seems most likely the bug is in the OSX implementation of OpenCL and not in aparapi itself.

Any reason for anyone to suspect this isnt the case?

savaskoc commented 7 years ago

I think so that's a bug about osx's OpenCl implementation.

freemo commented 7 years ago

@savaskoc ok that means we need to see if we can find if a bug was already filed or not. I fit was we can reference it here until it is fixed. If not we should file one. Did you file a bug report with them yet?

savaskoc commented 7 years ago

I did but it would be good if you file another one

freemo commented 7 years ago

@savaskoc It isnt usually good practice to file the same bug twice. But I might be able to add useful comments to the bug you filed. Do you have a link to your bug?

savaskoc commented 7 years ago

Bug reporter doesn't provide a link. I can attach your comments to bug file, or you can file them

freemo commented 7 years ago

@savaskoc well link me to the site where you reported it so I can poke around at least.

savaskoc commented 7 years ago

@freemo https://bugreport.apple.com/

CC007 commented 7 years ago

@grfrost It seems that the executable does use 64bit opencl, msvcrt and kernel libraries, but the libstdC++ library is 32bit. I think that I read that this was an issue with Code::Blocks in combination with mingw, but using any of the other compilers gives the errors I posted previously. Thanks for the help though

CC007 commented 7 years ago

@grfrost This is what made it run without error http://stackoverflow.com/a/6405064, so probably I don't have a 64bit libstdC++ dynamic library installed.

Also should that code only output -100?

Syncleus / aparapi

[Bounty $50] Inconsistent results between GPU and CPU when integers overflow. #38