clMathLibraries / clBLAS

A software library containing BLAS functions written in OpenCL
Apache License 2.0

clBLASgemm: "mixed vector-scalar operation not allowed unless up-convertable" on Hawaii #89

Closed jyegerlehner closed 9 years ago

jyegerlehner commented 9 years ago

Related to this Caffe PR, when I run the Caffe tests on an R9-290X, I see a test failure, apparently while trying to compile the clBLAS kernel for clBLASgemm. This is the error. If I instead run the same tests on the same machine but direct them to use the 7950/Tahiti device, all the tests pass. So is clBLASgemm broken on Hawaii? Also, if I run the tests choosing the CPU (FX6350) as the OpenCL device, I get the same compile error as when I specify the Hawaii device. This is what clinfo shows on the machine, an all-AMD Ubuntu OpenCL machine with no CUDA.

To run these tests, I'm building what's in this branch. I expect the same problem would manifest with @lunochod 's branch, but his branch doesn't have the ability to specify which OpenCL device the tests should run on, whereas mine does; on his branch the device defaults to the first GPU, which is the Tahiti device, so I can't verify there. Specifically, when you run test.testbin you add one command-line argument to specify the device. In my case 0 (the default) is the CPU, 1 is the Tahiti device, and 2 is the Hawaii card. If someone wants to reproduce this by building and running the tests, you will need to install all the Caffe prerequisites first (except CUDA), and build using CMake, not the Makefile that comes in the caffe directory. I can help if you have questions about that.

TimmyLiu commented 9 years ago

Hi, thanks for reporting the issue! I believe the error you are seeing is related to the newer compiler being more strict about casting, similar to previously reported issue #62 and issue #53. I am going to look into this. Can you share the parameters (M, N, K, transA, transB, order, alpha, beta) of the sgemm call that you are running into this problem with, so I can reproduce it without running Caffe?

With that being said, there might be a better solution for you. Currently in the develop branch we have enabled offline compilation (instead of generating and compiling kernels at runtime) for the Hawaii device. A subset of sgemm (mostly column major NN, NT, TN) supports this feature. And it looks like you have one of the latest drivers and a Hawaii device. You can choose to enable offline compilation from cmake, following the instructions here https://github.com/clMathLibraries/clBLAS/wiki/Build under "offline kernel compilation". This should give you better performance, since the dynamically generated kernels were written for previous generations of GPUs. We are working on including all of sgemm in this feature. Hopefully the problem size you are using is supported.

jyegerlehner commented 9 years ago

Thanks for the reply.

OK, here are the dimensions that were used in cases that failed.

E0424 14:22:00.263455  6639 OpenCLSupport.cpp:1534] clblasSgemm() failed on GPU Hawaii
clblasRowMajor 
TransA: 0
TransB: 1
m: 5
n: 1
k:300
beta: 0

clblasRowMajor 
TransA: 0
TransB: 1
m: 10
n: 10
k:10
beta: 0

E0424 15:44:12.684367  6639 OpenCLSupport.cpp:1534] clblasSgemm() failed on GPU Hawaii
clblasRowMajor 
TransA: 0
TransB: 1
m: 2
n: 10
k:60
beta: 0

Not all the compilation errors and warnings were the same as the ones I posted before. Here's the more complete set:


Build log:
================
"/tmp/OCL6639T1422.cl", line 12: warning: OpenCL extension is now part of core
  #pragma OPENCL EXTENSION cl_khr_fp64 : enable
                           ^

"/tmp/OCL6639T1422.cl", line 120: error: mixed vector-scalar operation not
          allowed unless up-convertable(scalar-type=>vector-element-type)
              const uint2 bk = ((uint2)(0, 1) + (( get_group_id(0)*0 + k ) >> 1)) % vKB;
                                                ^

"/tmp/OCL6639T1422.cl", line 396: warning: a value of type "__global double2 *"
          cannot be assigned to an entity of type "__global double *"
                          uC.d = C + (coordY * ldc + coordX)/2;
                               ^

"/tmp/OCL6639T1422.cl", line 420: warning: a value of type "__global double2 *"
          cannot be assigned to an entity of type "__global double *"
                          uC.d = C + (coordY * ldc + coordX)/2;
                               ^

"/tmp/OCL6639T1422.cl", line 417: warning: variable "i" was declared but never
          referenced
                          int i, j;
                              ^

"/tmp/OCL6639T1422.cl", line 417: warning: variable "j" was declared but never
          referenced
                          int i, j;
                                 ^

"/tmp/OCL6639T1422.cl", line 418: warning: variable "res" was declared but
          never referenced
                          PPtr res;
                               ^

=====================
"/tmp/OCL6639T1359.cl", line 99: error: mixed vector-scalar operation not
          allowed unless up-convertable(scalar-type=>vector-element-type)
              const uint4 bk = ((uint4)(0, 1, 2, 3) + (( get_group_id(0)*0 + k ) >> 1)) % vKB;
                                                      ^

"/tmp/OCL6639T1359.cl", line 567: warning: a value of type "__global float2 *"
          cannot be assigned to an entity of type "__global float *"
                          uC.f = C + (coordY * ldc + coordX)/2;
                               ^

"/tmp/OCL6639T1359.cl", line 591: warning: a value of type "__global float2 *"
          cannot be assigned to an entity of type "__global float *"
                          uC.f = C + (coordY * ldc + coordX)/2;
                               ^

"/tmp/OCL6639T1359.cl", line 588: warning: variable "i" was declared but never
          referenced
                          int i, j;
                              ^

"/tmp/OCL6639T1359.cl", line 588: warning: variable "j" was declared but never
          referenced
                          int i, j;
                                 ^

"/tmp/OCL6639T1359.cl", line 589: warning: variable "res" was declared but
          never referenced
                          PPtr res;
                               ^
=======================
Build log:

"/tmp/OCL6639T6821.cl", line 93: error: mixed vector-scalar operation not
          allowed unless up-convertable(scalar-type=>vector-element-type)
              const uint2 bk = ((uint2)(0, 1) + (( get_group_id(0)*0 + k ) >> 2)) % vKB;
                                                ^

"/tmp/OCL6639T6821.cl", line 282: warning: a value of type "__global float2 *"
          cannot be assigned to an entity of type "__global float *"
                          uC.f = C + (coordY * ldc + coordX)/2;
                               ^

"/tmp/OCL6639T6821.cl", line 297: warning: a value of type "__global float2 *"
          cannot be assigned to an entity of type "__global float *"
                          uC.f = C + (coordY * ldc + coordX)/2;
                               ^

"/tmp/OCL6639T6821.cl", line 294: warning: variable "i" was declared but never
          referenced
                          int i, j;
                              ^

"/tmp/OCL6639T6821.cl", line 294: warning: variable "j" was declared but never
          referenced
                          int i, j;
                                 ^

"/tmp/OCL6639T6821.cl", line 295: warning: variable "res" was declared but
          never referenced
                          PPtr res;
                               ^
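For what it's worth, the error itself looks like a strictness issue around implicit scalar-to-vector conversion: in OpenCL C, get_group_id() returns size_t, so the scalar operand is 64-bit, and a 64-bit scalar cannot be implicitly up-converted to the 32-bit element type of uint2. A sketch of the offending pattern and an explicit-cast fix (OpenCL kernel fragment only, not runnable on its own):

```c
/* Rejected: uint2 + (size_t scalar). The scalar type is wider than the
 * vector element type, so no implicit up-conversion is allowed. */
const uint2 bk = ((uint2)(0, 1) + (( get_group_id(0)*0 + k ) >> 1)) % vKB;

/* Accepted: cast the scalar down to uint first; scalar-to-vector promotion
 * is then a legal up-conversion. */
const uint2 bk = ((uint2)(0, 1) + (uint)(( (uint)get_group_id(0)*0 + k ) >> 1)) % vKB;
```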

Currently in the develop branch we have enabled offline compilation (instead of generating kernels at runtime and compile it at runtime) for Hawaii device. A subset of sgemm (mostly column major NN, NT, TN) supports this feature.

We're row major, so I guess I'll skip this. My situation is that I'm developing a library that needs to work anywhere, for anyone with standards-compliant OCL devices and drivers. Special-case work-arounds are not really of interest as a long-term solution. Surely the kernel compilation only has to happen once.

Hopefully the problem size you are using is supported.

The problem size I'm using is any size at all. People can build whatever neural nets they want with Caffe, so the size of the problem is completely unconstrained. Libraries such as Caffe and clBLAS need to be general and robust, not a pile of special cases littered with landmines waiting to blow up when someone does something that wasn't what the developers imagined they might do.

TimmyLiu commented 9 years ago

Thanks for sending the parameters. I need to grab the same driver as you to reproduce the error. Are you using Catalyst Omega?

Actually, by "mostly column major NN, NT, TN" I meant row major TT, TN and NT are also supported. The only subsets that are not supported are column major TT and row major NN. The matrix sizes you are running are pretty small, and online compilation (clBuildProgram) will probably be a bottleneck for you. Thus I still recommend giving the offline compilation a try. For non-Hawaii devices the library should fall back to the dynamic kernels.

jyegerlehner commented 9 years ago

Actually, by "mostly column major NN, NT, TN" I meant row major TT, TN and NT are also supported. The only subsets that are not supported are column major TT and row major NN.

OK thanks; maybe I'll go give it a whirl.

jyegerlehner commented 9 years ago

Are you using Catalyst Omega?

I doubt it. Can you tell from the clinfo output? It's the latest fglrx packaged for Ubuntu, except I think it's in the proposed/pre-release update packages, not the recommended/released ones. The released ones didn't work. I think Catalyst Omega requires the AMD installer, which has always broken the package management system, and I end up spending too much time trying to get video working again and getting rid of broken packages. So I tend to shy away from that.

TimmyLiu commented 9 years ago

I see your driver version is 1729.3. Internally I grabbed 1642 and 1778 and could not reproduce the issue. I did this on Windows. I guess I will switch to Linux and see if I can reproduce the error there.

TimmyLiu commented 9 years ago

Even though I am not able to reproduce the crash, I think the issue might have already been fixed in the current develop and master branches. I ran "make-ktest --function sgemm -M 5 -N 1 -K 300 --transB t --beta 0" on release 2.2, and I can see the dumped kernel has "const uint2 bk = ((uint2)(0, 1) + (( get_group_id(0)*0 + k ) >> 2)) % vKB;", which is the line that was reported as an error for you. I did the same on the current develop branch, and there the dumped kernel generated "const uint2 bk = ((uint2)(0, 1) + (uint)(( (uint)(get_group_id(0))*0 + k ) >> 2)) % vKB;" instead.

Can you check out the latest code and verify?

jyegerlehner commented 9 years ago

I pulled the develop branch, switched on Hawaii and Tahiti offline build in CMakeLists.txt:

option( OCL_OFFLINE_BUILD_HAWAII_KERNEL "Offline compile the OpenCL kernels for Hawaii device" ON)
option( OCL_OFFLINE_BUILD_BONAIRE_KERNEL "Offline compile the OpenCL kernels for Bonaire device" OFF)
option( OCL_OFFLINE_BUILD_TAHITI_KERNEL "Offline compile the OpenCL kernels for Tahiti device" ON)

rebuilt, deployed the clBLAS lib, and re-ran the Caffe tests. When executing the tests on the Tahiti device, the tests still passed (except for some timer test I'm not too worried about), but a large number of "we don't support clblas on 32 bits" messages were printed out. This is a 64-bit machine and OS.

In the case of running the tests on Hawaii, there weren't any kernel compilation errors, of course. But the test application exits every time in the middle of one test, part way through DeconvolutionLayerTest/2.TestGradient. Something must be sending it a SIGKILL; if it were any interceptable signal, the google test main app would tell us what it was. It just exits. Since the only difference is which GPU we are executing on, I'm thinking the problem must be somewhere in clBLAS, the OCL driver stack, or those offline-built kernels.

TimmyLiu commented 9 years ago

Hawaii and Tahiti compile different sets of kernels, with different flags. For the Hawaii test cases, what sgemm parameters cause the exit? I don't have enough knowledge of the caffe code. Can you post the host code?

jyegerlehner commented 9 years ago

For the hawaii test cases, what sgemm parameter causes the exit?

Will have to get back to you on that. I'm starting to wonder if I have a corrupt driver installation. Perhaps I'll purge fglrx and try the latest Catalyst to see if things behave more sanely.

Can you post the host code?

Not feasibly in a single post. It's here.

TimmyLiu commented 9 years ago

Have you tried turning the offline compilation for Hawaii off with the latest code? I think the casting issue might have already been fixed in the latest develop branch.

jyegerlehner commented 9 years ago

Have you tried turning the offline compilation for Hawaii off with the latest code?

I have not tried that. What I did was uninstall fglrx from the Canonical repositories and install the Catalyst debs from AMD. After doing so, clinfo shows Platform Version: OpenCL 2.0 AMD-APP (1642.5), which is an older version than I had from the Ubuntu packages. I guess this must not be the Omega beta, since it is older than what I had before.

Anyway, running the Caffe tests with this fglrx version (1642.5) and the develop branch of clBLAS, I saw the same two problems named above. To reiterate: 1) when running the Caffe tests on the Tahiti device, the clBLAS library prints out oceans of "we don't support clblas on 32 bits" (or something similar to that). I don't know why, as this is a 64-bit machine and OS. But otherwise the Caffe tests pass. 2) When running the same tests on the Hawaii device, they fail in the Caffe DeconvolutionLayer gradient test; the test is killed by a SIGKILL. Whence the signal comes, I know not.

You asked earlier for the arguments passed to the clBLASgemm call that causes the Caffe tests to be killed. I don't think it matters, because if you tested them, I'm sure the call would execute without failing. In the course of the Caffe DeconvolutionLayer gradient test, clBLASgemm is called hundreds if not thousands of times. What I observed is that the test runs for a while, with many successful calls to clBLASgemm; then the machine gradually becomes unresponsive, and ultimately the test app is killed, with SIGKILL I presume. clBLASgemm is called with those same arguments many times, and succeeds, before it ultimately fails.

The next thing I did was rebuild the clBLAS library from the master branch (again) and reinstall it. With that version of clBLAS, still on the 1642.5 OCL drivers, both the Tahiti and Hawaii devices completed the Caffe tests without failing. So of the four combinations of OCL drivers and clBLAS versions I tried, exactly one allows the Caffe unit tests to pass: Catalyst 1642.5 and the master branch of clBLAS. All other combinations failed as described above.

So those are my observations in a nutshell. I hope that's helpful.

Thanks for the help,

Jim

TimmyLiu commented 9 years ago

I am glad at least one version is working for you. I am going to close this issue. I suspect the crash you are seeing was caused by a memory leak somewhere in the application, the library, or the driver stack. I want to build your branch of caffe and dig into it. Hope you don't mind if I open issues on your branch to ask questions.

Timmy

jyegerlehner commented 9 years ago

Sure, that sounds good. Please keep in mind this is mostly not my code but rather that of lunochod, as you can see in the original Caffe PR. I don't understand it as well as he does, so I may not always have good answers. I could certainly have created new bugs in my branch; you might consider starting from his fork instead of mine and thereby bypass any bugs I created. But feel free to post issues on my fork if you want. We could swap email addresses if you'd prefer to communicate that way for any reason.

bhack commented 9 years ago

@TimmyLiu You can also send more general comments at https://github.com/BVLC/caffe/pull/2195

lunochod commented 9 years ago

@TimmyLiu @jyegerlehner

I ran into the same issue ("we don't support clblas on 32 bits"). I had a look at the code; the 64-bitness is queried from the OpenCL runtime / device used. I used the following work-around to switch the OpenCL runtime to 64-bit:

export GPU_FORCE_64BIT_PTR=1

To see the change take effect execute:

clinfo | grep Addr

BenjaminCoquelle commented 9 years ago

Hi. With a recent driver (actually since the 14.30 branch), our runtime will compile 64-bit ISA for 64-bit host applications and 32-bit ISA for 32-bit applications. AFAIK clinfo is a 32-bit application and thus will always report 32 bits. When compiling clBLAS, make sure you compile it in 64 bits.

Benjamin

lunochod commented 9 years ago

This is not related to the bitness of clBLAS. The clBLAS code in functor.cc calls:

cl_uint error = clGetDeviceInfo(device, CL_DEVICE_ADDRESS_BITS, sizeof(cl_uint), &bitness, NULL);

which returns the default size of the address space of the GPU as reported by the runtime.
