CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0
1.06k stars 202 forks source link

Error -2048 if context is released and acquired again #47

Closed blueberry closed 8 years ago

blueberry commented 8 years ago

This is the actual cause of previously reported Issue #43

I am using CLBlast (development branch) through JOCLBlast, but with enough experience, I pretty much suspect that this issue is due to CLBlast caching (although I didn't try this directly in C++ since my C++ foo is too rusty).

What's happening:

  1. I create device, context and queue, and use them to call some (JO)CLBlast function - works perfectly
  2. I release the queue, create it again, and then call (JO)CLBlast - works perfectly
  3. I release the queue and the context, create both again, and call (JO)CLBlast - I get a -2048 error

It seems to me that somehow the old, cached stuff gets mixed up with the newly provided queue.

blueberry commented 8 years ago

I suspect the problem is if the cache uses device as id for programs (compiled kernels), while clCreateProgramWithSource uses context. Then, in the cache, the program is found because the device id is the same, but the context that was used to create the program no longer exists!

CNugteren commented 8 years ago

But does the context matter? The binary is compiled for a specific device, not for a specific context. Anyway, I tried to reproduce it in a small C example, but I didn't manage to make it crash. I took the samples/sgemm.c program and added the following at line 94 (after printing Completed with status but before the clean-up):

  clReleaseCommandQueue(queue);
  clReleaseContext(context);
  cl_context context2 = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
  cl_command_queue queue2 = clCreateCommandQueue(context2, device, 0, NULL);
  StatusCode status2 = CLBlastSgemm(kRowMajor, kNo, kNo,
                                    m, n, k,
                                    alpha,
                                    device_a, 0, a_ld,
                                    device_b, 0, b_ld,
                                    beta,
                                    device_c, 0, c_ld,
                                    &queue2, &event);
  clWaitForEvents(1, &event);
  printf("Second run (binary taken from cache) completed with status %d\n", status);

If you could provide me with a C or C++ example that has this behaviour, that would be really helpful.

CNugteren commented 8 years ago

There is now also a ClearCompiledProgramCache() (C++) and CLBlastClearCompiledProgramCache() (C) function in case you want to double-check whether it is a cache-related issue or not.

blueberry commented 8 years ago

I am out of town this week, but as soon as I get back I will investigate this.

gpu commented 8 years ago

I had to look it up, so summarize it here shortly: The error code is defined in clblast.h, as

  // Custom additional status codes for CLBlast
  kKernelLaunchError         = -2048, // Problem occurred when enqueuing the kernel

As far as I see, it only apears in routine.cc, as

  // Launches the kernel (and checks for launch errors)
  try {
    kernel.Launch(queue_, global, local, event, waitForEvents);
  } catch (...) { return StatusCode::kKernelLaunchError; }

The kernel.Launch call ends in clpp11.h, where (in all paths) the results of the innermost calls of clEnqueueNDRangeKernel are passed through the CheckError function, which is implemented as

// Error occurred in OpenCL
inline void CheckError(const cl_int status) {
  if (status != CL_SUCCESS) {
    throw std::runtime_error("Internal OpenCL error: "+std::to_string(status));
  }
}

So maybe I misunderstood something here, but this looks like whatever error is caused by clEnqueueNDRangeKernel, it will cause an exception to be thrown, which eventually will show up as the kKernelLaunchError. The original error code will be lost in translation. Maybe there is a way to preserve this error code? Like throwing an own exception type that stores the actual CL error code, and allows changing the catch block to something like

catch (MySpecialException& e) { return e.getStoredErrorCodeFromEnqueueCall(); }
catch (...) { return StatusCode::kKernelLaunchError; } // As a last resort fallback

(Just a guess, not sure whether this makes sense or really works this way)


Regarding the actual issue or its reason: I just tried this out, using a slightly modified version of the SGEMM example for JOCLBlast (roughly like Cedric described above), and also received the -2048.

But...

destroying the context is a real nuke. From my understanding, it will really desctroy everything. So if you destroy the context, then you do not only have to re-create the command queue, but also all memory objects - so, basically do a complete "shutdown and restart" (and I suspect that this would work, although I didn't test it explicitly).

So @blueberry could you provide a few lines of example code showing how you intended to manage this context change? (Maybe even in Clojure, now I might even be able to try this out ;-))


EDIT: @CNugteren In https://github.com/CNugteren/CLBlast/issues/47#issuecomment-215074336 you printed the status, and not the status2 - I haven't yet tried this in "plain" CLBlast, but this may be important here.

blueberry commented 8 years ago

@gpu I do (re)create everything. And it works ok with my kernels that are 2.0 specific, so this issue is somehow related to clblast. Basically, All my vectors and matrices have an engine attached and all live in a certain context that is provided by a caller. The idea is to avoid global singleton context but to leave to the caller to decide how the context and queue are managed.

At least we identified that the problem is either in JOCLBlast or in CLBlast if status2 is -2048 in cedric's example.

I'll prepare a clojure project for you.

blueberry commented 8 years ago

@gpu Actually, there are examples in the test folder and even a tutorial here http://neanderthal.uncomplicate.org/articles/tutorial_opencl.html To run it with the currently released Neanderthal 0.5.0, you'll need OpenCL 2.0 capable device.

gpu commented 8 years ago

Don't put tooo much effort into this: After modifying the sgemm.c example as desribed in https://github.com/CNugteren/CLBlast/issues/47#issuecomment-215074336 , and blatantly adding an output in CheckError...

// Error occurred in OpenCL
inline void CheckError(const cl_int status) {
  if (status != CL_SUCCESS) {
    printf("Return code is %d\n", status);
    throw std::runtime_error("Internal OpenCL error: "+std::to_string(status));
  }
}

it prints

Completed with event 0000005C9B1CDE60
Completed with status 0
Return code is -34
Second run (binary taken from cache) completed with status -2048

(printing the status2 here, of course). So the error code -34 (which is CL_INVALID_CONTEXT) shows up as -2048, as described above. I tried to figure out where it might get this invalid context from, but got a bit lost in the C++ structures (Routine, Queue etc) - I'd have to read this more thoroughly


A minor thing (not worth an own issue, I guess) : In clpp11.h, line 636, at the call

CheckError(clEnqueueNDRangeKernel(queue(), *kernel_, static_cast<cl_uint>(global.size()),
                                  nullptr, global.data(), local.data(),
                                  waitForEventsPlain.size(), waitForEventsPlain.data(),
                                  event));

it compains about the conversion from size_t to cl_int. It should be static_cast<cl_uint>(waitForEventsPlain.size()) (similar to the cast of global.size()).

CNugteren commented 8 years ago

Thanks for looking into this, both of you. Indeed, I checked the wrong status previously. I now extended the example at line 94 with the following:

  // Test code for a second call with a new context
  clReleaseMemObject(device_a);
  clReleaseMemObject(device_b);
  clReleaseMemObject(device_c);
  clReleaseCommandQueue(queue);
  clReleaseContext(context);
  cl_context context2 = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
  cl_command_queue queue2 = clCreateCommandQueue(context2, device, 0, NULL);
  cl_mem device_a2 = clCreateBuffer(context2, CL_MEM_READ_WRITE, m*k*sizeof(float), NULL, NULL);
  cl_mem device_b2 = clCreateBuffer(context2, CL_MEM_READ_WRITE, n*k*sizeof(float), NULL, NULL);
  cl_mem device_c2 = clCreateBuffer(context2, CL_MEM_READ_WRITE, m*n*sizeof(float), NULL, NULL);
  clEnqueueWriteBuffer(queue2, device_a2, CL_TRUE, 0, m*k*sizeof(float), host_a, 0, NULL, NULL);
  clEnqueueWriteBuffer(queue2, device_b2, CL_TRUE, 0, n*k*sizeof(float), host_b, 0, NULL, NULL);
  clEnqueueWriteBuffer(queue2, device_c2, CL_TRUE, 0, m*n*sizeof(float), host_c, 0, NULL, NULL);
  CLBlastClearCompiledProgramCache(); // This line is needed
  StatusCode status2 = CLBlastSgemm(kRowMajor, kNo, kNo,
                                    m, n, k,
                                    alpha,
                                    device_a2, 0, a_ld,
                                    device_b2, 0, b_ld,
                                    beta,
                                    device_c2, 0, c_ld,
                                    &queue2, &event);
  clWaitForEvents(1, &event);
  printf("Second run completed with status %d\n", status2);

I now get -48 if I omit CLBlastClearCompiledProgramCache(), which means CL_INVALID_KERNEL. It seems your first hunch was right, it has something to do with the cache. I'll look into this tomorrow.

blueberry commented 8 years ago

One thing that I suspect is this: the program is created using a context, and then compiled with clBuildProgram. Now, there are two ways to create that program: clCreteProgramWithSource and clCreateProgramWithBinary. If compiled binaries could be reused between contexts (this error seems to point to point to that they couldn't) CLCreateProgramFromBinary wouldn't be much useful. Indeed, see the notes at https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/clCreateProgramWithBinary.html They recommend this route for caching (although I would still prefer an explicit initialize() function just to make sure that invocations would be at full speed even the first time the function is called)

CNugteren commented 8 years ago

OK, I identified the issue. What CLBlast stores in its cache is the cl_program object. However, it seems that this is associated with a specific context, hence the issue you are encountering when CLBlast is called on a different context afterwards.

The solution is to use clGetProgramInfo to retrieve the actual binary from the cl_program, store that in the cache, and then later on create a new cl_program using clCreateProgramWithBinary. I'll need to redesign some parts of CLBlast for this.

blueberry commented 8 years ago

Be careful, though, since clCreateProgramWithBinary might induce a big overhead (might, i am not sure exactly what's the speed of it) if it needs to be called each time something is taken from the cache. Maybe a better solution in the case of a big overhead would be a per-context cache, where cache can be initialized/cleaned for each context explicitly if that is important, and also can be global as it is now if the application does not use more than one context?

CNugteren commented 8 years ago

I understand, but most probably the bulk of the compilation is already done. I quickly tested on my system and that seems to be the case. But it might indeed be different on other systems.

For now, there is version that should work in development. It caches the binary instead of the cl_program. If this turns out to be a performance issue, we'll look at it again in a later stage. But I think this is what the OpenCL manual means with caching. Also, I believe this is what clBLAS is doing, although I am not 100% certain.

Also, the clear-cache functions are now named clblast::ClearCache (C++) and CLBlastClearCache (C).

Thanks again both for pointing out this issue.

gpu commented 8 years ago

( @blueberry https://github.com/gpu/JOCLBlast/issues/5 is still open, I'll rename the functions soon ;-))

gcp commented 8 years ago

Also, the clear-cache functions are now named clblast::ClearCache (C++) and CLBlastClearCache (C).

Although those are in the clblast.h header, they are missing PUBLIC_API in clblast.cpp, so you will get a link failure if you try to use them.

CNugteren commented 8 years ago

You are right, they were missing from the C++ version. They were already exported in the C API though. I now added the PUBLIC_API for the C++ API to the clblast.h file (these functions are not templated so I think that is the right place for the declspec attribute). This is part of the b330ab0 commit. This commit also adds __declspec(dllimport) in case the headers are included but the library is already built.

I don't have a Windows system to test on right now, but the AppVeyor builds seem to pass.

blueberry commented 8 years ago

Does it affect 0.8.0? @gpu plans to release JOCLBlast next week, so perhaps this requires a 0.8.1 release?

CNugteren commented 8 years ago

The commit is made to the development branch, so it is not in 0.8.0. I am not a Windows export, so I don't know how important the missing __declspec(dllimport) is, but I haven't heard of any issues. It was the same in 0.7.0 and 0.7.1. The missing __declspec(dllexport) from the cache functions in the C++ API is not important enough for a new release I would say, since those functions are unlikely to be used anyway.

gpu commented 8 years ago

Indeed, at least for JOCLBlast, this is not critical, because it does not use the C++ API at all.