Closed by blueberry 8 years ago
I suspect the problem arises if the cache uses the device as the id for programs (compiled kernels), while clCreateProgramWithSource uses the context. Then the program is found in the cache because the device id is the same, but the context that was used to create the program no longer exists!
But does the context matter? The binary is compiled for a specific device, not for a specific context. Anyway, I tried to reproduce it in a small C example, but I didn't manage to make it crash. I took the samples/sgemm.c program and added the following at line 94 (after printing "Completed with status" but before the clean-up):
clReleaseCommandQueue(queue);
clReleaseContext(context);
cl_context context2 = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue2 = clCreateCommandQueue(context2, device, 0, NULL);
StatusCode status2 = CLBlastSgemm(kRowMajor, kNo, kNo,
m, n, k,
alpha,
device_a, 0, a_ld,
device_b, 0, b_ld,
beta,
device_c, 0, c_ld,
&queue2, &event);
clWaitForEvents(1, &event);
printf("Second run (binary taken from cache) completed with status %d\n", status);
If you could provide me with a C or C++ example that has this behaviour, that would be really helpful.
There is now also a ClearCompiledProgramCache() (C++) and a CLBlastClearCompiledProgramCache() (C) function in case you want to double-check whether it is a cache-related issue or not.
I am out of town this week, but as soon as I get back I will investigate this.
I had to look it up, so I'll summarize it briefly here: the error code is defined in clblast.h as
// Custom additional status codes for CLBlast
kKernelLaunchError = -2048, // Problem occurred when enqueuing the kernel
As far as I can see, it only appears in routine.cc, as
// Launches the kernel (and checks for launch errors)
try {
kernel.Launch(queue_, global, local, event, waitForEvents);
} catch (...) { return StatusCode::kKernelLaunchError; }
The kernel.Launch call ends in clpp11.h, where (in all paths) the results of the innermost calls to clEnqueueNDRangeKernel are passed through the CheckError function, which is implemented as
// Error occurred in OpenCL
inline void CheckError(const cl_int status) {
if (status != CL_SUCCESS) {
throw std::runtime_error("Internal OpenCL error: "+std::to_string(status));
}
}
So maybe I misunderstood something here, but it looks like any error caused by clEnqueueNDRangeKernel raises an exception, which eventually shows up as the kKernelLaunchError. The original error code is lost in translation. Maybe there is a way to preserve this error code? For example, throwing a custom exception type that stores the actual CL error code would allow changing the catch block to something like
catch (MySpecialException& e) { return e.getStoredErrorCodeFromEnqueueCall(); }
catch (...) { return StatusCode::kKernelLaunchError; } // As a last resort fallback
(Just a guess, not sure whether this makes sense or really works this way)
Regarding the actual issue or its reason: I just tried this out, using a slightly modified version of the SGEMM example for JOCLBlast (roughly like Cedric described above), and also received the -2048.
But...
destroying the context is a real nuke. From my understanding, it will really destroy everything. So if you destroy the context, then you not only have to re-create the command queue, but also all memory objects - basically, do a complete "shutdown and restart" (and I suspect that this would work, although I didn't test it explicitly).
So @blueberry could you provide a few lines of example code showing how you intended to manage this context change? (Maybe even in Clojure, now I might even be able to try this out ;-))
EDIT: @CNugteren In https://github.com/CNugteren/CLBlast/issues/47#issuecomment-215074336 you printed status, and not status2 - I haven't yet tried this in "plain" CLBlast, but this may be important here.
@gpu I do (re)create everything. And it works OK with my kernels that are 2.0-specific, so this issue is somehow related to CLBlast. Basically, all my vectors and matrices have an engine attached and all live in a certain context that is provided by a caller. The idea is to avoid a global singleton context and to leave it to the caller to decide how the context and queue are managed.
At least we have identified that the problem is either in JOCLBlast or in CLBlast if status2 is -2048 in Cedric's example.
I'll prepare a Clojure project for you.
@gpu Actually, there are examples in the test folder and even a tutorial here: http://neanderthal.uncomplicate.org/articles/tutorial_opencl.html To run it with the currently released Neanderthal 0.5.0, you'll need an OpenCL 2.0 capable device.
Don't put too much effort into this: after modifying the sgemm.c example as described in https://github.com/CNugteren/CLBlast/issues/47#issuecomment-215074336 and blatantly adding an output in CheckError ...
// Error occurred in OpenCL
inline void CheckError(const cl_int status) {
if (status != CL_SUCCESS) {
printf("Return code is %d\n", status);
throw std::runtime_error("Internal OpenCL error: "+std::to_string(status));
}
}
it prints
Completed with event 0000005C9B1CDE60
Completed with status 0
Return code is -34
Second run (binary taken from cache) completed with status -2048
(printing the status2 here, of course). So the error code -34 (which is CL_INVALID_CONTEXT) shows up as -2048, as described above. I tried to figure out where it might get this invalid context from, but got a bit lost in the C++ structures (Routine, Queue, etc.) - I'd have to read this more thoroughly.
A minor thing (not worth its own issue, I guess): in clpp11.h, line 636, at the call
CheckError(clEnqueueNDRangeKernel(queue(), *kernel_, static_cast<cl_uint>(global.size()),
nullptr, global.data(), local.data(),
waitForEventsPlain.size(), waitForEventsPlain.data(),
event));
it complains about the conversion from size_t to cl_uint. It should be static_cast<cl_uint>(waitForEventsPlain.size()) (similar to the cast of global.size()).
Thanks for looking into this, both of you. Indeed, I checked the wrong status previously. I now extended the example at line 94 with the following:
// Test code for a second call with a new context
clReleaseMemObject(device_a);
clReleaseMemObject(device_b);
clReleaseMemObject(device_c);
clReleaseCommandQueue(queue);
clReleaseContext(context);
cl_context context2 = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue2 = clCreateCommandQueue(context2, device, 0, NULL);
cl_mem device_a2 = clCreateBuffer(context2, CL_MEM_READ_WRITE, m*k*sizeof(float), NULL, NULL);
cl_mem device_b2 = clCreateBuffer(context2, CL_MEM_READ_WRITE, n*k*sizeof(float), NULL, NULL);
cl_mem device_c2 = clCreateBuffer(context2, CL_MEM_READ_WRITE, m*n*sizeof(float), NULL, NULL);
clEnqueueWriteBuffer(queue2, device_a2, CL_TRUE, 0, m*k*sizeof(float), host_a, 0, NULL, NULL);
clEnqueueWriteBuffer(queue2, device_b2, CL_TRUE, 0, n*k*sizeof(float), host_b, 0, NULL, NULL);
clEnqueueWriteBuffer(queue2, device_c2, CL_TRUE, 0, m*n*sizeof(float), host_c, 0, NULL, NULL);
CLBlastClearCompiledProgramCache(); // This line is needed
StatusCode status2 = CLBlastSgemm(kRowMajor, kNo, kNo,
m, n, k,
alpha,
device_a2, 0, a_ld,
device_b2, 0, b_ld,
beta,
device_c2, 0, c_ld,
&queue2, &event);
clWaitForEvents(1, &event);
printf("Second run completed with status %d\n", status2);
I now get -48 if I omit CLBlastClearCompiledProgramCache(), which means CL_INVALID_KERNEL. It seems your first hunch was right: it has something to do with the cache. I'll look into this tomorrow.
One thing that I suspect is this: the program is created using a context and then compiled with clBuildProgram. Now, there are two ways to create that program: clCreateProgramWithSource and clCreateProgramWithBinary. If compiled binaries couldn't be reused between contexts (and this error seems to suggest they can't), clCreateProgramWithBinary wouldn't be very useful. Indeed, see the notes at https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/clCreateProgramWithBinary.html They recommend this route for caching (although I would still prefer an explicit initialize() function, just to make sure that invocations would be at full speed even the first time the function is called).
OK, I identified the issue. What CLBlast stores in its cache is the cl_program object. However, it seems that this is associated with a specific context, hence the issue you are encountering when CLBlast is called on a different context afterwards.
The solution is to use clGetProgramInfo to retrieve the actual binary from the cl_program, store that in the cache, and then later on create a new cl_program using clCreateProgramWithBinary. I'll need to redesign some parts of CLBlast for this.
Be careful, though, since clCreateProgramWithBinary might induce a big overhead (might - I am not sure exactly what its speed is) if it needs to be called each time something is taken from the cache. If the overhead is big, maybe a better solution would be a per-context cache, where the cache can be initialized/cleaned for each context explicitly if that is important, and can also be global as it is now if the application does not use more than one context?
I understand, but most probably the bulk of the compilation is already done. I quickly tested on my system and that seems to be the case. But it might indeed be different on other systems.
For now, there is a version that should work in development. It caches the binary instead of the cl_program. If this turns out to be a performance issue, we'll look at it again at a later stage. But I think this is what the OpenCL manual means by caching. Also, I believe this is what clBLAS is doing, although I am not 100% certain.
Also, the clear-cache functions are now named clblast::ClearCache (C++) and CLBlastClearCache (C).
Thanks again both for pointing out this issue.
( @blueberry https://github.com/gpu/JOCLBlast/issues/5 is still open, I'll rename the functions soon ;-))
Also, the clear-cache functions are now named clblast::ClearCache (C++) and CLBlastClearCache (C).
Although those are in the clblast.h header, they are missing PUBLIC_API in clblast.cpp, so you will get a link failure if you try to use them.
You are right, they were missing from the C++ version. They were already exported in the C API though. I now added the PUBLIC_API for the C++ API to the clblast.h file (these functions are not templated, so I think that is the right place for the declspec attribute). This is part of the b330ab0 commit. This commit also adds __declspec(dllimport) in case the headers are included but the library is already built.
I don't have a Windows system to test on right now, but the AppVeyor builds seem to pass.
Does it affect 0.8.0? @gpu plans to release JOCLBlast next week, so perhaps this requires a 0.8.1 release?
The commit is made to the development branch, so it is not in 0.8.0. I am not a Windows expert, so I don't know how important the missing __declspec(dllimport) is, but I haven't heard of any issues. It was the same in 0.7.0 and 0.7.1. The missing __declspec(dllexport) from the cache functions in the C++ API is not important enough for a new release, I would say, since those functions are unlikely to be used anyway.
Indeed, at least for JOCLBlast, this is not critical, because it does not use the C++ API at all.
This is the actual cause of the previously reported issue #43.
I am using CLBlast (development branch) through JOCLBlast, but based on enough experience, I pretty much suspect that this issue is due to CLBlast caching (although I didn't try this directly in C++ since my C++-fu is too rusty).
What's happening:
It seems to me that somehow the old, cached stuff gets mixed up with the newly provided queue.