Closed OursDesCavernes closed 3 years ago
As a general statement, I do not like the exit() call. It is always confusing when a program abruptly terminates, especially when you didn't write the code. I would like to get rid of the exit() calls (there are several), but I do not know how the rest of the program would behave without them. That's an untested scenario.
The checkErrorFunc() function seems to be used for many different purposes; it looks like it is also used to check the return value of clGetDeviceIDs() and other system queries.
The safest route is to limit the scope of the change. Maybe change the signature of checkErrorFunc() to receive a bool; most calls would pass in a value for which the program would still exit, as on a failure of clGetDeviceIDs(). For the few cases where you want execution to continue despite a failed kernel execution, you could pass in a bool that skips the exit() call and see how the tuner behaves.
Does that make sense?
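The signature change suggested above could look roughly like the sketch below. The name checkErrorFuncEx, the exitOnError parameter and the two SKETCH_ constants are assumptions made for illustration, not the actual clBLAS code; the OpenCL status is modeled as a plain int so the snippet compiles without the OpenCL headers.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins: the real tuner uses cl_int and its own exit code. */
#define SKETCH_CL_SUCCESS 0
#define SKETCH_EXIT_CODE  1

/*
 * Returns true when status indicates success.
 * On failure, logs the error and then either exits (exitOnError == true)
 * or returns false so the caller can decide to skip the failing kernel.
 */
bool checkErrorFuncEx(const char *funcName, int status, bool exitOnError)
{
    if (status == SKETCH_CL_SUCCESS) {
        return true;
    }
    fprintf(stderr, "%s() failed with %d\n", funcName, status);
    if (exitOnError) {
        exit(SKETCH_EXIT_CODE);
    }
    return false;
}
```

Under this sketch, existing call sites would pass true and keep their current behavior; only the kernel launch in the tuner would pass false.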
I added checkErrorFunc2() that returns false instead of exiting.
With this modification of runKernel():
    status = clEnqueueNDRangeKernel(genInfo.queue, kernel, param->pgran.wgDim,
                                    NULL, globalWorkSize, localWorkSize,
                                    0, NULL, &evt);
    clReleaseKernel(kernel);
    if (checkErrorFunc2("clEnqueueNDRangeKernel", status) == false)
    {
        status = flushAll(genInfo.queue);
        checkErrorFunc("flushAll", status);
        status = waitForSuccessfulFinish(genInfo.queue, &evt);
        checkErrorFunc("waitForSuccessfulFinish", status);
        ret = (double)-1;
        fprintf(logStream, "skipping\n");
        fflush(logStream);
    }
I get this:
$ ./clBLAS-tune --float --syr2k
SYR2K is being tuned, progress: 0.20% drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
0.41% drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
0.61% drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
...
1.84% drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
Killed
I'll try to modify the tuner logic so that it gives up at the first failure.
I'm no longer using this hardware, so I'm closing this. Feel free to re-open if anyone has the issue.
This is similar to #136, where one kernel failing with specific arguments makes the whole tuning fail.
My OpenCL driver and hardware:
I get this:
This behavior comes from tune.c#L479:
exit(EXIT_COD_CL_ERROR);
I use clBLAS with DeepCL, and DeepCL skips a failing kernel instead of dying during the kernel selection process.
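To illustrate the kind of skip-on-failure selection meant here, a minimal sketch follows. It is neither DeepCL's nor clBLAS's actual code: the function name pickFastest and the convention that a negative time marks a kernel that failed to run are assumptions for illustration.

```c
#include <stddef.h>

/*
 * Returns the index of the fastest candidate kernel, or -1 if every
 * candidate failed. A negative entry in times[] marks a kernel that
 * could not be run (assumed convention for this sketch).
 */
int pickFastest(const double *times, size_t count)
{
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (times[i] < 0.0) {
            continue; /* failed kernel: skip it instead of aborting */
        }
        if (best < 0 || times[i] < times[best]) {
            best = (int)i;
        }
    }
    return best;
}
```

With this shape, a single failing candidate only removes itself from the comparison; the caller has to handle the -1 result when no candidate could be run at all.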
Better error handling would be an easy way to improve compatibility with non-AMD devices.
What do you think?
I'll try to get some code working this way. Tips are welcome, as I'm not familiar with the clBLAS code or OpenCL development.
Thank you.