clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL
Apache License 2.0

clBLAS-tune fails on Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile #268

Closed OursDesCavernes closed 3 years ago

OursDesCavernes commented 8 years ago

This is similar to #136: one kernel failing with specific arguments aborts the whole tuning run.

My OpenCL driver and hardware:

$ gpuinfo 
num platforms: 1

platform index: 0:
platform id: 0x7f2d29d3a9c0
platform vendor: Intel
platform name: Intel Gen OCL Driver
platform num devices: 1

   device index: 0
   device id: 0x7f2d29d3bbc0
   device type: 4
   global memory size: 2048MB
   local memory size: 64KB
   global cache size: 8KB
   global cacheline size: 64
   max memory alloc size: 1024MB
   max compute units: 20
   max workgroup size: 512
   max workitem dimensions: 3
   max workitem sizes: 512 512 512
   device name: Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile
   opencl c version: OpenCL C 1.2 beignet 1.1.2
   opencl device version: OpenCL 1.2 beignet 1.1.2
   frequency MHz: 100

I get this:

$ CLBLAS_STORAGE_PATH=".clblas" clBLAS-tune --float --gemm
GEMM is being tuned, progress:  0.02% drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES)

This behavior comes from tune.c#L479: exit(EXIT_COD_CL_ERROR);

I use clBLAS with DeepCL, and DeepCL skips a failing kernel instead of dying during its kernel selection process.

Better error handling would be an easy way to improve compatibility with non-AMD devices.

What do you think?

I'll try to get some code working this way. Tips are welcome, as I'm not familiar with the clBLAS code or with OpenCL development.

Thank you.

kknox commented 8 years ago

As a general statement, I do not like the exit() call. It is always confusing when a program abruptly terminates, especially when you didn't write the code. I would like to get rid of the exit() calls (there are several), but I do not know how the rest of the program would behave without them. That's an untested scenario.

The checkErrorFunc() function seems to be used for many different purposes; it looks like it is also used to check the return value of clGetDeviceIDs() and other system queries.

The safest route is to limit the scope of the change. Maybe change the signature of checkErrorFunc() to receive a bool; most calls would pass in a value for which the program still exits, as with a failure in clGetDeviceIDs(). For the few cases where you want execution to continue despite a failed kernel launch, you could pass in a bool that skips the exit() call and see how the tuner behaves.
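A minimal sketch of that idea, not the actual clBLAS code: the name checkErrorFuncEx(), the `fatal` parameter, and the SKETCH_* constants are hypothetical, and the real checkErrorFunc() signature may differ. The status stand-ins use the same values as CL_SUCCESS and CL_OUT_OF_RESOURCES in CL/cl.h so that the example compiles without the OpenCL headers.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-ins so the sketch builds without CL/cl.h; the values match
 * CL_SUCCESS and CL_OUT_OF_RESOURCES from the OpenCL headers. */
#define SKETCH_CL_SUCCESS            0
#define SKETCH_CL_OUT_OF_RESOURCES (-5)

/* Hypothetical exit code; the tuner defines its own. */
#define SKETCH_EXIT_CL_ERROR 1

/*
 * Hypothetical variant of checkErrorFunc().
 * Returns true when the call succeeded. On failure it prints a message and
 * either terminates the program (fatal == true) or returns false so that
 * the caller can skip the failing kernel and keep tuning.
 */
bool
checkErrorFuncEx(const char *funcName, int status, bool fatal)
{
    if (status == SKETCH_CL_SUCCESS) {
        return true;
    }
    fprintf(stderr, "%s() failed with %d\n", funcName, status);
    if (fatal) {
        exit(SKETCH_EXIT_CL_ERROR);
    }
    return false;
}
```

Under this sketch, existing call sites such as the device queries would pass `true` and keep their current behavior, while the kernel launch in runKernel() would pass `false` and test the return value.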

Does that make sense?

OursDesCavernes commented 8 years ago

I added checkErrorFunc2() that returns false instead of exiting.

With this modification of runKernel():

    status = clEnqueueNDRangeKernel(genInfo.queue, kernel, param->pgran.wgDim,
                                    NULL, globalWorkSize, localWorkSize,
                                    0, NULL, &evt);
    clReleaseKernel(kernel);
    if(checkErrorFunc2("clEnqueueNDRangeKernel",status) == false)
    {
        status = flushAll(genInfo.queue);
        checkErrorFunc("flushAll", status);
        status = waitForSuccessfulFinish(genInfo.queue, &evt);
        checkErrorFunc("waitForSuccessfulFinish", status);
        ret = (double)-1;
        fprintf(logStream, "skipping\n");
        fflush(logStream);

    }

I get this:

$ ./clBLAS-tune --float --syr2k
SYR2K is being tuned, progress:  0.20% drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
 0.41% drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
 0.61% drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping

...

 1.84% drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
drm_intel_gem_bo_context_exec() failed: Input/output error
clEnqueueNDRangeKernel() failed with -5(CL_OUT_OF_RESOURCES) ***
skipping
Killed

I'll try to modify the tuner logic so that it gives up at the first failure.
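One possible reading of "give up at the first failure", sketched under assumptions: each candidate kernel configuration is timed over several runs, a negative time marks a failed launch (mirroring `ret = (double)-1` above), and a candidate is abandoned as soon as one run fails. The names pickBestCandidate(), RunFn, and fakeRun() are hypothetical and do not come from the clBLAS tuner.

```c
#include <stddef.h>

/* Hypothetical callback: returns the time of one kernel launch, or a
 * negative value when the launch failed. */
typedef double (*RunFn)(size_t candidate, void *ctx);

/*
 * Returns the index of the candidate with the lowest total time, or -1 if
 * every candidate failed. A candidate is dropped at its first failed run
 * instead of being retried.
 */
long
pickBestCandidate(size_t nCandidates, unsigned runsPerCandidate,
                  RunFn run, void *ctx)
{
    long best = -1;
    double bestTime = 0.0;

    for (size_t i = 0; i < nCandidates; i++) {
        double total = 0.0;
        int failed = 0;

        for (unsigned r = 0; r < runsPerCandidate; r++) {
            double t = run(i, ctx);
            if (t < 0.0) {
                failed = 1;     /* first failure: stop trying this one */
                break;
            }
            total += t;
        }
        if (failed || runsPerCandidate == 0) {
            continue;           /* never select an unmeasured candidate */
        }
        if (best < 0 || total < bestTime) {
            best = (long)i;
            bestTime = total;
        }
    }
    return best;
}

/* Demo stand-in for a real kernel launch: candidate 1 always fails. */
double
fakeRun(size_t candidate, void *ctx)
{
    static const double times[] = { 3.0, -1.0, 2.0 };
    (void)ctx;
    return times[candidate];
}
```

With the demo callback, `pickBestCandidate(3, 4, fakeRun, NULL)` skips candidate 1 after a single failed launch and selects candidate 2, so a failing configuration costs one launch rather than a full measurement series.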

OursDesCavernes commented 3 years ago

I'm no longer using this hardware, so I'm closing this. Feel free to re-open if anyone else has the issue.