ProjectPhysX / OpenCL-Wrapper

OpenCL is the most powerful programming language ever created. Yet the OpenCL C++ bindings are cumbersome and the code overhead prevents many people from getting started. I created this lightweight OpenCL-Wrapper to greatly simplify OpenCL software development with C++ while keeping functionality and performance.

Creative solution for Debugging out of Kernels? Here is mine. Any interesting ideas you have come up with? #15

Closed jonmdev closed 2 months ago

jonmdev commented 2 months ago

Again, I can't thank you enough for this project and all your many replies and posts on StackExchange and even Reddit which I have found while researching how to do basic things. I now have a working implementation of my project, and only 8 days after I first started. To get up and running with OpenCL and convert a project over in only 8 days is a testament to your good design and explanations.

I just have one more question.

It is challenging to see what is happening inside the Kernels. For example, if you access whatever[i] in host code and i is out of range, you will typically get an error in Visual Studio, but the Kernel says nothing if you do this inside it.

It is also hard to see which points, if branches, etc. are being hit. My best idea was the following:

1) Create Memory<char> & Memory<float> objects and pass into Kernel:

For example, in Kernel design, add the following parameters:

//debug
global char* debugChar,
global float* debugFloat, 
global int* dbgIndexC,
global int* dbgIndexF,

int maxDebugChar

Here debugChar and debugFloat are Memory<char>(device, maxDebugChar) & Memory<float>(device, maxDebugChar).

dbgIndexC and dbgIndexF are each Memory<int>(device, 1), used as write indexes and initialized to 0, so that every new entry written during a kernel run increments a global index.

2) Use inside Kernel:

I have found debugFloat the most helpful: using just one buffer maintains chronology, and putting strings into a char buffer is too hard. Putting floats or ints into a char buffer is also too hard.

So for example, you can do:

const uint i = get_global_id(0); //gets index of array 0 to n
const uint g = get_group_id(0); //get workgroup
const uint groupNums = get_num_groups(0); //get num groups

 //USE ANY RANDOM NUMBER SO YOU CAN SEARCH FOR IT AS "START" OF DEBUG STATEMENT (here 1.333)
if (dbgIndexF[0] < maxDebugChar) { debugFloat[dbgIndexF[0]] = 1.333; dbgIndexF[0]++; }
if (dbgIndexF[0] < maxDebugChar) { debugFloat[dbgIndexF[0]] = i; dbgIndexF[0]++; }
if (dbgIndexF[0] < maxDebugChar) { debugFloat[dbgIndexF[0]] = g; dbgIndexF[0]++; }
if (dbgIndexF[0] < maxDebugChar) { debugFloat[dbgIndexF[0]] = groupNums; dbgIndexF[0]++; }
if (dbgIndexF[0] < maxDebugChar) { debugFloat[dbgIndexF[0]] = 4.222; dbgIndexF[0]++; } 
//USE ANY RANDOM NUMBER SO YOU CAN SEARCH FOR IT AS "END" OF DEBUG STATEMENT (here 4.222)

Or alternatively, you can try for char, but this is very tedious, and since you can't add floats/doubles/ints into the char array easily it is less useful:

if (dbgIndexC[0] < maxDebugChar) { debugChar[dbgIndexC[0]] = 'S'; dbgIndexC[0]++; }
if (dbgIndexC[0] < maxDebugChar) { debugChar[dbgIndexC[0]] = 'T'; dbgIndexC[0]++; }
if (dbgIndexC[0] < maxDebugChar) { debugChar[dbgIndexC[0]] = 'A'; dbgIndexC[0]++; } 
if (dbgIndexC[0] < maxDebugChar) { debugChar[dbgIndexC[0]] = 'R'; dbgIndexC[0]++; } 
if (dbgIndexC[0] < maxDebugChar) { debugChar[dbgIndexC[0]] = 'T'; dbgIndexC[0]++; } 

3) Get and print out the Debug Info:

After the Kernel runs, run a function to process and print out the debug info in whatever way the system needs. Like for example:

void readDebugData() {

    debugChar->read_from_device();
    std::string debugString = "";
    for (int i = 0; i < maxDebugChar; i++) { debugString += (*debugChar)[i];  /*clear it*/  (*debugChar)[i] = ' '; }
    DBG(debugString);

    debugFloat->read_from_device();
    debugString = "";
    for (int i = 0; i < maxDebugChar; i++) { debugString += " | " + cl_to_string((*debugFloat)[i]); /*clear it*/ (*debugFloat)[i] = 4; }
    DBG(debugString);

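    // reset the device-side buffers and indexes (assuming the host copies of
    // dbgIndexC and dbgIndexF still hold their initial 0, as they are never read back)
    debugChar->write_to_device(); // upload the blanked char buffer
    debugFloat->write_to_device(); // upload the reset float buffer
    dbgIndexC->write_to_device(); // reset the char write index
    dbgIndexF->write_to_device(); // reset the float write index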
}
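The start/end marker scheme above can also be decoded on the host instead of searched for by eye. Below is a minimal sketch, assuming each kernel debug statement is written as a start marker, some payload floats, and an end marker. The helper name parse_debug_records is hypothetical and not part of OpenCL-Wrapper; it works on a plain std::vector<float> copied out of the debug buffer.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not part of OpenCL-Wrapper): splits a float debug buffer
// into records, where each record is the run of values between a start marker
// and the next end marker. Markers are matched with a small tolerance.
std::vector<std::vector<float>> parse_debug_records(const std::vector<float>& buffer,
                                                    float start_marker, float end_marker) {
    assert(start_marker != end_marker); // identical markers would be ambiguous
    const float eps = 1e-4f;
    std::vector<std::vector<float>> records;
    std::vector<float> current;
    bool inside = false; // true while between a start marker and its end marker
    for (float v : buffer) {
        if (!inside) {
            if (std::fabs(v - start_marker) < eps) { inside = true; current.clear(); }
        } else if (std::fabs(v - end_marker) < eps) {
            records.push_back(current);
            inside = false;
        } else {
            current.push_back(v);
        }
    }
    return records; // a trailing record without an end marker is dropped
}
```

A payload value that happens to equal one of the markers would be misread as a marker, so the markers should be values the kernel never produces as data.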

Ideas?

That was my best idea and it works at least okay. Without it I could never have figured out how to use the kernels or how they were allocating into workgroups etc.

However, it can also crash the Kernel, causing it to hang for 400 ms, which I presume is the device timeout, after which OpenCL just stops responding to further requests. I.e. this is not "workgroup safe."

I presume this is triggered when multiple workgroups all try to write to the same debug index/array at once. So this method needs to be improved, though it is at least somewhat useful.
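One property of the check-then-write pattern is certain: reading dbgIndexF[0], writing, and then incrementing are separate steps, so two work-items can pick the same index, or pass the bounds check and then write past the end. A common alternative is to reserve a slot with a single atomic increment and write only to that slot. The sketch below simulates this on the host with std::atomic; the DebugLog type is hypothetical. In OpenCL C the analogous reservation would be something like `const int slot = atomic_inc(dbgIndexF);` followed by `if (slot < maxDebugChar) debugFloat[slot] = value;`. Whether this also removes the hang described above is an assumption, not something confirmed in this thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for the debug buffer plus its index (hypothetical type).
// Every writer reserves a unique slot with one atomic fetch-and-add, so no two
// writers can ever share an index, and the bounds check uses the reserved slot.
struct DebugLog {
    std::vector<float> data;
    std::atomic<int> next{0};

    explicit DebugLog(std::size_t capacity) : data(capacity, 0.0f) {}

    // Returns true if the value was stored, false if the log was already full.
    bool push(float value) {
        const int slot = next.fetch_add(1); // atomic reservation, returns the old index
        if (slot < 0 || static_cast<std::size_t>(slot) >= data.size()) return false;
        data[static_cast<std::size_t>(slot)] = value;
        return true;
    }

    // Number of valid entries; the counter itself may overshoot the capacity.
    std::size_t size() const {
        const int n = next.load();
        if (n < 0) return 0;
        return static_cast<std::size_t>(n) < data.size() ? static_cast<std::size_t>(n) : data.size();
    }
};
```

With this scheme the order of entries is the order of reservation, and a multi-value record from one work-item can be kept contiguous by reserving several slots at once (fetch_add with the record length, or atomic_add in a kernel).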

Additionally, besides compilation errors, there is still no obvious good way I can think of to be alerted if you do something wrong, like an out-of-range read, and it is hard to find Kernel code mistakes. E.g. trying to read dbgIndexF[-1] (which doesn't exist) inside the kernel creates no error. Interestingly, this returns 0 for me when I try to debug out the value using the method above, i.e.:

    int testBreak = dbgIndexF[-1]; //doesn't exist
    if (dbgIndexF[0] < maxDebugChar) { debugFloat[dbgIndexF[0]] = testBreak; dbgIndexF[0]++; }
    //this returns 0 when I read it outside the kernel

However, I presume this is just "undefined behavior". I only caught some mistakes I made by copying my kernel out and rephrasing it into regular code and running it on the CPU to see what would happen.
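The CPU re-run described above can be made to report out-of-range accesses directly. Here is a minimal sketch, assuming the kernel body is rewritten as a plain function called once per global id, with buffers as std::vector and every access going through at(), which throws where the GPU would stay silent. The kernel body and both function names are hypothetical examples, not code from this project.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical kernel body as a plain function: one call per global id.
// The read of in.at(i + 1) is a deliberate bug for the last work-item.
void kernel_body(std::size_t i, const std::vector<float>& in, std::vector<float>& out) {
    out.at(i) = in.at(i) + in.at(i + 1);
}

// Runs the body for every id in [0, global_size) and returns the first id whose
// access went out of range, or -1 if every access was in range.
long find_first_out_of_range(std::size_t global_size,
                             const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t i = 0; i < global_size; i++) {
        try {
            kernel_body(i, in, out);
        } catch (const std::out_of_range&) {
            return static_cast<long>(i); // at() threw: this id read or wrote out of range
        }
    }
    return -1;
}
```

Running the ids serially like this only catches indexing mistakes; it says nothing about races between work-items.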

You have obviously been at this longer than me and understand the system better.

I am just wondering if you have come up with any different or better methods for (1) Debugging things out, and (2) Catching Kernel code errors.

Thanks for any thoughts as usual, and thanks again for letting me get into GPU work so quickly and (relatively) painlessly. 🙂

ProjectPhysX commented 2 months ago

Hi @jonmdev,

I can give you some more debugging tricks on hand:

A very nasty bug is when data types mismatch between Memory<type> on the host side and kernel(type* ...) on the device side. I've already built in a check that will error on mismatching type sizes, like float/double, and on a mismatching number/order of kernel parameters between host and device side. But it can't detect mismatching types of the same size, such as uint/float.
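To illustrate why a same-size mismatch is so nasty, here is a small host-only sketch (no OpenCL involved, function name hypothetical) of what a kernel parameter declared as uint* would effectively see if the host buffer was filled as float: the same 32 bits, interpreted as a completely different number.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterprets the 32 bits of a float as an unsigned integer, mimicking a
// Memory<float> on the host being read through a uint* on the device.
std::uint32_t bits_seen_as_uint(float host_value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "same size, so a size check cannot catch it");
    std::uint32_t device_view = 0;
    std::memcpy(&device_view, &host_value, sizeof(device_view)); // copy the raw bits, no conversion
    return device_view;
}
```

On an IEEE 754 platform the float 1.0f shows up as 1065353216 (0x3F800000), so the values look plausible enough to pass unnoticed while being entirely wrong.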

Kind regards, Moritz