Segfault on kernel waiting for list of events

WrathfulSpatula commented 5 years ago

doe300, as you know, back on the Qrack project's Raspberry Pi 3 support pull request at https://github.com/vm6502q/qrack/pull/143, we get a segfault if our kernel calls receive a large list of events to wait on, but not if the wait list is manually waited on before calling the kernel.

The wait list is daisy-chained. We pass only the direct wait list dependencies for the kernel, while the events in the list can depend on their own separate sets of dependencies that aren't in the kernel list.

We have this pattern, (which works):

#if ENABLE_RASPBERRYPI
    clFinish();
#endif

    // Dispatch the primary kernel, to apply the gate.
    cl::Event kernelEvent;
    std::vector<cl::Event> kernelWaitVec = device_context->ResetWaitEvents();
    queue.enqueueNDRangeKernel(ocl.call, cl::NullRange, // kernel, offset
        cl::NDRange(workItemCount), // global number of work items
        cl::NDRange(localGroupSize), // local number (per group)
        &kernelWaitVec, // vector of events to wait for
        &kernelEvent); // handle to wait for the kernel

    queue.flush();

#if ENABLE_RASPBERRYPI
    clFinish();
#endif

The clFinish() method is a user code "soft finish" implementation; all it does is manually wait on the list of events returned by device_context->ResetWaitEvents(), and it clears the list. In the simplest case we've tested, the list of events just read into the set of buffers needed before calling the kernel. We pass these events in the list to the kernel, ans we manually wait on the same buffer loads before exiting our method, to make sure that the loads complete before we could lose ephemeral inputs for the buffers.

This implementation has been tested extensively on NVIDIA GTX cards, Intel Core HDs, Intel Cores, and and Intel Xeons, and it's never been an issue anywhere except on the Raspberry Pi.

I think this kind of fall-through asynchronous behavior is asking a lot of the Raspberry Pi's resources, but I don't think there's a logical problem with the chain of dependencies, if we remove the clFinish() calls for the Raspberry Pi.

doe300 commented 5 years ago

Lets see, if I understand this correctly:

with clFinish() waiting on all events which belong to this objects, everything works okay (except that the current thread has to wait, which is not good performance wise).
without clFinish() you get a SEGFAULT with VC4CL?

Could you run the execution with gdb or valgrind (preferably with debug build, CMAKE_BUILD_TYPE=Debug) and attach the outputs (esp. the stacktrace)?

WrathfulSpatula commented 5 years ago

You are correct, that this is what is happening. It will take me some time to get the profiling output, (our unit tests take quite some time on the Raspberry Pi 3,) but I will get that for you today.

Maybe I can strip down the unit tests to a minimum example that triggers the issue, for everyone's ease.

WrathfulSpatula commented 5 years ago

I built with CMAKE_BUILD_TYPE=Debug and ran our unit tests with valgrind. Execution seemed to hang for a long time before I ^Z^C-ed on the command line. It looks like it hung at about the point I'd expect, in building the wait list dependency graph. The output is here:

https://gist.github.com/WrathfulSpatula/55253877a3b29ae7fc3a5bbc1329e8e4

I'm running this again, and I'm letting it go for as long as it wants, hours if necessary. I'll revise the gist above, if anything turns up.

WrathfulSpatula commented 5 years ago

If I leave this running for over an hour, it gets farther, but my Raspberry Pi becomes unresponsive. I managed to capture the output with a valgrind log file. I've updated the output, at the same link above.

I've noticed that, if I put in std::cout output, the segfault doesn't necessarily happen, but the unit tests still fail with the wrong results. So, there might be an issue with unreliable asynchronous output.

I'm not positive that I've captured the issue, yet, but I'm also looking over this output for anything indicative of a bug in our Qrack project. I'll let you know if I find the bug on our end. I'm not sure why QEngineCPU from our project is showing up, because it should be skipped in this case to just run the QEngineOCL tests, but I'll let you know if I come up with anything useful.

doe300 commented 5 years ago

So what I can see in the valgrind output:

There are some issues with QEngineCPU (see L389 and following lines) being freed with the wrong method.
Somewhere events are waited on which is already freed (L389, L389ff). This could be the related to the problem with the clEnqueueNDRange and wait-list parameter.

Sadly, the stack traces are not longer for these cases, but I would assume that the error occurs exactly where then the SEGFAULT occurs. This looks like events are added to the wait-list which are already freed, which means that somewhere (in client or library code), clRetainEvent() is called one time to few or clReleaseEvent() one time to many before the events are added to the wait-list.

WrathfulSpatula commented 5 years ago

Thank you. I'm using the OpenCL C++ headers, so I don't directly call clRetainEvent() or clReleaseEvent(), but something I am doing is probably triggering this. The kernel needs to wait on the buffer loading events, but those also need to be (separately) guaranteed to complete before we exit the same method, so that we don't lose ephemeral array parameters to load into those buffers, if the kernel is queued but not run before finishing the method.

I realize now, I keep two variable references to these wait events, to cover both of the above dependency conditions, but the C++ headers assume that most OpenCL objects are unique, such that a deallocator/destructor is probably called twice on each buffer load event.

If keeping the wait event objects unique fixes the problem, I'll let you know soon, and we can close the issue. Thank you again!

WrathfulSpatula commented 5 years ago

The bug appears to have indeed been on our end, exactly due to not properly handling the cl::Event objects of the C++ bindings as unique objects. Since the instances must be considered unique, we must not try to duplicate them on stack, and we must also avoid invoking their destructors by calling clear() on vectors that contain them. Thank you again!

doe300 / VC4CL

Segfault on kernel waiting for list of events #55