WrathfulSpatula closed this issue 5 years ago.
Let's see, if I understand this correctly:

- With the `clFinish()` waiting on all events which belong to this object, everything works okay (except that the current thread has to wait, which is not good performance-wise).
- Without the `clFinish()`, you get a SEGFAULT with VC4CL?

Could you run the execution with `gdb` or `valgrind` (preferably with a debug build, `CMAKE_BUILD_TYPE=Debug`) and attach the outputs (esp. the stack trace)?
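A minimal sketch of the two code paths in question, using the OpenCL C++ bindings; the queue, kernel, and work size here are illustrative placeholders, not Qrack code:

```cpp
#include <CL/cl.hpp>
#include <vector>

// Two ways to honor the same dependencies before a kernel launch:
void dispatch(cl::CommandQueue& queue, cl::Kernel& kernel,
              std::vector<cl::Event>& waitList, bool softFinish)
{
    if (softFinish) {
        // Path 1: block the host thread until all dependencies complete,
        // then enqueue with no wait list. Reported to work on VC4CL, but
        // it serializes the host against the device.
        cl::Event::waitForEvents(waitList);
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(64));
    } else {
        // Path 2: hand the dependencies to the runtime as a wait list and
        // return immediately. This is the asynchronous path that SEGFAULTs.
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(64),
                                   cl::NullRange, &waitList);
    }
}
```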
You are correct that this is what is happening. It will take me some time to get the profiling output (our unit tests take quite some time on the Raspberry Pi 3), but I will get that for you today.
Maybe I can strip the unit tests down to a minimal example that triggers the issue, for everyone's ease.
I built with `CMAKE_BUILD_TYPE=Debug` and ran our unit tests with valgrind. Execution seemed to hang for a long time before I `^Z`/`^C`-ed on the command line. It looks like it hung at about the point I'd expect, in building the wait-list dependency graph. The output is here:
https://gist.github.com/WrathfulSpatula/55253877a3b29ae7fc3a5bbc1329e8e4
I'm running this again, and I'm letting it go for as long as it wants, hours if necessary. I'll revise the gist above if anything turns up.
If I leave this running for over an hour, it gets farther, but my Raspberry Pi becomes unresponsive. I managed to capture the output with a valgrind log file, and I've updated the output at the same link above.
I've noticed that, if I put in `std::cout` output, the segfault doesn't necessarily happen, but the unit tests still fail with the wrong results. So there might be an issue with unreliable asynchronous output.
I'm not positive that I've captured the issue yet, but I'm also looking over this output for anything indicative of a bug in our Qrack project. I'll let you know if I find the bug on our end. I'm not sure why `QEngineCPU` from our project is showing up, because it should be skipped in this case to run just the `QEngineOCL` tests, but I'll let you know if I come up with anything useful.
So what I can see in the valgrind output:

- A `QEngineCPU` (see L389 and following lines) being freed with the wrong method.
- Already-freed events being accessed, e.g. via the `clEnqueueNDRange` and wait-list parameter.

Sadly, the stack traces are not any longer for these cases, but I would assume that the error occurs exactly where the SEGFAULT then occurs. This looks like events are added to the wait-list which are already freed, which means that somewhere (in client or library code), `clRetainEvent()` is called one time too few or `clReleaseEvent()` one time too many before the events are added to the wait-list.
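A sketch of the reference-counting rule at play, against the C API; the `startAsyncLoad()` helper and the work size are illustrative assumptions, not code from either project:

```cpp
#include <CL/cl.h>

// Hypothetical helper: enqueues a non-blocking buffer load and returns
// its completion event with a reference count of 1.
extern cl_event startAsyncLoad();

void example(cl_command_queue queue, cl_kernel kernel)
{
    cl_event evt = startAsyncLoad();   // refcount: 1

    // A second owner (e.g. a separate "wait before method exit" list)
    // must take its own reference...
    clRetainEvent(evt);                // refcount: 2

    size_t global = 64;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
                           1, &evt, nullptr);

    // ...and each owner must release exactly once. One retain too few, or
    // one release too many, leaves a freed event sitting in a wait list.
    clReleaseEvent(evt);               // refcount: 1
    clWaitForEvents(1, &evt);
    clReleaseEvent(evt);               // refcount: 0, the event is freed
}
```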
Thank you. I'm using the OpenCL C++ headers, so I don't directly call `clRetainEvent()` or `clReleaseEvent()`, but something I am doing is probably triggering this. The kernel needs to wait on the buffer-loading events, but those also need to be (separately) guaranteed to complete before we exit the same method, so that we don't lose the ephemeral array parameters to load into those buffers, if the kernel is queued but not run before finishing the method.
I realize now that I keep two variable references to these wait events, to cover both of the above dependency conditions, but the C++ headers assume that most OpenCL objects are unique, such that a deallocator/destructor is probably called twice on each buffer-load event.
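A hypothetical reconstruction of that double release, assuming the two references are wrappers built from the same raw handle (in the cl.hpp bindings, constructing a wrapper from a raw `cl_event` takes ownership without retaining):

```cpp
#include <CL/cl.hpp>

// Hypothetical: 'raw' is the completion event of one buffer load, carrying
// a single reference.
void doubleReleaseBug(cl_event raw)
{
    cl::Event forKernelWaitList(raw);   // wrapper #1 assumes ownership
    cl::Event forMethodExitWait(raw);   // wrapper #2 ALSO assumes ownership

    // On scope exit, each wrapper's destructor calls clReleaseEvent(raw),
    // so the single reference is released twice: the second release acts
    // on an already-freed event, and any wait list still holding the raw
    // handle now dangles.
}
```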
If keeping the wait event objects unique fixes the problem, I'll let you know soon, and we can close the issue. Thank you again!
The bug appears to have indeed been on our end, exactly due to not properly handling the `cl::Event` objects of the C++ bindings as unique objects. Since the instances must be considered unique, we must not try to duplicate them on the stack, and we must also avoid invoking their destructors by calling `clear()` on vectors that contain them. Thank you again!
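A sketch of the corrected pattern under that constraint: each `cl::Event` is constructed exactly once, inside a single wait-list vector that is then passed by reference; the buffer, sizes, and kernel here are illustrative, not the actual Qrack code:

```cpp
#include <CL/cl.hpp>
#include <vector>

void loadAndRun(cl::CommandQueue& queue, cl::Kernel& kernel,
                cl::Buffer& buffer, const void* hostPtr, size_t size)
{
    std::vector<cl::Event> waitList;

    // Construct the event in place, so that exactly one cl::Event object
    // ever owns this handle: no copies on the stack, no double destruction.
    waitList.emplace_back();
    queue.enqueueWriteBuffer(buffer, CL_FALSE, 0, size, hostPtr,
                             nullptr, &waitList.back());

    // The kernel waits on the load asynchronously...
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(64),
                               cl::NullRange, &waitList);

    // ...and we also wait on the same events before returning, so the
    // ephemeral host array outlives the non-blocking write.
    cl::Event::waitForEvents(waitList);

    // Let the vector go out of scope and release each event exactly once,
    // rather than clear()-ing it while other references may be in flight.
}
```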
doe300, as you know, back on the Qrack project's Raspberry Pi 3 support pull request at https://github.com/vm6502q/qrack/pull/143, we get a segfault if our kernel calls receive a large list of events to wait on, but not if the wait list is manually waited on before calling the kernel.
The wait list is daisy-chained. We pass only the direct wait list dependencies for the kernel, while the events in the list can depend on their own separate sets of dependencies that aren't in the kernel list.
We have this pattern, which works (a sketch follows below): the `clFinish()` method is a user-code "soft finish" implementation; all it does is manually wait on the list of events returned by `device_context->ResetWaitEvents()`, and it clears the list. In the simplest case we've tested, the events in the list just read into the set of buffers needed before calling the kernel. We pass these events in the list to the kernel, and we manually wait on the same buffer loads before exiting our method, to make sure that the loads complete before we could lose the ephemeral inputs for the buffers. This implementation has been tested extensively on NVIDIA GTX cards, Intel Core HDs, Intel Cores, and Intel Xeons, and it has never been an issue anywhere except on the Raspberry Pi.
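A minimal sketch of that soft-finish pattern as described; `DeviceContext`, its member names, and the engine class are illustrative reconstructions, not the actual Qrack source:

```cpp
#include <CL/cl.hpp>
#include <memory>
#include <utility>
#include <vector>

// Illustrative stand-in for the real device context described above.
struct DeviceContext {
    std::vector<cl::Event> wait_events;

    // Hand back the outstanding events and start a fresh, empty list.
    std::vector<cl::Event> ResetWaitEvents() {
        std::vector<cl::Event> outstanding = std::move(wait_events);
        wait_events = std::vector<cl::Event>();
        return outstanding;
    }
};

struct EngineSketch {
    std::shared_ptr<DeviceContext> device_context;

    // User-code "soft finish": wait only on our own outstanding events,
    // instead of draining the whole queue with the OpenCL clFinish().
    void clFinish() {
        std::vector<cl::Event> waitList = device_context->ResetWaitEvents();
        if (!waitList.empty()) {
            cl::Event::waitForEvents(waitList);
        }
        // waitList goes out of scope here, releasing each event once.
    }
};
```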
I think this kind of fall-through asynchronous behavior is asking a lot of the Raspberry Pi's resources, but I don't think there's a logical problem with the chain of dependencies, if we remove the `clFinish()` calls for the Raspberry Pi.