Open bentsherman opened 5 years ago
So I followed this stack trace and I'm looking at a few functions in particular in ACE:
CUDA::Kernel::execute()
CUDARunThread::run()
EApplication::notify()
And from these functions I can tell that ACE is supposed to catch any ACE-specific exceptions from worker threads and re-throw them in the main thread so that they are properly handled. So it seems like if CUDA::Kernel::execute() threw an exception then it should have been handled properly, but it wasn't.
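For readers unfamiliar with the pattern described above, here is a minimal sketch of catching an exception in a worker thread and re-throwing it in the owning thread. It uses std::thread and std::exception_ptr as stand-ins for ACE's Qt threads and EException, so GuardedWorker and runFailingTask are hypothetical names and not part of ACE.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for a worker such as CUDARunThread: runs a task on
// its own thread and stores any exception instead of letting it escape.
class GuardedWorker
{
public:
    template <class Task>
    void start(Task task)
    {
        _thread = std::thread([this, task]() mutable {
            try
            {
                task();
            }
            catch (...)
            {
                // Save the in-flight exception for the owning thread.
                _error = std::current_exception();
            }
        });
    }

    // Called from the owning (main) thread: wait for the worker, then
    // re-throw whatever it caught so the normal handlers can report it.
    void joinAndRethrow()
    {
        _thread.join();
        if (_error)
        {
            std::rethrow_exception(_error);
        }
    }

private:
    std::thread _thread;
    std::exception_ptr _error;
};

// Hypothetical helper: runs a failing task and returns the message that
// reaches the calling thread.
std::string runFailingTask()
{
    GuardedWorker worker;
    worker.start([]() { throw std::runtime_error("kernel launch failed"); });
    try
    {
        worker.joinAndRethrow();
    }
    catch (const std::runtime_error& e)
    {
        return e.what();
    }
    return "no error";
}
```

The key point of the sketch is that nothing may throw on the worker thread outside the try block, otherwise the exception never reaches the owning thread.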
So this might actually be an ACE issue. @4ctrl-alt-del any thoughts on this? The command line that I used is in the log but I can try to come up with a more reproducible test case.
After closely inspecting the code you added for CUDA in ACE, I noticed an error that could cause this. If you look at ace_analytic_cudarunthread.cpp:139, you are calling the set-current method, which can throw exceptions, OUTSIDE of the try statement. This is a bug regardless and could cause the error you are getting.
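To illustrate the kind of fix being suggested, here is a rough sketch in which the setup call sits inside the same try block as the work itself. The function runGuarded and its setup/work parameters are hypothetical, and std::exception stands in for EException; this is not the actual ACE code.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a thread body such as CUDARunThread::run().
// `setup` plays the role of the "set current" call and `work` the role of
// the kernel execution; either one may throw.
std::string runGuarded(const std::function<void()>& setup,
                       const std::function<void()>& work)
{
    try
    {
        setup(); // inside the try block, so a failure here is captured too
        work();
    }
    catch (const std::exception& e)
    {
        // Return the message instead of letting the exception escape the thread body.
        return e.what();
    }
    return "";
}
```

If setup() were called before the try statement, an exception from it would leave the thread body unhandled, which in a real thread typically ends the process without a readable message.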
@4ctrl-alt-del Good catch, I have wrapped that call in a separate try/catch statement and I will make a PR for it later. However, I am still getting the same error as before. See this line in particular:
[node0181:14603] [ 8] /home/btsheal/software/ACE/develop/lib/libacecore.so.0(_ZN4CUDA6Kernel7executeERKNS_6StreamE+0x143)[0x152603124be3]
This is how I know the exception is coming from CUDA::Kernel::execute(). But I don't see why it isn't being handled properly. I'm looking at this code:
    void CUDARunThread::run()
    {
        // ...
        try
        {
            _result = _worker->execute(_work).release();
            _result->moveToThread(thread());
            _result->setParent(this);
        }
        catch (EException e)
        {
            _exception = new EException(e);
        }
        // ...
    }
So any exception from a CUDA kernel should be caught, right? Would the moveToThread() call affect anything?
moveToThread() only changes the thread affinity of a Qt object (which thread's event loop handles it); it does not create or destroy threads.
It should be caught in any thread that is created by ACE. My guess is somehow a thread is being made outside of ACE that is not protected by a try and catch.
Note to self -- this error was caused by setting the CUDA block size too high (1024), and I was able to reproduce a similar error with the ACE example. So ACE/KINC should definitely exit more gracefully, I'm just still trying to figure out if the issue lies with ACE or with KINC.
A lot of times when KINC crashes I get a long stack trace like this one:
I can usually pick out where it's coming from (in this case I think a CUDA kernel failed to launch), but I really need to see the error message. I'm going to try to fix this by inserting try / catch statements into KINC but we may need to make some changes in ACE too, we'll see.
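One generic way to surface the message when an exception escapes every handler is a custom terminate handler. The sketch below uses only the C++ standard library; describeException and installVerboseTerminate are hypothetical names. It also assumes the exception derives from std::exception, so if EException does not, it would need its own catch clause.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <exception>
#include <stdexcept>
#include <string>

// Turn a captured exception into a printable message.
std::string describeException(std::exception_ptr error)
{
    if (!error)
    {
        return "no active exception";
    }
    try
    {
        std::rethrow_exception(error);
    }
    catch (const std::exception& e)
    {
        return std::string("uncaught exception: ") + e.what();
    }
    catch (...)
    {
        // Assumption: an ACE-specific catch clause could be added before this one.
        return "uncaught exception of unknown type";
    }
}

// Print the message of the exception that triggered std::terminate, then abort.
void installVerboseTerminate()
{
    std::set_terminate([]() {
        const std::string message = describeException(std::current_exception());
        std::fprintf(stderr, "%s\n", message.c_str());
        std::abort();
    });
}
```

Calling installVerboseTerminate() once near the start of main() would print the message ahead of any stack trace. It does not replace proper try / catch handling in the worker threads.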