SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License

Exception handling for CUDA/OpenCL errors #97

Open bentsherman opened 5 years ago

bentsherman commented 5 years ago

A lot of times when KINC crashes I get a long stack trace like this one:

terminate called after throwing an instance of 'EException*'
[node0181:14603] *** Process received signal ***
[node0181:14603] Signal: Aborted (6)
[node0181:14603] Signal code:  (-6)
[node0181:14603] [ 0] /lib64/[0x1525f65555d0]
[node0181:14603] [ 1] /lib64/[0x1525f56f92c7]
[node0181:14603] [ 2] /lib64/[0x1525f56fa9b8]
[node0181:14603] [ 3] /software/gcc/5.4.0/lib64/[0x1525f625997d]
[node0181:14603] [ 4] /software/gcc/5.4.0/lib64/[0x1525f62579f6]
[node0181:14603] [ 5] /software/gcc/5.4.0/lib64/[0x1525f6257a41]
[node0181:14603] [ 6] /software/gcc/5.4.0/lib64/[0x1525f6257c59]
[node0181:14603] [ 7] /home/btsheal/software/ACE/develop/lib/[0x152603121e9f]
[node0181:14603] [ 8] /home/btsheal/software/ACE/develop/lib/[0x152603124be3]
[node0181:14603] [ 9] kinc[0x462562]
[node0181:14603] [10] kinc[0x4582a2]
[node0181:14603] [11] /home/btsheal/software/ACE/develop/lib/[0x152603112f46]
[node0181:14603] [12] /software/Qt/5.9.2/lib/[0x1525f6a9a06d]
[node0181:14603] [13] /lib64/[0x1525f654ddd5]
[node0181:14603] [14] /lib64/[0x1525f57c102d]
[node0181:14603] *** End of error message ***
/pscratch/scratch4/btsheal/benchmark-nf/work/e8/7afb360a11dbef9eae776a2eae4f90/ line 7: 14603 Aborted                 (core dumped) taskset -c 0-1 kinc run similarity --input Yeast.emx --ccm Yeast.ccm --cmx Yeast.cmx --clusmethod gmm --corrmethod spearman --preout true --postout true --bsize 32768 --gsize 4096 --lsize 1024

I can usually pick out where it's coming from (in this case I think a CUDA kernel failed to launch), but I really need to see the error message. I'm going to try to fix this by inserting try/catch statements into KINC, though we may need to make some changes in ACE too.
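As a rough sketch of what "inserting try/catch statements" could look like at the top level: the wrapper below turns any escaping exception into a printed message and a nonzero exit code instead of an abort. Everything here is hypothetical and uses only the standard library. `launchKernel`, `runGuarded`, and the 512 threshold are made-up stand-ins, not KINC or ACE code, and ACE's own exception type would need its own handler.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a failing kernel launch (not a real KINC/ACE call).
// The 512 limit is arbitrary and only exists to trigger the error path.
void launchKernel(int blockSize)
{
   if (blockSize > 512)
   {
      throw std::runtime_error(
         "kernel launch failed: block size " + std::to_string(blockSize)
         + " is too large");
   }
}

// Runs a task and converts any escaping exception into a message on stderr
// and a nonzero return code, rather than letting std::terminate() abort.
int runGuarded(const std::function<void()>& task)
{
   try
   {
      task();
      return 0;
   }
   catch (const std::exception& e)
   {
      std::cerr << "error: " << e.what() << std::endl;
      return 1;
   }
   catch (...)
   {
      // Anything not derived from std::exception (including thrown pointers).
      std::cerr << "error: unknown exception" << std::endl;
      return 1;
   }
}
```

Calling `runGuarded([] { launchKernel(1024); })` prints the message and returns 1, which is the behavior I would want in place of the stack trace above.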

bentsherman commented 5 years ago

So I followed this stack trace and I'm looking at a few functions in particular in ACE:


And from these functions I can tell that ACE is supposed to catch any ACE-specific exceptions from worker threads and re-throw them in the main thread so that they are properly handled. So it seems like if CUDA::Kernel::execute() threw an exception then it should have been handled properly, but it wasn't.
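For reference, the catch-in-worker, re-throw-in-main pattern described above can be sketched with the standard library alone. This is not ACE's implementation (ACE uses Qt threads and its own exception type); `runOnWorker` is a hypothetical helper used only to illustrate the idea.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Runs `work` on a separate thread. If it throws, the exception is captured
// inside the worker and re-thrown on the calling thread after join(), so the
// caller's normal error handling sees it.
void runOnWorker(const std::function<void()>& work)
{
   std::exception_ptr error;

   std::thread worker([&]() {
      try
      {
         work();
      }
      catch (...)
      {
         // Nothing may escape the thread function, or std::terminate() runs.
         error = std::current_exception();
      }
   });

   worker.join();

   if (error)
   {
      std::rethrow_exception(error);
   }
}
```

The key property is that every statement in the thread body that can throw sits inside the try block; anything left outside it ends in `std::terminate()`, which matches the "terminate called" line in the trace.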

So this might actually be an ACE issue. @4ctrl-alt-del any thoughts on this? The command line that I used is in the log but I can try to come up with a more reproducible test case.

4ctrl-alt-del commented 5 years ago

After closely inspecting the code you added for CUDA in ACE I did notice an error you made that could cause this. If you look at ace_analytic_cudarunthread.cpp:139, you are calling the set current method, which can throw exceptions, OUTSIDE of the try statement. This is a bug regardless and could cause the error you are getting.

bentsherman commented 5 years ago

@4ctrl-alt-del Good catch, I have wrapped that call in a separate try/catch statement and I will make a PR for it later. However, I am still getting the same error as before. See this line in particular:

[node0181:14603] [ 8] /home/btsheal/software/ACE/develop/lib/[0x152603124be3]

This is how I know the exception is coming from CUDA::Kernel::execute(). But I don't see why it isn't being handled properly. I'm looking at this code:

void CUDARunThread::run()
{
   // ...
   try {
      _result = _worker->execute(_work).release();
   }
   catch (EException e) {
      _exception = new EException(e);
   }
   // ...
}

So any exception from a CUDA kernel should be caught, right? Would the moveToThread() call affect anything?
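One thing that may be worth ruling out: the first line of the trace says the thrown object is an 'EException*', i.e. a pointer. In C++ a handler written for a type does not match a thrown pointer to that type. Whether that is what happens here is only a guess, since I have not confirmed where the pointer is thrown. The sketch below uses a hypothetical `DemoException` in place of ACE's EException to show the matching rule.

```cpp
#include <string>

// Hypothetical exception type standing in for ACE's EException.
struct DemoException
{
   std::string message;
};

// Reports which handler matched. A handler for `DemoException` (by value or
// by reference) is skipped when a `DemoException*` is thrown, because the
// pointer type is unrelated to the object type during handler lookup.
std::string whichHandler(bool throwPointer)
{
   static DemoException instance{"kernel failed"};
   try
   {
      if (throwPointer)
      {
         throw &instance;   // thrown type is DemoException*
      }
      throw instance;       // thrown type is DemoException
   }
   catch (DemoException)
   {
      return "value";
   }
   catch (DemoException*)
   {
      return "pointer";
   }
   catch (...)
   {
      return "other";
   }
}
```

Here `whichHandler(false)` returns "value" and `whichHandler(true)` returns "pointer". If the pointer handler were missing and nothing further up caught it, the thrown pointer would end in `std::terminate()`.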

4ctrl-alt-del commented 5 years ago

moveToThread() simply moves the thread ownership of a Qt object; it does nothing in regard to thread creation/destruction.

It should be caught in any thread that is created by ACE. My guess is somehow a thread is being made outside of ACE that is not protected by a try and catch.

bentsherman commented 4 years ago

Note to self -- this error was caused by setting the CUDA block size too high (1024), and I was able to reproduce a similar error with the ACE example. So ACE/KINC should definitely exit more gracefully; I'm still trying to figure out whether the issue lies with ACE or with KINC.
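A possible way to fail early with a readable message is to validate the block size before launching. The sketch below is plain C++ under stated assumptions: `checkBlockSize` is a hypothetical helper, and the limit is passed in by the caller. With the CUDA driver API the per-kernel limit could be queried using `cuFuncGetAttribute` with `CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK`; that limit can be lower than the device maximum depending on the kernel's resource usage.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Throws a descriptive error if the requested block (local work) size is not
// usable. `maxThreadsPerBlock` is an assumption supplied by the caller, for
// example a value queried from the CUDA runtime for the kernel in question.
void checkBlockSize(int requested, int maxThreadsPerBlock)
{
   assert(maxThreadsPerBlock > 0);

   if (requested < 1 || requested > maxThreadsPerBlock)
   {
      throw std::invalid_argument(
         "block size " + std::to_string(requested)
         + " is outside the supported range [1, "
         + std::to_string(maxThreadsPerBlock) + "]");
   }
}
```

A check like this would not explain why the original exception escaped, but it would turn the `--lsize 1024` case into an ordinary error message before any kernel is launched.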