SystemsGenetics / ACE

Accelerated Computational Engine (ACE) is a GPU-enabled framework to simplify creation of GPU-capable applications
http://SystemsGenetics.github.io/ACE
GNU General Public License v2.0

CUDA Causing Unknown Exception #98

Closed MitchsGreer closed 5 years ago

MitchsGreer commented 5 years ago

There is an issue when CUDA uses an NVIDIA card as the default device for running operations: if the correct drivers are not installed, an unknown exception is thrown and the program exits when it tries to access the card. The problem does not arise when the user sets the CUDA setting to none.

4ctrl-alt-del commented 5 years ago

Hey @bentsherman, I don't think it is good behavior for the program to crash with "unknown exception" when the user's computer doesn't have an NVIDIA card in it. Can you change the CUDA device listing to handle the case where there are no drivers, like OpenCL does?

bentsherman commented 5 years ago

@MitchsGreer Could you provide more context about how you are running your ACE application? Like what system are you using and how did you install your application? Also please provide a log of your error.

@4ctrl-alt-del where does OpenCL handle this case? It seems to me that if the drivers aren't installed then the application would fail immediately because it won't be able to link to the driver libraries.

4ctrl-alt-del commented 5 years ago

@bentsherman What I meant to say is a computer can have an OpenCL library installed without any drivers from any actual implementation because it is open source. Also failing to find a library is a very easy to understand error for the user. The issue that needs to be resolved with CUDA is ACE dying with the only output for the user to decipher being "unknown exception".

MitchsGreer commented 5 years ago

@bentsherman I installed ACE as the KINC docs describe; I'm running Ubuntu 18.04. I then installed KINC and its dependencies, also following the docs. I did not install the NVIDIA drivers, as I am not using CUDA in the analytic I am working on. When I started the analytic, it ran through initializing the inputs and outputs and then crashed, saying "unknown exception caught". I tried setting the CUDA device to none, and the error no longer appeared when I ran the analytic. I then set my CUDA device back to 0, as it was to begin with, and got the exact same error: "Unknown exception caught!\n". It stands to reason that ACE was attempting to initialize CUDA or talk to the CUDA device in some way and was not able to, thus throwing an error.

bentsherman commented 5 years ago

@MitchsGreer Since ACE uses the CUDA Driver API, it is the driver itself that provides the library headers, so the NVIDIA drivers must be installed. On Ubuntu 18.04 there is a set of packages called nvidia-headless-* which provide a "dummy" CUDA driver API for situations exactly like yours.

Refer to the KINC Dockerfile: https://github.com/SystemsGenetics/KINC/blob/master/docker/Dockerfile#L16

@4ctrl-alt-del I have this case documented for KINC but if you have install docs for ACE you may want to update them accordingly. Basically the user needs to install either the headless NVIDIA driver or the actual NVIDIA driver.

bentsherman commented 5 years ago

@4ctrl-alt-del I also found this issue in the settings code which could be related? Not sure.

CUDA::Device* Settings::cudaDevicePointer() const
{
    // If the device index is out of range then return a null pointer, else return a
    // pointer to the device with the device index.
-   if ( _cudaDevice < 0 || _cudaDevice >= CUDA::Device::get(_cudaDevice)->size() )
+   if ( _cudaDevice < 0 || _cudaDevice >= CUDA::Device::size() )
    {
       return nullptr;
    }

4ctrl-alt-del commented 5 years ago

Good find, that would however not cause an "unknown exception". Looks like that was a strange typo from your original CUDA PR that I missed on the review.
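For readers following the diff, here is a minimal sketch of why the corrected condition is safer. The `Device` class below is a hypothetical stand-in, not ACE's actual `CUDA::Device`; in particular, it assumes `get()` returns a null pointer for an out-of-range index, which may not match ACE. Under that assumption the original line would dereference a null pointer before the range check finished, whereas asking a static `size()` needs no valid index at all.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a device registry; NOT ACE's real CUDA::Device.
class Device
{
public:
    explicit Device(int ordinal) : _ordinal(ordinal) {}
    int ordinal() const { return _ordinal; }

    // Number of known devices. Static, so no valid index is needed to ask.
    static int size() { return static_cast<int>(registry().size()); }

    // Assumption of this sketch: an out-of-range index yields nullptr.
    static Device* get(int index)
    {
        if ( index < 0 || index >= size() )
        {
            return nullptr;
        }
        return registry()[static_cast<std::size_t>(index)].get();
    }

    // Helper for this sketch only: pretend that `count` devices were detected.
    static void populate(int count)
    {
        registry().clear();
        for (int i = 0; i < count; ++i)
        {
            registry().emplace_back(new Device(i));
        }
    }

private:
    static std::vector<std::unique_ptr<Device>>& registry()
    {
        static std::vector<std::unique_ptr<Device>> devices;
        return devices;
    }

    int _ordinal;
};

// Mirrors the corrected check: compare the index against the static count
// before touching any device object.
Device* devicePointer(int deviceIndex)
{
    if ( deviceIndex < 0 || deviceIndex >= Device::size() )
    {
        return nullptr;
    }
    return Device::get(deviceIndex);
}
```

With no devices registered, `devicePointer(0)` returns a null pointer instead of dereferencing one; with two devices registered, indices 0 and 1 resolve and index 2 does not.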

If CUDA libraries are not present, wouldn't the program fail at startup saying something like "cannot find libcuda.so"? Something in CUDA is throwing an unknown exception that doesn't even follow the commonly accepted practice of using std::exception as a base class. I really can't help with this bug because I have zero knowledge of CUDA, nor did I add it to ACE. At a bare minimum, wherever this object is being thrown it needs to be caught and handled so the user can understand what is happening.
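As a rough illustration of the "catch it and report it" idea, the sketch below shows how an object that does not derive from std::exception falls through to a generic `catch (...)` handler unless its type is caught explicitly. `DriverError`, `openDevice`, and `describeFailure` are hypothetical names invented for this example; the type actually thrown inside ACE is not identified in this thread.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical error type that does NOT derive from std::exception,
// standing in for whatever object is actually being thrown.
struct DriverError
{
    int code;            // arbitrary code chosen for this sketch
    std::string message; // human-readable description
};

// Hypothetical call that fails the way a missing driver might.
void openDevice(bool driverPresent)
{
    if ( !driverPresent )
    {
        throw DriverError{1, "no CUDA-capable driver found"};
    }
}

// Runs the call and returns a message the user can act on.
std::string describeFailure(bool driverPresent)
{
    try
    {
        openDevice(driverPresent);
        return "ok";
    }
    catch (const DriverError& e)
    {
        // Handling the concrete type keeps the detail that catch (...) loses.
        return "CUDA error " + std::to_string(e.code) + ": " + e.message;
    }
    catch (const std::exception& e)
    {
        return std::string("error: ") + e.what();
    }
    catch (...)
    {
        // Without the first handler, a DriverError would end up here.
        return "unknown exception";
    }
}
```

Removing the first handler makes `describeFailure(false)` return only "unknown exception", which is the situation described above.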

bentsherman commented 5 years ago

@4ctrl-alt-del Yes, normally that is the error message that I get, although I think I've gotten the unknown exception message as well. I'm not sure what about @MitchsGreer's setup is causing the latter rather than the former. But I encounter errors like this on Palmetto from time to time, so I will keep an eye out during my KINC testing.

4ctrl-alt-del commented 5 years ago

To resolve this, I think I am going to set the CUDA default to none.
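A minimal sketch of what a "none by default" setting could look like, assuming a negative index is the sentinel for "no device" (consistent with the `_cudaDevice < 0` check in the snippet earlier in this thread). The `Settings` class and `NO_DEVICE` constant here are hypothetical and are not taken from ACE's source or from any commit.

```cpp
#include <string>

// Hypothetical sentinel meaning "do not use any CUDA device".
constexpr int NO_DEVICE = -1;

// Hypothetical settings holder; NOT ACE's actual Settings class.
class Settings
{
public:
    int cudaDevice() const { return _cudaDevice; }

    // Any negative index is normalized to the "none" sentinel.
    void setCudaDevice(int index) { _cudaDevice = ( index < 0 ) ? NO_DEVICE : index; }

    // True only when the user has explicitly opted in to a device.
    bool usesCuda() const { return _cudaDevice >= 0; }

    // Label suitable for printing in a settings listing.
    std::string cudaDeviceLabel() const
    {
        return usesCuda() ? std::to_string(_cudaDevice) : std::string("none");
    }

private:
    // Defaulting to "none" means CUDA is never touched unless requested.
    int _cudaDevice {NO_DEVICE};
};
```

With this default, a freshly constructed `Settings` reports "none", so a machine without NVIDIA drivers never reaches the CUDA code path unless the user selects a device.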

4ctrl-alt-del commented 5 years ago

Set default CUDA device to none in commit d89fcf9da9faffebb2251687b285f026c6c3217d.