deepgram / kur

Descriptive Deep Learning
Apache License 2.0

Failed to get device utilization #80

Closed ghost closed 6 years ago

ghost commented 7 years ago

I'm running Kur on a friend's laptop, powered by an Nvidia GT740M. When using nvidia-smi to get device utilization, or to list the apps using the GPU, the tool reports 'Unsupported'. After searching, I found that the driver does not support this query on certain devices. So when running Kur with TensorFlow-GPU I hit that error; yet after commenting out the offending line in the kur package, the library works and uses my GPU. Maybe you should handle that case instead of failing when the device does not support the utilization query.

nivedita commented 7 years ago

@Medisex64 could you please tell me which part of the code (which file) you commented out? I am running into a similar issue: kur fails to detect my GPU. Thanks so much!

ghost commented 6 years ago

I opened this file in my conda environment: https://github.com/deepgram/kur/blob/master/kur/utils/cuda.py, searched for "Failed to get device utilization", and commented out these two lines:

if func(device.handle, ctypes.byref(result)):
    raise CudaError('Failed to get device utilization.')

As I mentioned before, this function only seems to be used to get device utilization, maybe so Kur can better manage the workload. Commenting it out did not make anything worse, except that it got Kur working!

If you look at the error produced when running Kur, you'll get the full path to the file you should edit; the file is in your conda environment's site-packages.
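Rather than deleting the check outright, a gentler patch would be to treat the driver's "not supported" return code as "utilization unknown" and only raise on other failures. This is a sketch, not Kur's actual code: the wrapper name is illustrative, and `NVML_ERROR_NOT_SUPPORTED = 3` is taken from the NVML headers (verify against your driver version).

```python
import ctypes

# NVML return code for queries the driver does not support
# (value 3 in the NVML headers; an assumption here, verify locally).
NVML_ERROR_NOT_SUPPORTED = 3

def get_utilization(func, handle):
    """Call an NVML-style utilization query.

    Returns the utilization value, or None when the device/driver
    reports the query as unsupported, instead of raising.
    """
    result = ctypes.c_uint(0)
    rc = func(handle, ctypes.byref(result))
    if rc == NVML_ERROR_NOT_SUPPORTED:
        return None  # utilization unknown on this device (e.g. GT740M)
    if rc:
        raise RuntimeError('Failed to get device utilization.')
    return result.value
```

With a wrapper like this, callers can decide what an unknown utilization means (for example, treat the device as idle), instead of Kur aborting outright.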

ajsyp commented 6 years ago

Kur definitely uses that sort of utilization estimate to determine how to spread out the workload in multi-GPU systems. It attempts to grab utilization, as you point out, in cuda.py, which it then uses in backend.py.

I'm open to different solutions here, provided that Kur can still intelligently auto-assign in multi-GPU environments. Maybe the simplest thing would be something like:

settings:
  backend:
    skip_check: yes

Setting skip_check (or we could call it skip_utilization_check or assume_not_busy or ...) to True would tell Kur to "make its best guess about utilization, but if that fails, just assume that no GPUs are being used, and grab them." Thoughts?
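The fallback could be as simple as catching the utilization error per device and scoring that device as idle. A hypothetical sketch, not Kur's actual backend code (`pick_devices`, the injected `query_utilization` callable, and this `CudaError` stand-in are all illustrative):

```python
class CudaError(RuntimeError):
    """Stand-in for Kur's CudaError from kur/utils/cuda.py."""

def pick_devices(devices, wanted, query_utilization, skip_check=False):
    """Return the `wanted` least-utilized devices.

    With skip_check=True, a device whose utilization query fails is
    assumed to be idle (0% busy) rather than aborting the run.
    """
    usage = {}
    for dev in devices:
        try:
            usage[dev] = query_utilization(dev)
        except CudaError:
            if not skip_check:
                raise  # preserve today's fail-fast behavior
            usage[dev] = 0  # assume the GPU is not busy
    return sorted(devices, key=lambda d: usage[d])[:wanted]
```

This keeps the intelligent auto-assignment in multi-GPU environments when the query works, and degrades to "grab any device" only when the driver cannot report utilization.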

nivedita commented 6 years ago

@Medisex64 thanks. I had commented out backend.py before, but that did not work, as the exception comes from cuda.py. Thanks so much for your help. It's working now. :)

ajsyp commented 6 years ago

Just for the record, commit d0b632b does, in fact, add a skip_check to the backend configuration, which can be used to skip the utilization check by setting it to "yes".