Dasharo / dasharo-issues

The Dasharo issue tracker
https://dasharo.com/

CUDA doesn't work on MSI boards #1135

Open frantic-from-paranoiawf opened 1 week ago

frantic-from-paranoiawf commented 1 week ago

Component

Dasharo firmware

Device

MSI Pro Z790-P

Dasharo version

Latest dasharo branch

Dasharo Tools Suite version

No response

Test case ID

No response

Brief summary

Nvidia CUDA with Torch does not work on MSI boards. It fails to initialize. Display does work though.

How reproducible

100% of the time.

How to reproduce

  1. Install proprietary Nvidia drivers
  2. Insert RTX card into top slot
  3. Boot into Linux with display connected to the Nvidia card
  4. Try to load a program that uses CUDA (e.g. oobabooga)

Expected behavior

Torch + CUDA program starts with no errors, can use CUDA for deep-learning successfully.

Actual behavior

Torch CUDA program returns the following error:

RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu
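To narrow down where this error comes from, it can help to query Torch directly, outside of oobabooga. The following is a minimal sketch, not part of the original report: `cuda_status` is a hypothetical helper name, and it only assumes that PyTorch may or may not be installed in the active venv.

```python
import importlib.util


def cuda_status() -> dict:
    """Report whether Torch is installed and whether it can initialize CUDA."""
    status = {"torch_installed": False, "cuda_available": False, "error": None}

    # Avoid an ImportError if Torch is missing from the current environment.
    if importlib.util.find_spec("torch") is None:
        return status

    import torch

    status["torch_installed"] = True
    status["cuda_available"] = bool(torch.cuda.is_available())

    try:
        # Forces driver initialization; this is the step that fails in the report.
        torch.cuda.init()
    except Exception as exc:  # RuntimeError, or AssertionError on CPU-only builds
        status["error"] = str(exc)

    return status


if __name__ == "__main__":
    print(cuda_status())
```

If `error` carries the same "CUDA driver initialization failed" text, the problem sits below the application, in the driver or firmware interaction, rather than in oobabooga itself.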

Screenshots

image

Additional context

  - Option ROM loading in UEFI is enabled
  - The AI program is oobabooga
  - Using Gentoo Linux (all dependencies are installed, contained, and run inside a Python venv; Gentoo isn't the cause)
  - Display output and CUDA work perfectly and quickly on vendor firmware
  - X11 is being used

Solutions you've tried

  - Loading the Nvidia driver explicitly first in xorg.conf
  - Disabling Option ROM loading in firmware
  - Recloning oobabooga and redownloading all dependencies
  - Enabling Resizable BAR in the firmware

frantic-from-paranoiawf commented 1 week ago

Closing this as it has been solved. From what I can tell (I don't know why), the kernel-open modules work with CUDA on vendor firmware, but on Dasharo you need the proprietary modules for CUDA. That might not be the actual cause, but it's what fixed it for me.
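For anyone comparing the two module flavors, a quick way to tell which one is loaded may be useful. The sketch below is an assumption-laden illustration: it assumes the driver exposes its version string at `/proc/driver/nvidia/version` and that the open kernel modules include the words "Open Kernel Module" in that string. The function names are hypothetical.

```python
from pathlib import Path

# Assumption: the loaded NVIDIA kernel driver publishes its version string here.
VERSION_FILE = Path("/proc/driver/nvidia/version")


def classify_module(version_text: str) -> str:
    """Return 'open', 'proprietary', or 'unknown' for a driver version string."""
    if "NVRM version" not in version_text:
        return "unknown"
    # Assumption: the open kernel modules identify themselves with this phrase.
    if "Open Kernel Module" in version_text:
        return "open"
    return "proprietary"


def loaded_module_flavor(path: Path = VERSION_FILE) -> str:
    """Classify the currently loaded module, or 'unknown' if it cannot be read."""
    try:
        return classify_module(path.read_text())
    except OSError:
        # Driver not loaded, or the file is not readable.
        return "unknown"


if __name__ == "__main__":
    print(loaded_module_flavor())
```

Running this on both vendor firmware and Dasharo would confirm whether the same module flavor is in use in each case.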

miczyg1 commented 1 week ago

If CUDA doesn't work with the open drivers, it is still a valid issue. From the coreboot Matrix channel, I saw that it works with any driver on MSI firmware.

renehoj commented 1 week ago

I'm getting the same error on Qubes OS. This command solves the problem for me:

sudo nvidia-smi --id 000:00:01.0 --persistence-mode 1
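The persistence-mode workaround from this comment could be scripted so it runs before a CUDA program starts. This is only a sketch: `persistence_command` and `set_persistence` are hypothetical helper names, the GPU ID is copied from the comment as-is, and it assumes `sudo` and `nvidia-smi` are available on the PATH.

```python
import shutil
import subprocess


def persistence_command(gpu_id: str, enable: bool = True) -> list[str]:
    """Build the nvidia-smi invocation that toggles persistence mode for one GPU."""
    return [
        "sudo", "nvidia-smi",
        "--id", gpu_id,
        "--persistence-mode", "1" if enable else "0",
    ]


def set_persistence(gpu_id: str, enable: bool = True) -> bool:
    """Run the command; return True only if nvidia-smi exists and exits cleanly."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(persistence_command(gpu_id, enable), check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # GPU ID taken verbatim from the comment; adjust it for your own system.
    print(set_persistence("000:00:01.0"))
```

Persistence mode set this way does not survive a reboot, so the call would need to be repeated on each boot.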