CFSworks / nvml_fix

A workaround for an annoying bug in nVidia's NVML library. Allows nvidia-smi to work once more!

920M GPU driver loss #36

Closed raddyfiy closed 3 years ago

raddyfiy commented 3 years ago

x64 Ubuntu 18.04 + gcc version 6.5.0 20181026 + Nvidia GeForce 920M + driver 418.152.00

Before installing nvml_fix:

nvidia-smi shows some info, TensorFlow can detect the GPU, and the nvidia-settings program shows all the info (screenshots attached).

After:

(base) sheng$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

TensorFlow says it can't find the GPU driver, and the nvidia-settings program can't show anything (screenshots attached).

Install command:

make TARGET_VER=418.152.00
# in another shell:
cd /usr/lib/x86_64-linux-gnu
sudo mv libnvidia-ml.so.1 libnvidia-ml.so.1_backup
sudo make install TARGET_VER=418.152.00 libdir=/usr/lib/x86_64-linux-gnu

After I found the problem, I tried to recover by renaming libnvidia-ml.so.1_backup back to libnvidia-ml.so.1, but it still doesn't work. I don't know why; maybe other files were changed that I don't know about?

Here are the log files: nvidia-libs.txt nvidia-smi.txt nvidia-smi-ltrace.txt nvidia-smi-strace.txt. Thank you.
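
One way to check whether the rename really restored the original library is to list every libnvidia-ml.so* entry next to the file it resolves to. The following is only a sketch: `show_nvml_libs` is a hypothetical helper name, the directory is the libdir from the install command above, and `readlink -f` is assumed to be the GNU coreutils version that ships with Ubuntu.

```shell
#!/bin/sh
# show_nvml_libs is a hypothetical helper: for every libnvidia-ml.so* entry in
# a directory, print the entry and the file it finally resolves to.
show_nvml_libs() {
    dir="$1"
    for f in "$dir"/libnvidia-ml.so*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || [ -L "$f" ] || continue
        printf '%s -> %s\n' "$f" "$(readlink -f "$f")"
    done
}

# libdir taken from the install command in this thread.
show_nvml_libs /usr/lib/x86_64-linux-gnu

# What the dynamic linker cache currently maps the name to (may print nothing).
ldconfig -p 2>/dev/null | grep libnvidia-ml || true
```

If libnvidia-ml.so.1 resolves to the driver's versioned library again, the library side is back to normal and the failure has to come from somewhere else.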

tofurky commented 3 years ago

hey, can you please attach strace + ltrace of it without nvml_fix installed (with it working as normal)?

tofurky commented 3 years ago

also maybe try this with nvml_fix still installed:

ls -l /dev/nvidiactl

if it does not exist:

sudo mknod /dev/nvidiactl c 195 255
sudo chmod 666 /dev/nvidiactl
nvidia-smi

from the debug output you attached, it appears to not exist
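
The check above can be wrapped in a small script that only reports the state of the node and prints the recreate commands instead of running them. This is a sketch, not part of nvml_fix: `check_node` is a hypothetical helper name, and the major/minor numbers are simply the ones used in this thread.

```shell
#!/bin/sh
# check_node is a hypothetical helper: report whether a path exists as a
# character device node.
check_node() {
    node="$1"
    if [ -c "$node" ]; then
        echo "present: $node"
        return 0
    fi
    echo "missing: $node"
    return 1
}

# Major 195 / minor 255 are the values used for /dev/nvidiactl in this thread.
# The script only prints the commands; it does not create anything itself.
if ! check_node /dev/nvidiactl; then
    echo "to recreate: sudo mknod /dev/nvidiactl c 195 255 && sudo chmod 666 /dev/nvidiactl"
fi
```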

tofurky commented 3 years ago

i can get the same error if i remove /dev/nvidiactl:

matt@aquos:~$ nvidia-smi
Sat Jan  9 03:22:09 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 650...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   30C    P8    18W / 144W |    301MiB /  1991MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       703      G   /usr/lib/xorg/Xorg                198MiB |
|    0   N/A  N/A      3624      G   ...gAAAAAAAAA --shared-files      107MiB |
+-----------------------------------------------------------------------------+
matt@aquos:~$ sudo rm /dev/nvidiactl 
matt@aquos:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

raddyfiy commented 3 years ago

I reinstalled the system and now it works normally. Here are the strace + ltrace of it without nvml_fix installed:

$ ls -l /dev/nvidiactl
crw-rw-rw- 1 root root 195, 255 1月  10  2021 /dev/nvidiactl

nvidia-smi-ltrace.txt nvidia-smi-strace.txt

raddyfiy commented 3 years ago

So the problem is perhaps that "nvidiactl" disappeared. But I never deleted it, so did some software in use delete it?

tofurky commented 3 years ago

not sure, but i don't think it's related to nvml_fix.

on my system /lib/udev/rules.d/71-nvidia.rules has this:

# This will create the device nvidia device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/sbin/ub-device-create"

so /dev/nvidiactl should be automatically created at boot. it is also created if nvidia-smi is run as root.
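
To see whether a system has a rule like this at all, the rule files that mention nvidia can be listed. A rough sketch, assuming the rules live under /lib/udev/rules.d as on the system above (the file name and location may differ per distribution and driver package); `list_nvidia_rules` is a hypothetical helper name.

```shell
#!/bin/sh
# list_nvidia_rules is a hypothetical helper: print the *.rules files under a
# directory that mention "nvidia" (case-insensitive). Returns non-zero when
# the directory is missing or nothing matches.
list_nvidia_rules() {
    dir="$1"
    [ -d "$dir" ] || return 1
    grep -l -i nvidia "$dir"/*.rules 2>/dev/null
}

list_nvidia_rules /lib/udev/rules.d \
    || echo "no nvidia udev rules found under /lib/udev/rules.d"
```

If nothing is listed, the node would not be recreated automatically at boot by udev, which is one possible explanation for it going missing.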

raddyfiy commented 3 years ago

I installed nvml_fix again, and this time it works. The cause is still unknown, so maybe another program was responsible. It still doesn't show processes, but I checked and found that my GPU doesn't support reporting processes. With the fix, NVML successfully adds a memory percentage. So thank you very much for helping me :)

tofurky commented 3 years ago

glad it's working now. please update if you figure out the reason /dev/nvidiactl didn't exist.