raddyfiy closed this issue 3 years ago
hey, can you please attach strace + ltrace of it without nvml_fix installed (with it working as normal)?
also maybe try this with nvml_fix still installed:
ls -l /dev/nvidiactl
if it does not exist:
sudo mknod /dev/nvidiactl c 195 255
sudo chmod 666 /dev/nvidiactl
nvidia-smi
from the debug output you attached, it appears to not exist
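The check above can be wrapped in a small script. This is only a sketch: `dev_numbers` and `check_nvidiactl` are hypothetical helper names, and it assumes GNU coreutils `stat` (its `%t`/`%T` formats print the major/minor numbers in hex). It only inspects the node; it does not create it.

```shell
#!/bin/sh
# Print "major minor" (decimal) for a character device.
# Fails if the path is missing or is not a character device.
# Assumption: GNU stat, whose %t / %T give major / minor in hex.
dev_numbers() {
    [ -c "$1" ] || return 1
    printf '%d %d\n' "0x$(stat -c '%t' "$1")" "0x$(stat -c '%T' "$1")"
}

# Hypothetical helper: report whether the control node exists and has the
# numbers used in the mknod command above (195, 255).
check_nvidiactl() {
    node="${1:-/dev/nvidiactl}"
    if nums=$(dev_numbers "$node"); then
        if [ "$nums" = "195 255" ]; then
            echo "ok: $node is char device 195 255"
        else
            echo "unexpected: $node is char device $nums"
        fi
    else
        echo "missing: $node"
        return 1
    fi
}
```

Calling `check_nvidiactl` with no argument checks /dev/nvidiactl. If it reports `missing`, the mknod and chmod commands above are the next step.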
i can get the same error if i remove /dev/nvidiactl:
matt@aquos:~$ nvidia-smi
Sat Jan 9 03:22:09 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 650...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   30C    P8    18W / 144W |    301MiB /  1991MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       703      G   /usr/lib/xorg/Xorg                198MiB |
|    0   N/A  N/A      3624      G   ...gAAAAAAAAA --shared-files      107MiB |
+-----------------------------------------------------------------------------+
matt@aquos:~$ sudo rm /dev/nvidiactl
matt@aquos:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I reinstalled the system and now it works normally. This is the strace + ltrace of it without nvml_fix installed:
$ ls -l /dev/nvidiactl
crw-rw-rw- 1 root root 195, 255 1月 10 2021 /dev/nvidiactl
So perhaps the problem is that /dev/nvidiactl disappeared. But I never deleted it, so did some software I was using delete it?
not sure, but i don't think it's related to nvml_fix.
on my system /lib/udev/rules.d/71-nvidia.rules has this:
# This will create the device nvidia device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/sbin/ub-device-create"
so /dev/nvidiactl should be automatically created at boot. it is also created if nvidia-smi is run as root.
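To see whether a similar rule is present on another machine, something like the following could be used. It is a sketch only: `find_nvidia_rules` is a hypothetical helper, and the default directory is taken from the path quoted above; other distributions may keep their rules elsewhere.

```shell
#!/bin/sh
# List udev rule lines that mention the nvidia driver path, with file name
# and line number.
# $1: rules directory (assumption: defaults to /lib/udev/rules.d)
find_nvidia_rules() {
    dir="${1:-/lib/udev/rules.d}"
    grep -Hn 'drivers/nvidia' "$dir"/*.rules 2>/dev/null
}
```

Running `find_nvidia_rules` with no argument searches the default directory. No output means no matching rule was found there, which would be one possible reason for the node not being created at boot.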
I installed nvml_fix again, and this time it's working. The cause is unknown, so maybe another program was responsible. It still doesn't show processes, but I checked and found that my GPU doesn't support querying processes. nvml_fix adds the memory percentage successfully. So thank you very much for helping me :)
glad it's working now. please update if you figure out the reason /dev/nvidiactl didn't exist.
x64 Ubuntu 18.04 + gcc version 6.5.0 20181026 + Nvidia GeForce 920M + driver 418.152.00
Before installing nvml_fix:
nvidia-smi shows some info, TensorFlow can detect the GPU, and the nvidia-settings program shows all the info. Like this:
After:
TensorFlow says it can't find the GPU driver, and the nvidia-settings program can't show anything.
Install command:
After I found the problem, I wanted to recover, so I renamed libnvidia-ml.so.1_backup back to libnvidia-ml.so.1, but it still doesn't work. I don't know why; maybe other files were changed that I don't know about?
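One thing that can be checked after a rename like this is whether libnvidia-ml.so.1 still resolves to a real file, since the name is often a symlink. The helper below is a hypothetical sketch, not part of nvml_fix. It assumes GNU `readlink -f`, and the library's directory is not given in this report, so it has to be passed in.

```shell
#!/bin/sh
# Print where a library path finally resolves, and flag dangling symlinks.
# $1: path to check, e.g. the restored libnvidia-ml.so.1 (the directory
#     varies by distribution and is not stated in this thread)
resolve_lib() {
    target=$(readlink -f "$1") || { echo "unresolvable: $1"; return 1; }
    if [ -f "$target" ]; then
        echo "$1 -> $target"
    else
        echo "dangling: $1 -> $target"
        return 1
    fi
}
```

`ldconfig -p | grep libnvidia-ml` lists the copies the dynamic linker knows about, and each listed path can then be passed to `resolve_lib`.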
These are the log files: nvidia-libs.txt, nvidia-smi.txt, nvidia-smi-ltrace.txt, nvidia-smi-strace.txt. Thank you