Closed: arclance closed this issue 5 years ago.
sudo make install TARGET_VER=325.15 PREFIX=/usr
nvidia-smi -q -a
Mismatch in versions between nvidia-smi and NVML.
Are you sure you are using nvidia-smi provided with the driver?
Failed to properly shut down NVML: Function Not Found
How can we use it with the new driver 325.15? Thanks!
I just got back from vacation and upgraded my personal computer to 325.15... I'll work on a fix either today or tomorrow.
Works for me:
$ make clean
$ sudo make install PREFIX=/usr TARGET_VER=325.15
Could you show me the output of "ls -l /usr/lib/libnvidia-ml*"? Maybe the symlinks got messed up.
$ make clean
rm -f libnvidia-ml.so.1
rm -f libnvidia-ml.so.319.32
$ ls -l /usr/lib/libnvidia-ml*
lrwxrwxrwx 1 root root     17 Aug 8 17:48 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx 1 root root     22 Aug 8 17:48 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.325.15
-rwxr-xr-x 1 root root 550512 Aug 8 17:48 /usr/lib/libnvidia-ml.so.325.15
$ sudo make install PREFIX=/usr TARGET_VER=325.15
gcc -shared -fPIC empty.c -o libnvidia-ml.so.325.15
gcc -shared -fPIC -o libnvidia-ml.so.1 -DNVML_PATCH_319 -DNVML_VERSION=\"325.15\" libnvidia-ml.so.325.15 nvml_fix.c
/usr/bin/install -D -Dm755 libnvidia-ml.so.1 /usr/lib/libnvidia-ml.so.1
$ ls -l /usr/lib/libnvidia-ml*
lrwxrwxrwx 1 root root     17 Aug 8 17:48 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
-rwxr-xr-x 1 root root  12831 Aug 9 14:07 /usr/lib/libnvidia-ml.so.1
-rwxr-xr-x 1 root root 550512 Aug 8 17:48 /usr/lib/libnvidia-ml.so.325.15
$ nvidia-smi -q -a
Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found
Thanks!
@CFSworks That works for me as well.
@millecker Have you re-cloned or updated the source on your computer since the build system update? If you have not, doing that might help.
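For example, something along these lines (a sketch; the install command is the one quoted earlier in this thread):
# start from a fresh checkout so the new build system is actually used
git clone https://github.com/CFSworks/nvml_fix.git
cd nvml_fix
make clean
sudo make install PREFIX=/usr TARGET_VER=325.15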
I tried it again with a fresh clone of your repository, but I still get the same error:
$ nvidia-smi -q -a
Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found
And after a reboot I get the following message:
$ nvidia-smi -a -q
Mismatch in versions between nvidia-smi and NVML.
Are you sure you are using nvidia-smi provided with the driver?
Failed to properly shut down NVML: Function Not Found
Works for me with 325.15 (Fedora 19 64-bit, kernel 3.10 + nvidia 325.15).
Quick and dirty (without install):
git clone https://github.com/CFSworks/nvml_fix.git
cd nvml_fix
make TARGET_VER=325.15
rm libnvidia-ml.so.325.15
export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
nvidia-smi
Thanks for this patch.
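As a quick sanity check that nvidia-smi really picks up the shim from the build directory, something like this should work (a sketch, run from the nvml_fix checkout after the steps above):
# ldd honours LD_LIBRARY_PATH, so this shows which libnvidia-ml.so.1 nvidia-smi would load
LD_LIBRARY_PATH=$PWD ldd "$(which nvidia-smi)" | grep libnvidia-ml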
I'm getting
Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found
too.
Later edit: deleting the generated libnvidia-ml.so.325.15 as @baoboa says and replacing it with the NVIDIA-provided one fixes it; maybe you could rm libnvidia-ml.so.$(TARGET_VER) in the Makefile if the file is not actually needed?
I got that when I don't keep the original .so.325.15 lib.
I have tried this shim with a patched nvidia 325.15 driver (from Ubuntu's xorg-edgers repository) with both 3.11 and 3.12-rc3 kernels on 2 different machines, but I get the message:
"Mismatch in versions between nvidia-smi and NVML. Are you sure you are using nvidia-smi provided with the driver? Failed to properly shut down NVML: Function Not Found"
I am assuming this is because the patch interferes with the shim. The patch I used is here: http://leigh123linux.fedorapeople.org/pub/patches/kernel_v3.11.patch
The name is a misnomer... it actually works with kernel 3.11 and the 3.12-rc versions as well. Like the other poster in the other issue, I'm able to compile and run nvml_bug.c as:
Faraday:~/Desktop/nvml_fix-master$ gcc nvml_bug.c -o test -I. -L/usr/lib/nvidia-325 -lnvidia-ml
Faraday:~/Desktop/nvml_fix-master$ optirun ./test
and I get the correct output after adding 325.15 to the version check list:
Found 1 device(s):
Device 0, "GeForce GT 740M":
---- WITHOUT BUGFIX ----
Utilization: Not Supported
Power usage: Not Supported
---- WITH BUGFIX ----
Utilization: 0% GPU, 0% MEM
Power usage: Not Supported
I'm a bit at a loss as to why the shim (nvml_fix.c) doesn't work. The shim doesn't compile with v5 of nvml.h, but with v3 and v4 I get the same "Mismatch ... Not Found" output as above. When I compile and run nvml_bug.c, I'm able to do so with the correct output with v3, v4, and v5 of nvml.h.
That being said, that output alone might suit my purposes better than calling nvidia-smi since I just need to get the GPU and memory utilization for a particular CUDA code.
I use it fine with 3.11. Did you manually copy the compiled libnvidia-ml.so.1 file, thus overwriting the original libnvidia-ml.so.325.15 (which is wrong), or did you first move/rename the original libnvidia-ml.so.1 out of the way and then put the compiled one in its place?
I renamed the libnvidia-ml.so.1 symlink in /usr/lib/nvidia-325 to libnvidia-ml.so.1.old and manually copied the libnvidia-ml.so.1 in the nvml_fix directory to /usr/lib/nvidia-325, did a chmod 777 on it just to be sure.
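For reference, the swap described above looks roughly like this (a sketch; /usr/lib/nvidia-325 is the Ubuntu layout mentioned earlier, and the source path is only an example):
cd /usr/lib/nvidia-325
sudo mv libnvidia-ml.so.1 libnvidia-ml.so.1.old        # move the original symlink out of the way
sudo cp ~/Desktop/nvml_fix-master/libnvidia-ml.so.1 .  # drop in the compiled shim
sudo chmod 755 libnvidia-ml.so.1                       # readable/executable is enough; 777 is not needed
sudo ldconfig                                          # refresh the dynamic linker cache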
Guys, I'm having the same issue as @millecker both with 319.49 (kernel 3.8 and 3.5) and 319.37 (kernel 3.5). I'm not overwriting anything, just using the same trick as @baoboa to test things out.
Any idea what could be causing it and how to fix it?
I had the same issue as @millecker on Ubuntu 12.04 x86_64. Following up on the success report of @baoboa, I tried compiling with gcc-4.4 (standard in Fedora 19) and that's what does the trick. The gcc version that worked for me: gcc-4.4 (Ubuntu/Linaro 4.4.7-1ubuntu2) 4.4.7
@CFSworks please include info about tested gcc versions in the README file. Any idea why this is an issue in the first place?
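For anyone who wants to try the same workaround: the compiler can be overridden on the make command line, which takes precedence over the CC = gcc line in the Makefile (a sketch; assumes a gcc-4.4 package is installed and uses the 325.15 version discussed in this thread):
make clean
make CC=gcc-4.4 TARGET_VER=325.15   # command-line variables override the Makefile's CC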
I think the problem @millecker and others describe is caused by a change introduced with gcc 4.5 in Debian and Ubuntu. https://lists.ubuntu.com/archives/ubuntu-devel-announce/2010-October/000772.html
As a consequence, gcc versions 4.5 and later pass "--as-needed" to the linker by default, and the resulting libnvidia-ml.so.1 is not linked against libnvidia-ml.so.<version>. You can check that with 'ldd libnvidia-ml.so.1'.
Not sure what the best way to solve this is but the following works for me.
diff --git a/Makefile b/Makefile
index 00a8ca5..ee7ce5b 100644
--- a/Makefile
+++ b/Makefile
@@ -14,7 +14,7 @@ ${TARGET:1=${TARGET_VER}}: empty.c
${CC} ${CFLAGS} -shared -fPIC $(<) -o $(@)
$(TARGET): ${TARGET:1=${TARGET_VER}}
- ${CC} ${CFLAGS} -shared -fPIC -o $(@) -DNVML_PATCH_${TARGET_MAJOR} -DNVML_VERSION=\"$(TARGET_VER)\" $< nvml_fix.c
+ ${CC} ${CFLAGS} -Wl,--no-as-needed -shared -fPIC -o $(@) -DNVML_PATCH_${TARGET_MAJOR} -DNVML_VERSION=\"$(TARGET_VER)\" $< nvml_fix.c
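To check whether this is what is happening on your build, the dependency can be inspected directly (a sketch; 325.15 is just the example version from this thread):
make clean
make TARGET_VER=325.15
readelf -d libnvidia-ml.so.1 | grep NEEDED   # a correct build lists libnvidia-ml.so.325.15 here
ldd ./libnvidia-ml.so.1                      # same check; without the fix the driver library is missing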
So it's a bug in Ubuntu and not in nvml_fix.
Where do you get that from? It is a change of defaults, not a bug. And as far as I know it's not limited to Ubuntu.
I'm getting the same error with Scientific Linux 6.5 and NVIDIA driver 331.62 (I know the patch is for other releases, but I'm trying anyway):
Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found
I have copied both generated files (libnvidia-ml.so.1 and libnvidia-ml.so.331.62) to /usr/{lib,lib64}/nvidia, and I removed the symbolic link beforehand, but I keep getting that error...
And I need to know whether a (non-Tesla) GPU is running a process... how, if nvidia-smi shows "Compute Process: N/A"?
Thanks.
@DanielRuizMolina the generated libnvidia-ml.so.331.62 is a dummy file, you're not supposed to copy that. Only libnvidia-ml.so.1 is needed. And also check where the nvidia driver installed the libnvidia* files. I'm pretty sure they are in /usr/lib/ and not /usr/lib/nvidia/. Last but not least, what's the output of 'ldd libnvidia-ml.so.1'?
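In other words, something along these lines should show where the driver put its copies and what the compiled shim actually links against (a sketch; adjust the paths for your distribution):
ls -l /usr/lib*/libnvidia-ml* /usr/lib*/nvidia/libnvidia-ml* 2>/dev/null   # locate the driver-provided NVML libraries
ldd ./libnvidia-ml.so.1                                                    # run from the nvml_fix build directory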
Hi,
In Scientific Linux, libnvidia-ml.so.1 is owned by the following packages:

yum provides */libnvidia-ml.so.1
Loaded plugins: refresh-packagekit, security
1:xorg-x11-drv-nvidia-libs-319.37-2.el6.i686 : Libraries for xorg-x11-drv-nvidia
Repo        : cuda
Matched from:
Filename    : /usr/lib/nvidia/libnvidia-ml.so.1

1:xorg-x11-drv-nvidia-libs-331.62-2.el6.i686 : Libraries for xorg-x11-drv-nvidia
Repo        : cuda
Matched from:
Filename    : /usr/lib/nvidia/libnvidia-ml.so.1

1:xorg-x11-drv-nvidia-libs-319.37-2.el6.x86_64 : Libraries for xorg-x11-drv-nvidia
Repo        : cuda
Matched from:
Filename    : /usr/lib64/nvidia/libnvidia-ml.so.1

1:xorg-x11-drv-nvidia-libs-331.62-2.el6.x86_64 : Libraries for xorg-x11-drv-nvidia
Repo        : cuda
Matched from:
Filename    : /usr/lib64/nvidia/libnvidia-ml.so.1

gpu-deployment-kit-331.62-0.x86_64 : NVIDIA® Cluster Management Tools
Repo        : cuda
Matched from:
Filename    : /usr/src/gdk/nvml/lib/libnvidia-ml.so.1

1:xorg-x11-drv-nvidia-libs-331.62-2.el6.x86_64 : Libraries for xorg-x11-drv-nvidia
Repo        : installed
Matched from:
Filename    : /usr/lib64/nvidia/libnvidia-ml.so.1

1:xorg-x11-drv-nvidia-libs-331.62-2.el6.i686 : Libraries for xorg-x11-drv-nvidia
Repo        : installed
Matched from:
Filename    : /usr/lib/nvidia/libnvidia-ml.so.1
What I see with "ls" in the /usr/lib/nvidia and /usr/lib64/nvidia folders:

lrwxrwxrwx 1 root root   22 jul 16 12:42 libnvidia-ml.so -> libnvidia-ml.so.331.62
lrwxrwxrwx 1 root root   22 jul 17 08:16 libnvidia-ml.so.1 -> libnvidia-ml.so.331.62
-rwxr-xr-x 1 root root 543K mar 20 02:35 libnvidia-ml.so.331.62
Binary "nvidia-smi" is a 64bit executable:
file /usr/bin/nvidia-smi --> /usr/bin/nvidia-smi: ELF 64-bit LSB
executable, x86-64, version 1 (SYSV), dynamically linked (uses
shared libs), for GNU/Linux 2.4.0, stripped
What I get with "ldd":

folder /usr/lib/nvidia:

[root@MYSYSTEM nvidia]# ldd libnvidia-ml.so.1
        linux-gate.so.1 => (0x00af1000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x00cd5000)
        libdl.so.2 => /lib/libdl.so.2 (0x00c0d000)
        libc.so.6 => /lib/libc.so.6 (0x0065b000)
        /lib/ld-linux.so.2 (0x00ba5000)

folder /usr/lib64/nvidia:

[root@MYSYSTEM nvidia]# ldd libnvidia-ml.so.1
        linux-vdso.so.1 => (0x00007ffff98f1000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd150879000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fd150675000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fd1502e0000)
        /lib64/ld-linux-x86-64.so.2 (0x00000037f8200000)
I have run these steps:

cat Makefile:

CC = gcc
CFLAGS =
TARGET_VER = 331.62# just set to a valid ver eg. one of: 325.08 325.15 319.32 319.23
#TARGET_VER = 325.15# just set to a valid ver eg. one of: 325.08 325.15 319.32 319.23
TARGET_MAJOR := $(shell echo ${TARGET_VER} | cut -d . --f=1)
TARGET = libnvidia-ml.so.1
DESTDIR = /
PREFIX = $(DESTDIR)usr
libdir = $(PREFIX)/lib
INSTALL = /usr/bin/install -D

all: $(TARGET)

${TARGET:1=${TARGET_VER}}: empty.c
        ${CC} ${CFLAGS} -shared -fPIC $(<) -o $(@)

$(TARGET): ${TARGET:1=${TARGET_VER}}
        ${CC} ${CFLAGS} -shared -fPIC -o $(@) -DNVML_PATCH_${TARGET_MAJOR} -DNVML_VERSION=\"$(TARGET_VER)\" $< nvml_fix.c

clean:
        rm -f $(TARGET)
        rm -f ${TARGET:1=${TARGET_VER}}

install: libnvidia-ml.so.1
        $(INSTALL) -Dm755 $(^) $(libdir)/$(^)

.PHONY: clean install all
[root@MYSYSTEM nvml_fix-master]# make TARGET_VER=331.62
gcc -shared -fPIC empty.c -o libnvidia-ml.so.331.62
gcc -shared -fPIC -o libnvidia-ml.so.1 -DNVML_PATCH_331 -DNVML_VERSION=\"331.62\" libnvidia-ml.so.331.62 nvml_fix.c

[root@MYSYSTEM nvml_fix-master]# ldd ./libnvidia-ml.so.1
        linux-vdso.so.1 => (0x00007fff2bcfc000)
        libnvidia-ml.so.331.62 => /usr/lib64/nvidia/libnvidia-ml.so.331.62 (0x00007f12a16da000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f12a1333000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f12a1116000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f12a0f12000)
        /lib64/ld-linux-x86-64.so.2 (0x00000037f8200000)
Now, if I delete the symbolic link "libnvidia-ml.so.1" in /usr/lib64/nvidia and copy in the file created by "make", I can run "nvidia-smi" with no problems, but I can't get process information:
[root@MYSYSTEM ~]# nvidia-smi
Thu Jul 17 08:34:35 2014
+------------------------------------------------------+
| NVIDIA-SMI 331.62 Driver Version: 331.62 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce 9500 GT     Off  | 0000:01:00.0     N/A |                  N/A |
| 50%   41C  N/A     N/A /  N/A |     78MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+

Then I launch two "cuda-hello-world" programs in the background... but...

[root@MYSYSTEM ~]# nvidia-smi -q

==============NVSMI LOG==============

Timestamp                           : Thu Jul 17 08:34:43 2014
Driver Version                      : 331.62

Attached GPUs                       : 1
GPU 0000:01:00.0
    Product Name                    : GeForce 9500 GT
    Display Mode                    : N/A
    Display Active                  : N/A
    Persistence Mode                : Disabled
    Accounting Mode                 : N/A
    Accounting Mode Buffer Size     : N/A
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-11086ef1-ff27-cbb6-8c62-53864d0332e1
    Minor Number                    : 0
    VBIOS Version                   : 62.94.4B.00.52
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x064010DE
        Bus Id                      : 0000:01:00.0
        Sub System Id               : 0x00000000
        GPU Link Info
            PCIe Generation
                Max                 : N/A
                Current             : N/A
            Link Width
                Max                 : N/A
                Current             : N/A
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
    Fan Speed                       : 50 %
    Performance State               : N/A
    Clocks Throttle Reasons         : N/A
    FB Memory Usage
        Total                       : 1023 MiB
        Used                        : 78 MiB
        Free                        : 945 MiB
    BAR1 Memory Usage
        Total                       : N/A
        Used                        : N/A
        Free                        : N/A
    Compute Mode                    : Default
    Utilization
        Gpu                         : N/A
        Memory                      : N/A
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        Gpu                         : 41 C
    Power Readings
        Power Management            : N/A
        Power Draw                  : N/A
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Compute Processes               : N/A

So I continue with the same problem. How can I get process information for GeForce GTX-* GPUs? Could you help me? Thanks.
Closing due to age. If people are still experiencing issues with current driver versions on supported hardware (Fermi or newer, see https://stackoverflow.com/questions/19761056/nvml-power-readings-with-nvmldevicegetpowerusage), please open a new issue. Thanks.
The new stable driver 325.15 was released two days ago. Are any changes needed to use this fix with the new version, other than specifying its version at build time?