CFSworks / nvml_fix

A workaround for an annoying bug in nVidia's NVML library. Allows nvidia-smi to work once more!

New driver version 325.15 #3

Closed arclance closed 5 years ago

arclance commented 11 years ago

The new stable driver 325.15 was released two days ago. Are any changes needed to use this fix with the new version other than specifying its version at build time?

millecker commented 11 years ago

sudo make install TARGET_VER=325.15 PREFIX=/usr

nvidia-smi -q -a
Mismatch in versions between nvidia-smi and NVML.
Are you sure you are using nvidia-smi provided with the driver?
Failed to properly shut down NVML: Function Not Found

How can we use it with the new driver 325.15? Thanks!

CFSworks commented 11 years ago

I just got back from vacation and upgraded my personal computer to 325.15... I'll work on a fix either today or tomorrow.

CFSworks commented 11 years ago

Works for me:

$ make clean
$ sudo make install PREFIX=/usr TARGET_VER=325.15

Could you show me the output of "ls -l /usr/lib/libnvidia-ml*"? Maybe the symlinks got messed up.

millecker commented 11 years ago

$ make clean
rm -f libnvidia-ml.so.1
rm -f libnvidia-ml.so.319.32

$ ls -l /usr/lib/libnvidia-ml*
lrwxrwxrwx 1 root root 17 Aug 8 17:48 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx 1 root root 22 Aug 8 17:48 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.325.15
-rwxr-xr-x 1 root root 550512 Aug 8 17:48 /usr/lib/libnvidia-ml.so.325.15

$ sudo make install PREFIX=/usr TARGET_VER=325.15
gcc -shared -fPIC empty.c -o libnvidia-ml.so.325.15
gcc -shared -fPIC -o libnvidia-ml.so.1 -DNVML_PATCH_319 -DNVML_VERSION=\"325.15\" libnvidia-ml.so.325.15 nvml_fix.c
/usr/bin/install -D -Dm755 libnvidia-ml.so.1 /usr/lib/libnvidia-ml.so.1

$ ls -l /usr/lib/libnvidia-ml*
lrwxrwxrwx 1 root root 17 Aug 8 17:48 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
-rwxr-xr-x 1 root root 12831 Aug 9 14:07 /usr/lib/libnvidia-ml.so.1
-rwxr-xr-x 1 root root 550512 Aug 8 17:48 /usr/lib/libnvidia-ml.so.325.15

$ nvidia-smi -q -a
Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found

Thanks!

arclance commented 11 years ago

@CFSworks That works for me as well.

@millecker Have you re-cloned or updated the source on your computer since the build system update? If you have not, doing that might help.

millecker commented 11 years ago

I tried it again with a fresh clone of your repository, but I still get the same error:

$ nvidia-smi -q -a
Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found

And after a reboot I get the following message:

$ nvidia-smi -a -q
Mismatch in versions between nvidia-smi and NVML.
Are you sure you are using nvidia-smi provided with the driver?
Failed to properly shut down NVML: Function Not Found

baoboa commented 11 years ago

Works for me with 325.15 (Fedora 19 64-bit, kernel 3.10 + nvidia 325.15).

Quick and dirty (without installing):

git clone https://github.com/CFSworks/nvml_fix.git
cd nvml_fix
make TARGET_VER=325.15
rm libnvidia-ml.so.325.15
export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
nvidia-smi

Thanks for this patch.

licaon-kter commented 11 years ago

I'm getting

Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found

too.

Later edit: deleting the generated libnvidia-ml.so.325.15, as @baoboa says, and replacing it with the nVidia-provided one fixes it. Maybe you could rm libnvidia-ml.so.$(TARGET_VER) in the Makefile if the file is not actually needed?

baoboa commented 11 years ago

I get that when I don't keep the original .so.325.15 lib.


vacaloca commented 11 years ago

I have tried this shim with a patched nvidia 325.15 driver (from Ubuntu's xorg-edgers repository) with both 3.11 and 3.12-rc3 kernels on 2 different machines, but I get the message:

"Mismatch in versions between nvidia-smi and NVML.
Are you sure you are using nvidia-smi provided with the driver?
Failed to properly shut down NVML: Function Not Found"

I am assuming this is because the patch interferes with the shim. The patch I used is here: http://leigh123linux.fedorapeople.org/pub/patches/kernel_v3.11.patch

The name is a misnomer: it actually works with kernel 3.11 and with the 3.12-rc versions as well. Like the other poster in the other issue, I'm able to compile and run nvml_bug.c as:

Faraday:~/Desktop/nvml_fix-master$ gcc nvml_bug.c -o test -I. -L/usr/lib/nvidia-325 -lnvidia-ml
Faraday:~/Desktop/nvml_fix-master$ optirun ./test

and I get the correct output after adding 325.15 to the version check list:

Found 1 device(s):
Device 0, "GeForce GT 740M":
---- WITHOUT BUGFIX ----
Utilization: Not Supported
Power usage: Not Supported
---- WITH BUGFIX ----
Utilization: 0% GPU, 0% MEM
Power usage: Not Supported

I'm a bit at a loss as to why the shim (nvml_fix.c) doesn't work. The shim doesn't compile with v5 of nvml.h, but with v3 and v4 I get the same "Mismatch ... Not Found" output as above. When I compile and run nvml_bug.c, I'm able to do so with the correct output with v3, v4, and v5 of nvml.h.

That being said, that output alone might suit my purposes better than calling nvidia-smi since I just need to get the GPU and memory utilization for a particular CUDA code.
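
If only the utilization numbers are needed, the test program's output can be filtered with a few lines of shell. This is only a sketch: it assumes the program prints each field on its own line, as in the output quoted above, and bugfix_utilization is a name invented for this example, not something shipped with nvml_fix.

```shell
#!/bin/sh
# Hypothetical helper: reads the nvml_bug test program's output on stdin and
# prints "GPU% MEM%" (numbers only) for each "Utilization:" line that follows
# a "WITH BUGFIX" marker. Lines in the "WITHOUT BUGFIX" part are ignored.
bugfix_utilization() {
    awk '
        /WITHOUT BUGFIX/ { fixed = 0; next }
        /WITH BUGFIX/    { fixed = 1; next }
        fixed && /Utilization:/ {
            # Expected shape: "Utilization: 0% GPU, 0% MEM"
            if (match($0, /[0-9]+% GPU, [0-9]+% MEM/)) {
                split(substr($0, RSTART, RLENGTH), f, /[% ,]+/)
                print f[1], f[3]
            }
        }
    '
}
```

It would be used as `optirun ./test | bugfix_utilization`. A "Not Supported" utilization line produces no output, so an empty result means the shimmed call did not return numbers either.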

licaon-kter commented 11 years ago

I use it fine with 3.11. Did you manually copy the compiled libnvidia-ml.so.1 over the original libnvidia-ml.so.325.15 (which is wrong), or did you first move/rename the original libnvidia-ml.so.1 out of the way and then put the compiled one in its place?

vacaloca commented 11 years ago

I renamed the libnvidia-ml.so.1 symlink in /usr/lib/nvidia-325 to libnvidia-ml.so.1.old and manually copied the libnvidia-ml.so.1 from the nvml_fix directory to /usr/lib/nvidia-325. I also did a chmod 777 on it just to be sure.

pszi1ard commented 11 years ago

Guys, I'm having the same issue as @millecker both with 319.49 (kernel 3.8 and 3.5) and 319.37 (kernel 3.5). I'm not overwriting anything, just using the same trick as @baoboa to test things out.

Any idea what could be causing it and how to fix it?

stiobhan commented 10 years ago

I had the same issue as @millecker on Ubuntu 12.04 x86_64. Following up on the success report of @baoboa I tried compiling with gcc-4.4 (standard in Fedora 19) and that's what does the trick. gcc version that worked for me: gcc-4.4 (Ubuntu/Linaro 4.4.7-1ubuntu2) 4.4.7

@CFSworks please include info about tested gcc versions in the README file. Any idea why this is an issue in the first place?

stiobhan commented 10 years ago

I think the problem @millecker and others describe is caused by a change introduced with gcc 4.5 in Debian and Ubuntu. https://lists.ubuntu.com/archives/ubuntu-devel-announce/2010-October/000772.html

As a consequence gcc versions 4.5 and later pass "--as-needed" to the linker by default and the resulting libnvidia-ml.so.1 is not linked with libnvidia-ml.so.version. You can check that with 'ldd libnvidia-ml.so.1'.
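
To make that ldd check a bit more mechanical, here is a rough sketch. The helper name shim_links_real_lib is made up for illustration and is not part of nvml_fix. It reads ldd output on stdin and reports whether the versioned NVIDIA library appears in the dependency list; if the linker dropped it because of --as-needed, that line is simply absent.

```shell
#!/bin/sh
# Hypothetical helper: reads `ldd` output on stdin and checks whether the
# shim depends on libnvidia-ml.so.<version>.
#   $1 = driver version the shim was built for, e.g. 325.15
shim_links_real_lib() {
    ver="$1"
    # A correctly linked shim has a line such as:
    #   libnvidia-ml.so.325.15 => /usr/lib/libnvidia-ml.so.325.15 (0x...)
    # ("=> not found" also matches: the dependency is recorded, just not resolved.)
    if grep -F -q "libnvidia-ml.so.${ver} =>"; then
        echo "ok: shim is linked against libnvidia-ml.so.${ver}"
    else
        echo "broken: libnvidia-ml.so.${ver} missing from the dependency list"
        return 1
    fi
}
```

Example: `ldd ./libnvidia-ml.so.1 | shim_links_real_lib 325.15`. A "broken" result matches the situation described above, where the shim has nothing to forward its calls to.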

Not sure what the best way to solve this is but the following works for me.

diff --git a/Makefile b/Makefile
index 00a8ca5..ee7ce5b 100644
--- a/Makefile
+++ b/Makefile
@@ -14,7 +14,7 @@ ${TARGET:1=${TARGET_VER}}: empty.c
        ${CC} ${CFLAGS} -shared -fPIC $(<) -o $(@) 

 $(TARGET): ${TARGET:1=${TARGET_VER}}
-       ${CC} ${CFLAGS} -shared -fPIC -o $(@) -DNVML_PATCH_${TARGET_MAJOR} -DNVML_VERSION=\"$(TARGET_VER)\" $< nvml_fix.c
+       ${CC} ${CFLAGS} -Wl,--no-as-needed -shared -fPIC -o $(@) -DNVML_PATCH_${TARGET_MAJOR} -DNVML_VERSION=\"$(TARGET_VER)\" $< nvml_fix.c

Thaodan commented 10 years ago

So it's a bug in Ubuntu and not in nvml_fix.

stiobhan commented 10 years ago

Where do you get that from? It is a change of defaults, not a bug. And afaik not limited to Ubuntu.

DanielRuizMolina commented 10 years ago

I'm getting the same error with Scientific Linux 6.5 and Nvidia driver 331.62 (I know the patch is for other releases, but I'm trying anyway):

Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found

I copied both generated files (libnvidia-ml.so.1 and libnvidia-ml.so.331.62) to /usr/{lib,lib64}/nvidia, after first removing the symbolic link, but I keep getting that error.

I also need to know whether a GPU (not a Tesla) is running a process. How can I do that if nvidia-smi shows "Compute Process: N/A"?

Thanks.

stiobhan commented 10 years ago

@DanielRuizMolina the generated libnvidia-ml.so.331.62 is a dummy file, you're not supposed to copy that. Only libnvidia-ml.so.1 is needed. And also check where the nvidia driver installed the libnvidia* files. I'm pretty sure they are in /usr/lib/ and not /usr/lib/nvidia/. Last but not least, what's the output of 'ldd libnvidia-ml.so.1'?
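
Roughly, a manual install along those lines could look like the sketch below. install_shim is a hypothetical helper written for this comment, not something the nvml_fix Makefile provides, and the library directory is whatever your distribution actually uses. It refuses to run unless the versioned library is present as a regular file, moves the existing libnvidia-ml.so.1 aside, and copies only the shim.

```shell
#!/bin/sh
# Hypothetical helper, not part of nvml_fix.
#   $1 = directory holding the driver's libnvidia-ml files
#   $2 = driver version, e.g. 331.62
#   $3 = path to the freshly built libnvidia-ml.so.1 (the shim)
install_shim() {
    libdir="$1"; ver="$2"; shim="$3"
    real="${libdir}/libnvidia-ml.so.${ver}"
    target="${libdir}/libnvidia-ml.so.1"

    # The versioned library must exist as a regular file. This cannot tell
    # NVIDIA's original from the dummy built from empty.c, so never copy
    # the dummy into this directory.
    if [ ! -f "$real" ] || [ -L "$real" ]; then
        echo "error: ${real} is missing or not a regular file" >&2
        return 1
    fi

    # Keep whatever was there before (normally a symlink) so the change
    # can be undone by moving it back.
    if [ -e "$target" ] || [ -L "$target" ]; then
        mv -f "$target" "${target}.old" || return 1
    fi

    # Copy the shim as a regular file; the versioned library is untouched.
    cp "$shim" "$target" && chmod 755 "$target"
}
```

For example, `install_shim /usr/lib64/nvidia 331.62 ./libnvidia-ml.so.1` run as root. Checking the result with 'ldd' afterwards is still worthwhile.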

DanielRuizMolina commented 10 years ago

Hi,

In Scientific Linux, libnvidia-ml.so.1 is owned by the following packages:

yum provides */libnvidia-ml.so.1
  Loaded plugins: refresh-packagekit, security

  1:xorg-x11-drv-nvidia-libs-319.37-2.el6.i686 : Libraries for xorg-x11-drv-nvidia
  Repo        : cuda
  Matched from:
  Filename    : /usr/lib/nvidia/libnvidia-ml.so.1

  1:xorg-x11-drv-nvidia-libs-331.62-2.el6.i686 : Libraries for xorg-x11-drv-nvidia
  Repo        : cuda
  Matched from:
  Filename    : /usr/lib/nvidia/libnvidia-ml.so.1

  1:xorg-x11-drv-nvidia-libs-319.37-2.el6.x86_64 : Libraries for xorg-x11-drv-nvidia
  Repo        : cuda
  Matched from:
  Filename    : /usr/lib64/nvidia/libnvidia-ml.so.1

  1:xorg-x11-drv-nvidia-libs-331.62-2.el6.x86_64 : Libraries for xorg-x11-drv-nvidia
  Repo        : cuda
  Matched from:
  Filename    : /usr/lib64/nvidia/libnvidia-ml.so.1

  gpu-deployment-kit-331.62-0.x86_64 : NVIDIA® Cluster Management Tools
  Repo        : cuda
  Matched from:
  Filename    : /usr/src/gdk/nvml/lib/libnvidia-ml.so.1

  1:xorg-x11-drv-nvidia-libs-331.62-2.el6.x86_64 : Libraries for xorg-x11-drv-nvidia
  Repo        : installed
  Matched from:
  Filename    : /usr/lib64/nvidia/libnvidia-ml.so.1

  1:xorg-x11-drv-nvidia-libs-331.62-2.el6.i686 : Libraries for xorg-x11-drv-nvidia
  Repo        : installed
  Matched from:
  Filename    : /usr/lib/nvidia/libnvidia-ml.so.1
What I see with "ls" in the folders /usr/lib/nvidia and /usr/lib64/nvidia:

  lrwxrwxrwx 1 root root   22 jul 16 12:42 libnvidia-ml.so -> libnvidia-ml.so.331.62
  lrwxrwxrwx 1 root root   22 jul 17 08:16 libnvidia-ml.so.1 -> libnvidia-ml.so.331.62
  -rwxr-xr-x 1 root root 543K mar 20 02:35 libnvidia-ml.so.331.62

The "nvidia-smi" binary is a 64-bit executable:

file /usr/bin/nvidia-smi --> /usr/bin/nvidia-smi: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.4.0, stripped

What I get with "ldd":

Folder /usr/lib/nvidia:

[root@MYSYSTEM nvidia]# ldd libnvidia-ml.so.1
        linux-gate.so.1 =>  (0x00af1000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x00cd5000)
        libdl.so.2 => /lib/libdl.so.2 (0x00c0d000)
        libc.so.6 => /lib/libc.so.6 (0x0065b000)
        /lib/ld-linux.so.2 (0x00ba5000)

Folder /usr/lib64/nvidia:

[root@MYSYSTEM nvidia]# ldd libnvidia-ml.so.1
        linux-vdso.so.1 =>  (0x00007ffff98f1000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd150879000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fd150675000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fd1502e0000)
        /lib64/ld-linux-x86-64.so.2 (0x00000037f8200000)

I have run these steps:

cat Makefile:

  CC            = gcc
  CFLAGS        =
  TARGET_VER    = 331.62# just set to a valid ver eg. one of: 325.08 325.15 319.32 319.23
  #TARGET_VER    = 325.15# just set to a valid ver eg. one of: 325.08 325.15 319.32 319.23
  TARGET_MAJOR := $(shell echo ${TARGET_VER} | cut -d . --f=1)
  TARGET        = libnvidia-ml.so.1
  DESTDIR       = /
  PREFIX        = $(DESTDIR)usr
  libdir        = $(PREFIX)/lib
  INSTALL       = /usr/bin/install -D

  all: $(TARGET)

  ${TARGET:1=${TARGET_VER}}: empty.c
          ${CC} ${CFLAGS} -shared -fPIC $(<) -o $(@)

  $(TARGET): ${TARGET:1=${TARGET_VER}}
          ${CC} ${CFLAGS} -shared -fPIC -o $(@) -DNVML_PATCH_${TARGET_MAJOR} -DNVML_VERSION=\"$(TARGET_VER)\" $< nvml_fix.c

  clean:
          rm -f $(TARGET)
          rm -f ${TARGET:1=${TARGET_VER}}

  install: libnvidia-ml.so.1
          $(INSTALL) -Dm755 $(^) $(libdir)/$(^)

  .PHONY: clean install all

[root@MYSYSTEM nvml_fix-master]# make TARGET_VER=331.62
gcc  -shared -fPIC empty.c -o libnvidia-ml.so.331.62
gcc  -shared -fPIC -o libnvidia-ml.so.1 -DNVML_PATCH_331 -DNVML_VERSION=\"331.62\" libnvidia-ml.so.331.62 nvml_fix.c

[root@MYSYSTEM nvml_fix-master]# ldd ./libnvidia-ml.so.1
        linux-vdso.so.1 =>  (0x00007fff2bcfc000)
        libnvidia-ml.so.331.62 => /usr/lib64/nvidia/libnvidia-ml.so.331.62 (0x00007f12a16da000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f12a1333000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f12a1116000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f12a0f12000)
        /lib64/ld-linux-x86-64.so.2 (0x00000037f8200000)

Now, if I delete the symbolic link "libnvidia-ml.so.1" in /usr/lib64/nvidia and copy in the file created by "make", I can run "nvidia-smi" with no problems, but I can't get process information:

[root@MYSYSTEM ~]# nvidia-smi
Thu Jul 17 08:34:35 2014
+------------------------------------------------------+
| NVIDIA-SMI 331.62     Driver Version: 331.62         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce 9500 GT     Off  | 0000:01:00.0     N/A |                  N/A |
| 50%   41C  N/A     N/A /  N/A |     78MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+

Then I launch two "cuda-hello-world" processes in the background, but:

[root@MYSYSTEM ~]# nvidia-smi -q
==============NVSMI LOG==============
Timestamp                           : Thu Jul 17 08:34:43 2014
Driver Version                      : 331.62
Attached GPUs                       : 1
GPU 0000:01:00.0
    Product Name                    : GeForce 9500 GT
    Display Mode                    : N/A
    Display Active                  : N/A
    Persistence Mode                : Disabled
    Accounting Mode                 : N/A
    Accounting Mode Buffer Size     : N/A
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-11086ef1-ff27-cbb6-8c62-53864d0332e1
    Minor Number                    : 0
    VBIOS Version                   : 62.94.4B.00.52
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x064010DE
        Bus Id                      : 0000:01:00.0
        Sub System Id               : 0x00000000
        GPU Link Info
            PCIe Generation
                Max                 : N/A
                Current             : N/A
            Link Width
                Max                 : N/A
                Current             : N/A
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
    Fan Speed                       : 50 %
    Performance State               : N/A
    Clocks Throttle Reasons         : N/A
    FB Memory Usage
        Total                       : 1023 MiB
        Used                        : 78 MiB
        Free                        : 945 MiB
    BAR1 Memory Usage
        Total                       : N/A
        Used                        : N/A
        Free                        : N/A
    Compute Mode                    : Default
    Utilization
        Gpu                         : N/A
        Memory                      : N/A
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        Gpu                         : 41 C
    Power Readings
        Power Management            : N/A
        Power Draw                  : N/A
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Compute Processes               : N/A

So I still have the same problem: I can get process information for GeForce GTX-* GPUs. Could you help me?

Thanks.

tofurky commented 5 years ago

Closing due to age. If people are still experiencing issues with current driver versions on supported hardware (Fermi or newer, see https://stackoverflow.com/questions/19761056/nvml-power-readings-with-nvmldevicegetpowerusage), please open a new issue. Thanks.