CFSworks / nvml_fix

A workaround for an annoying bug in nVidia's NVML library. Allows nvidia-smi to work once more!

Can't run nvidia-smi on 64-bit Ubuntu 13.04 #2

Open mrj10 opened 11 years ago

mrj10 commented 11 years ago

OS: Ubuntu 13.04, 64-bit
Kernel: 3.8.0-27
NVIDIA Driver: 319.32

I can build libnvidia-ml.so.1 just fine, but when I install it and try to run nvidia-smi, I get:

Mismatch in versions between nvidia-smi and NVML. Are you sure you are using nvidia-smi provided with the driver?
Failed to properly shut down NVML: Function Not Found

I have tried:

Curiously, I am able to compile and run http://cfsworks.com/files/downloads/nvml_bug.c successfully.

Any insight would be greatly appreciated. Thanks!

CFSworks commented 11 years ago

Hi! Sorry for the delay; I was on vacation until today.

Could you show me the output of ls -l /usr/lib/libnvidia-ml*?

mrj10 commented 11 years ago

No worries :)

Before nvml_fix installation:

$ ls -l /usr/lib/libnvidia-ml*
lrwxrwxrwx 1 root root 17 Aug 4 14:45 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx 1 root root 22 Aug 4 14:52 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.319.32
-rwxr-xr-x 1 root root 549936 Aug 4 14:45 /usr/lib/libnvidia-ml.so.319.32

Installation:

$ sudo make install TARGET_VER=319.32 PREFIX=/usr
/usr/bin/install -D -Dm755 libnvidia-ml.so.1 /usr/lib/libnvidia-ml.so.1

After:

$ ls -l libnvidia-ml*
lrwxrwxrwx 1 root root 17 Aug 4 14:45 libnvidia-ml.so -> libnvidia-ml.so.1
-rwxr-xr-x 1 root root 12965 Aug 8 22:15 libnvidia-ml.so.1
-rwxr-xr-x 1 root root 549936 Aug 4 14:45 libnvidia-ml.so.319.32

Try to use nvidia-smi:

$ nvidia-smi
Mismatch in versions between nvidia-smi and NVML. Are you sure you are using nvidia-smi provided with the driver?
Failed to properly shut down NVML: Function Not Found

In case it helps, a trace of the relevant syscalls when opening the library:

$ strace nvidia-smi 2>&1 | grep -n -A10 nvidia-ml
127:open("tls/x86_64/libnvidia-ml.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
128:open("tls/libnvidia-ml.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
129:open("x86_64/libnvidia-ml.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
130:open("libnvidia-ml.so.1", O_RDONLY|O_CLOEXEC) = 3
131-read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\10\0\0\0\0\0\0"..., 832) = 832
132-fstat(3, {st_mode=S_IFREG|0755, st_size=12965, ...}) = 0
133-getcwd("/usr/lib", 128) = 9
134-mmap(NULL, 2105488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ffd900f8000
135-mprotect(0x7ffd900fa000, 2093056, PROT_NONE) = 0
136-mmap(0x7ffd902f9000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7ffd902f9000
137-close(3) = 0
138-mprotect(0x7ffd902f9000, 4096, PROT_READ) = 0
139:open("tls/x86_64/libnvidia-ml.so.319.32", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
140:open("tls/libnvidia-ml.so.319.32", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
141:open("x86_64/libnvidia-ml.so.319.32", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
142:open("libnvidia-ml.so.319.32", O_RDONLY|O_CLOEXEC) = 3
143-read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300d\0\0\0\0\0\0"..., 832) = 832
144-fstat(3, {st_mode=S_IFREG|0755, st_size=549936, ...}) = 0
145-getcwd("/usr/lib", 128) = 9
146-mmap(NULL, 2694976, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ffd8fe66000
147-mprotect(0x7ffd8fee4000, 2097152, PROT_NONE) = 0
148-mmap(0x7ffd900e4000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7e000) = 0x7ffd900e4000
149-mmap(0x7ffd900ec000, 48960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ffd900ec000
150-close(3) = 0
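
For anyone comparing their own setup against the listings above, here is a rough sketch (not part of nvml_fix) that follows each symlink one hop at a time and reports the size of the file it ends on. In the listings above, libnvidia-ml.so.1 is a 12965-byte regular file after installation, while the original driver library is 549936 bytes. The helper names `link_chain` and `describe` are made up for illustration, and the `/usr/lib` location is simply the one used in this thread.

```python
import os


def link_chain(path, limit=16):
    """Follow symlinks from `path` one hop at a time; return every path visited."""
    chain = [path]
    while os.path.islink(path) and len(chain) <= limit:
        target = os.readlink(path)
        # Relative targets are resolved against the directory holding the link.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        chain.append(path)
    return chain


def describe(path):
    """Return (chain, size); size is None if the chain does not end on a regular file."""
    chain = link_chain(path)
    final = chain[-1]
    size = os.path.getsize(final) if os.path.isfile(final) else None
    return chain, size


if __name__ == "__main__":
    # Library names and directory as they appear in this thread (an assumption
    # for any other system).
    for name in ("libnvidia-ml.so", "libnvidia-ml.so.1"):
        chain, size = describe(os.path.join("/usr/lib", name))
        print(" -> ".join(chain), "(%s bytes)" % size)
```

This only reports what the names resolve to; it says nothing about whether the files it finds are compatible with nvidia-smi.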

arich82 commented 10 years ago

@mrj10: I'm afraid I'm getting the same error with .so.319.60 on Ubuntu 13.04 (64-bit). Did you ever get this to work?

@CFSworks: I've also tried using the new "nvml.h" file from the August 1st version of the Tesla Deployment Kit (NVML Version 5). The new header contains #defines that replace 'nvmlInit', 'nvmlDeviceGetHandleByIndex', and 'nvmlDeviceGetHandleByPciBusId' with their "_v2" counterparts, so I simply commented out the lines in nvml_fix.c that seemed to serve the same purpose (i.e. lines 7; 9, 11, 15; 23, 25, 29; 35; 70, 74). This compiles fine with the new header (GCC 4.7), but the "Mismatch" error persists.
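
Since the error text ends in "Function Not Found", one thing that may be worth checking is which entry points a given libnvidia-ml build actually exports. The following is a minimal sketch using Python's ctypes, not something from nvml_fix. The helper name `exported` is hypothetical; the symbol names are the ones mentioned in this comment plus nvmlShutdown, and whether nvidia-smi looks up exactly these names is an assumption.

```python
import ctypes

# Names taken from this thread, plus nvmlShutdown; the exact set nvidia-smi
# requires is not known here.
NVML_NAMES = [
    "nvmlInit", "nvmlInit_v2",
    "nvmlDeviceGetHandleByIndex", "nvmlDeviceGetHandleByIndex_v2",
    "nvmlDeviceGetHandleByPciBusId", "nvmlDeviceGetHandleByPciBusId_v2",
    "nvmlShutdown",
]


def exported(lib_path, names):
    """Map each name to True/False depending on whether the loaded library resolves it."""
    # Loading a library runs its initialisers, exactly as the dynamic loader would.
    lib = ctypes.CDLL(lib_path)
    # ctypes raises AttributeError for a missing symbol, so hasattr() works as a probe.
    return {name: hasattr(lib, name) for name in names}


if __name__ == "__main__":
    try:
        report = exported("/usr/lib/libnvidia-ml.so.1", NVML_NAMES)
    except OSError as exc:
        print("could not load library:", exc)
    else:
        for name, found in report.items():
            print("%-36s %s" % (name, "found" if found else "MISSING"))
```

Running it against both the rebuilt libnvidia-ml.so.1 and the original libnvidia-ml.so.319.32 and comparing the two reports could show whether the rebuilt library lacks a name that the original provides.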

I see on the other thread ("New driver version 325.15") that several other people are reporting the same problem. Is it only with 64-bit Linux kernels?

I'd really appreciate any help in getting this to work.

stiobhan commented 10 years ago

I had the same issue on Ubuntu 12.04 x86_64. Install gcc-4.4 and compile with that version instead. Here's a list of gcc versions I tried that didn't work (with the compiled library I get the "Mismatch in versions between nvidia-smi and NVML." error):
gcc (Ubuntu/Linaro 4.6.4-1ubuntu1~12.04) 4.6.4
gcc-4.7 (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) 4.7.3
gcc-4.8 (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1

And finally the one that made it work as expected: gcc-4.4 (Ubuntu/Linaro 4.4.7-1ubuntu2) 4.4.7

See my other comment on #3 for how to fix this with newer gcc versions.