Closed: feiga closed this issue 8 years ago
I'm running on Ubuntu 14.04
@Artemy-Mellanox @alaahl I would bet that nv_peer_mem was built and installed for the wrong kernel.
Can we get kernel version and dmesg output?
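A quick way to sanity-check that guess is to compare the running kernel with the kernel the module was built against. This is only a sketch: `kernel_matches` is a hypothetical helper, and it assumes the module is visible to `modinfo` as `nv_peer_mem`.

```shell
#!/bin/sh
# "disagrees about version of symbol" usually means the module was built
# against a different kernel (or NVIDIA driver) than the one running.

# Compare the running kernel release with the module's vermagic kernel.
kernel_matches() {
    # $1: running kernel release, $2: kernel the module was built for
    [ "$1" = "$2" ]
}

running="$(uname -r)"
# `modinfo -F vermagic` prints e.g. "3.13.0-98-generic SMP mod_unload ...";
# the first field is the kernel release the module was built for.
built_for="$(modinfo -F vermagic nv_peer_mem 2>/dev/null | awk '{print $1}')"

if kernel_matches "$running" "$built_for"; then
    echo "nv_peer_mem matches the running kernel ($running)"
else
    echo "mismatch: running $running, module built for ${built_for:-unknown}"
fi
```

If the two differ, rebuilding the module against the running kernel headers is the usual fix.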
@feiga could you also write the version of CUDA you are using? I think we have a symbol versioning issue with CUDA 8.
Thanks! I'm using CUDA 7.5. The kernel version is 3.13.0-98-generic.
@alaahl @haggaie This is the dmesg output
[ 36.849441] nvidia 0000:33:00.0: irq 253 for MSI/MSI-X
[ 366.264584] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 366.264587] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 366.264590] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 366.264591] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 366.264610] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 366.264611] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 366.264620] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 366.264621] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 372.562565] nv_tco: NV TCO WatchDog Timer Driver v0.01
[ ... ] The same four "disagrees about version of symbol" / "Unknown symbol ... (err -22)" messages repeat on every load attempt (at 406s, 501s, 514s, 1190s, 1194s, 1226s, 1262s, 4724s, 5126s, 5142s, and 5691s), several followed by "init: nv_peer_mem pre-start process terminated with status 1". Unrelated filesystem and RAID initialization messages omitted.
Yes, this issue is because we ship our own symbol version file for the CUDA driver instead of using what is actually installed. We need to either find the symbol file from the CUDA driver build on the system being set up, or use symbol_get as was done in https://github.com/drossetti/nv_peer_memory/commit/ea51a48e21d124fff034a9d5b019d757803561d8.
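To illustrate what "using what is actually installed" means: on a kernel built with CONFIG_MODVERSIONS, each exported symbol carries a CRC, and the CRCs the *running* NVIDIA driver exports can (on many kernels) be read back from `/proc/kallsyms` as `__crc_<symbol>` entries. A rough sketch, with the caveat that `symvers_line` is a hypothetical helper and kallsyms addresses may be hidden without root:

```shell
#!/bin/sh
# Sketch: rebuild a Module.symvers fragment from the CRCs the running
# NVIDIA driver exports, instead of shipping a pre-generated symbols file.

# Format one Module.symvers line: CRC, symbol, exporting module, export type.
symvers_line() {
    printf '0x%s\t%s\t%s\tEXPORT_SYMBOL\n' "$1" "$2" "$3"
}

for sym in nvidia_p2p_get_pages nvidia_p2p_put_pages nvidia_p2p_free_page_table; do
    # With modversions enabled, each exported symbol has a __crc_<name> entry.
    crc="$(awk -v s="__crc_$sym" '$3 == s {print $1}' /proc/kallsyms 2>/dev/null)"
    if [ -n "$crc" ]; then
        symvers_line "$crc" "$sym" nvidia
    fi
done
```

A module built against CRCs recovered this way would agree with the driver that is actually loaded, which is what the symbol-version mismatch above is complaining about.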
Finally it works. Thanks a lot!
Hi @feiga, I met the same problem while installing GPU Direct RDMA drivers for CNTK in a Docker container. Could you share how you solved this issue?
@ferasd has this been fixed in 1.0-5?
I encountered another error when running "sudo dpkg -i nvidia-peer-memory-dkms_1.0-5_all.deb":
(Reading database ... 197084 files and directories currently installed.)
Preparing to unpack nvidia-peer-memory-dkms_1.0-5_all.deb ...
------------------------------
Deleting module version: 1.0
completely from the DKMS tree.
------------------------------
Done.
Unpacking nvidia-peer-memory-dkms (1.0-5) over (1.0-5) ...
Setting up nvidia-peer-memory-dkms (1.0-5) ...
Creating symlink /var/lib/dkms/nvidia-peer-memory/1.0/source ->
/usr/src/nvidia-peer-memory-1.0
DKMS: add completed.
Kernel preparation unnecessary for this kernel. Skipping...
Building module:
cleaning build area....
make KERNELRELEASE=4.10.0-37-generic all KVER=4.10.0-37-generic KDIR=/lib/modules/4.10.0-37-generic/build....(bad exit status: 2)
Error! Bad return status for module build on kernel: 4.10.0-37-generic (x86_64)
Consult /var/lib/dkms/nvidia-peer-memory/1.0/build/make.log for more information.
modprobe: FATAL: Module nv_peer_mem not found in directory /lib/modules/4.10.0-37-generic
I checked and found that nv_peer_mem is actually in "/lib/modules/4.10.0-37-generic/". The content of the log file "/var/lib/dkms/nvidia-peer-memory/1.0/build/make.log" is:
DKMS make.log for nvidia-peer-memory-1.0 for kernel 4.10.0-37-generic (x86_64)
Mon Oct 23 07:36:57 UTC 2017
/var/lib/dkms/nvidia-peer-memory/1.0/build/create_nv.symvers.sh 4.10.0-37-generic
-W- Could not get list of nvidia symbols.
Found /usr/src/nvidia-384-384.69/nvidia/nv-p2p.h
/bin/cp -f /usr/src/nvidia-384-384.69/nvidia/nv-p2p.h /var/lib/dkms/nvidia-peer-memory/1.0/build/nv-p2p.h
cp -rf /Module.symvers .
cp: cannot stat '/Module.symvers': No such file or directory
Makefile:48: recipe for target 'all' failed
make: *** [all] Error 1
Please help!
I installed the development version 1.0-5 on Ubuntu 16.04.
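Reading that make.log, the root failure appears to be the "-W- Could not get list of nvidia symbols" warning: if create_nv.symvers.sh cannot read the running NVIDIA driver's exports, no Module.symvers is produced, and the later "cp: cannot stat '/Module.symvers'" follows from that. A minimal check (a sketch; `module_loaded` is a hypothetical helper that takes captured `lsmod` output so it can be exercised offline):

```shell
#!/bin/sh
# The DKMS build needs the NVIDIA driver's exported symbols, which are only
# available when the nvidia module is actually loaded.

# Is a module present in lsmod-style output?
module_loaded() {
    # $1: module name, $2: captured `lsmod` output
    printf '%s\n' "$2" | grep -q "^$1 "
}

if module_loaded nvidia "$(lsmod 2>/dev/null)"; then
    echo "nvidia module is loaded"
else
    echo "nvidia module is NOT loaded; try 'modprobe nvidia' before rebuilding"
fi
```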
@everyone I solved this problem by installing MLNX_OFED 2.1 (http://www.mellanox.com/page/products_dyn?product_family=26). I don't really know what's going on, but it's in the prerequisites.
@experiencor Which OFED did you have before installing MLNX_OFED 2.1?
@feiga yup. It's MLNX_OFED 2.1.
@haggaie I don't quite understand what you're saying; could you explain in more detail? I tried drossetti/nv_peer_memory@ea51a48, but it failed again.
@sj6077 we had an issue that failed linking with the NVIDIA driver in some cases. Eventually the solution was different from the one in the patch cited above. Instead we take the symbol versions of the NVIDIA driver currently installed (see https://github.com/Mellanox/nv_peer_memory/commit/e8d047e64ac9f499174d5b2063ec694eb9b5de9a).
Anyway, if you are using the latest version, perhaps you should report a new issue, and explain what kind of errors you are seeing.
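For anyone retrying after upgrading, a hedged sketch of rebuilding through DKMS and verifying the result; the module name and version are taken from the dpkg output above (confirm them with `dkms status` on your own machine), and `dkms_installed` is a hypothetical helper:

```shell
#!/bin/sh
# Typical rebuild sequence (run manually, needs root):
#
#   sudo dkms build nvidia-peer-memory/1.0 -k "$(uname -r)"
#   sudo dkms install nvidia-peer-memory/1.0 -k "$(uname -r)"
#   sudo modprobe nv_peer_mem

# Did `dkms status` report the module as installed?
dkms_installed() {
    # $1: "name, version" prefix, $2: captured `dkms status` output
    printf '%s\n' "$2" | grep -q "^$1.*: installed"
}

if dkms_installed "nvidia-peer-memory, 1.0" "$(dkms status 2>/dev/null)"; then
    echo "nv_peer_mem is built and installed for this kernel"
else
    echo "nv_peer_mem not reported as installed; check make.log under /var/lib/dkms"
fi
```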
@rleon
I'm trying to install this module for GPU Direct RDMA, but an error occurs when I run:
sudo dpkg -i nvidia-peer-memory-dkms_1.0-1_all.deb
Do you know what the problem is? Thanks!