Open kramanella opened 3 years ago
Did you upgrade OFED after installing nv_peer_mem? Try removing nv_peer_mem and installing it again; this should fix the issue.
Reinstalled OFED50 after installing nv_peer_mem. It builds successfully but throws these warnings:
depmod: WARNING: /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/nv_peer_mem.ko needs unknown symbol ib_register_peer_memory_client
depmod: WARNING: /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/nv_peer_mem.ko needs unknown symbol ib_unregister_peer_memory_client
Removing nv_peer_mem and installing again doesn't fix it either.
Looking around, the module entries exist in nv_peer_mem.ko but the symbols don't.
[root@n120 modules]# nm -a /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/nv_peer_mem.ko | grep -E 'ib_register_peer_memory_client|ib_unregister_peer_memory_client'
U ib_register_peer_memory_client
U ib_unregister_peer_memory_client
On my system OFED50 installs modules under /lib/modules/3.10.0-1127.el7.x86_64, where the kernel suffix is truncated (not sure if this is normal behavior):
uname -a
Linux n120 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 11 19:12:04 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
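The directory mismatch described above can be checked mechanically. Here is a minimal sketch that pulls the kernel release out of a /lib/modules/... path and compares it with a given release string. The function names are made up for illustration, and a mismatch is only a hint: on RHEL-style systems, modules built for another kernel may still be made available through weak-updates links, which this does not inspect.

```python
import platform
import re
from typing import Optional

# Captures the <release> component of a /lib/modules/<release>/... path.
_MODULES_RE = re.compile(r"^/lib/modules/([^/]+)/")


def kernel_release_of(module_path: str) -> Optional[str]:
    """Return the kernel release a module file is installed under, or None."""
    match = _MODULES_RE.match(module_path)
    return match.group(1) if match else None


def installed_for_kernel(module_path: str, running: Optional[str] = None) -> bool:
    """True if the module lives under the directory of the given kernel release.

    `running` defaults to the release of the kernel this script runs on
    (the same string `uname -r` prints).
    """
    if running is None:
        running = platform.release()
    return kernel_release_of(module_path) == running
```

With the paths from this thread, the ib_core.ko path reports 3.10.0-1127.el7.x86_64 while the running kernel is 3.10.0-1127.19.1.el7.x86_64, so installed_for_kernel returns False for it.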
The module entries and valid symbols are found in ib_core.ko:
[root@n120 modules]# nm -a /lib/modules/3.10.0-1127.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_core.ko | grep -E 'ib_register_peer_memory_client|ib_unregister_peer_memory_client'
00000000fec50ade A __crc_ib_register_peer_memory_client
00000000bde5c050 A __crc_ib_unregister_peer_memory_client
0000000000011450 T ib_register_peer_memory_client
00000000000113c0 T ib_unregister_peer_memory_client
00000000000003e0 r __kcrctab_ib_register_peer_memory_client
0000000000000538 r __kcrctab_ib_unregister_peer_memory_client
0000000000000cc3 r __kstrtab_ib_register_peer_memory_client
0000000000000ca2 r __kstrtab_ib_unregister_peer_memory_client
00000000000007c0 r __ksymtab_ib_register_peer_memory_client
0000000000000a70 r __ksymtab_ib_unregister_peer_memory_client
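The manual comparison of the two nm listings above can be scripted. This is a rough sketch that takes the text output of nm for the consuming module and for the providing module, and reports which undefined (U) symbols have no matching __ksymtab_ export entry in the provider. The function names are hypothetical, and it only compares files on disk; it says nothing about which ib_core.ko the running kernel actually loaded or whether the symbol CRCs match.

```python
from typing import Dict, Set

_EXPORT_PREFIX = "__ksymtab_"


def parse_nm(text: str) -> Dict[str, str]:
    """Map symbol name -> nm type letter from the text output of `nm`."""
    symbols: Dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            # Undefined symbols have no address column: "U name".
            sym_type, name = fields
        elif len(fields) == 3:
            # Defined symbols: "address type name".
            _, sym_type, name = fields
        else:
            continue
        symbols[name] = sym_type
    return symbols


def unresolved(consumer_nm: str, provider_nm: str) -> Set[str]:
    """Symbols the consumer needs that the provider does not export."""
    wanted = {name for name, t in parse_nm(consumer_nm).items() if t == "U"}
    exported = {
        name[len(_EXPORT_PREFIX):]
        for name in parse_nm(provider_nm)
        if name.startswith(_EXPORT_PREFIX)
    }
    return wanted - exported
```

Fed the two listings from this thread, unresolved would return an empty set, which matches the observation that ib_core.ko under the 3.10.0-1127.el7.x86_64 directory does export both symbols.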
Getting lost in this, more guidance please!
Similar issue on the 3.10.0-1160.42.2.el7.x86_64 kernel.
- rpm install output
Running transaction
Installing : nvidia_peer_memory-1.1-0.x86_64 1/1
modprobe: ERROR: could not insert 'nv_peer_mem': Unknown symbol in module, or unknown parameter (see dmesg)
Uploading Package Profile
Loaded plugins: fastestmirror, langpacks, nvidia, product-id, subscription-
: manager
Loaded plugins: fastestmirror, langpacks, nvidia, product-id, subscription-
: manager
Verifying : nvidia_peer_memory-1.1-0.x86_64 1/1
Installed: nvidia_peer_memory.x86_64 0:1.1-0
Complete!
Uploading Enabled Repositories Report
Loaded plugins: fastestmirror, langpacks, nvidia, product-id, subscription-
: manager
- dmesg errors
[ 4828.021813] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)
[ 4828.021883] nv_peer_mem: Unknown symbol ib_unregister_peer_memory_client (err 0)
[ 4850.225402] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)
[ 4850.225460] nv_peer_mem: Unknown symbol ib_unregister_peer_memory_client (err 0)
[ 5164.062703] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)
[ 5164.062748] nv_peer_mem: Unknown symbol ib_unregister_peer_memory_client (err 0)
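Since each modprobe attempt repeats the same pair of messages, it can help to reduce the dmesg text to the distinct missing symbols per module. A small sketch, assuming the message format shown above ("<module>: Unknown symbol <name> (err <n>)"); the function name is made up for illustration.

```python
import re
from typing import Dict, Set

# Matches e.g. "nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)".
_UNKNOWN_RE = re.compile(r"(\S+): Unknown symbol (\S+) \(err (-?\d+)\)")


def unknown_symbols(dmesg_text: str) -> Dict[str, Set[str]]:
    """Group the distinct unknown symbols reported in dmesg text by module."""
    result: Dict[str, Set[str]] = {}
    for module, symbol, _err in _UNKNOWN_RE.findall(dmesg_text):
        result.setdefault(module, set()).add(symbol)
    return result
```

For the six lines above this yields a single entry for nv_peer_mem containing the two ib_*_peer_memory_client symbols, regardless of the timestamp format in front of each line.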
Some Ubuntu kernels seem to not have CONFIG_MODVERSIONS set (while others have it set) and this breaks building nv_peer_mem.
I'm not sure it would be OK to build on a kernel with no MODVERSIONS. But if anybody wants to tackle this: I guess what you need to do is remove the parameter KBUILD_EXTRA_SYMBOLS= from the build command. Assuming that this actually helps,
you can start by adding something along these lines to the Makefile:
-include $(KDIR)/.config
ifneq (y,$(CONFIG_MODVERSIONS))
...
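Before touching the Makefile, it may be worth confirming whether the kernel in question really has CONFIG_MODVERSIONS set. The sketch below checks kernel .config text for an option set to y. The function name is hypothetical, and where the config text comes from is an assumption: many distributions ship it as /boot/config-<release>, but that is not guaranteed.

```python
def config_enabled(config_text: str, option: str) -> bool:
    """True if `option` is set to y in kernel .config text.

    Disabled options appear as comments ("# CONFIG_FOO is not set"),
    so comment lines and lines without "=" are skipped.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key == option:
            return value == "y"
    return False
```

For example, passing the contents of the kernel config file and "CONFIG_MODVERSIONS" returns True only when the file contains the line CONFIG_MODVERSIONS=y.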
I see the same error with ib_register_peer_memory_client and ib_unregister_peer_memory_client on Rocky Linux 8.4 with OFED 5.4-1.0.3.0, so it doesn't seem like the issue has anything to do with CONFIG_MODVERSIONS (maybe?)
After finding the 1.2 release (which doesn't seem to be mentioned on the Mellanox homepage?) and newer OFED drivers (which weren't listed on the Mellanox repo https://linux.mellanox.com/public/repo/mlnx_ofed/ for some reason... :-1: ) I managed to build this successfully.
Trying to install nvidia_peer_memory-1.1-0.x86_64 on a RHEL 7.8 node with OFED 5.0-2.1.8.0 and hitting a modprobe error.
[root@n120 ~]# modprobe -v nv_peer_mem
insmod /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/nv_peer_mem.ko
modprobe: ERROR: could not insert 'nv_peer_mem': Unknown symbol in module, or unknown parameter (see dmesg)
From dmesg:
...
[Fri Feb 12 11:47:20 2021] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)
[Fri Feb 12 11:47:20 2021] nv_peer_mem: Unknown symbol ib_unregister_peer_memory_client (err 0)
[Fri Feb 12 11:49:38 2021] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)
[Fri Feb 12 11:49:38 2021] nv_peer_mem: Unknown symbol ib_unregister_peer_memory_client (err 0)
Thanks in advance! nv_peer_mem-modprobe.txt
File attached with output requested from similar issue.