Mellanox / nv_peer_memory


modprobe: ERROR: could not insert 'nv_peer_mem': Unknown symbol in module #84

Open kramanella opened 3 years ago

kramanella commented 3 years ago

I'm trying to install nvidia_peer_memory-1.1-0.x86_64 on a RHEL 7.8 node with OFED 5.0-2.1.8.0 and am hitting a modprobe error.

[root@n120 ~]# modprobe -v nv_peer_mem
insmod /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/nv_peer_mem.ko
modprobe: ERROR: could not insert 'nv_peer_mem': Unknown symbol in module, or unknown parameter (see dmesg)

From dmesg:
...
[Fri Feb 12 11:47:20 2021] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)
[Fri Feb 12 11:47:20 2021] nv_peer_mem: Unknown symbol ib_unregister_peer_memory_client (err 0)
[Fri Feb 12 11:49:38 2021] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)
[Fri Feb 12 11:49:38 2021] nv_peer_mem: Unknown symbol ib_unregister_peer_memory_client (err 0)

Thanks in advance! nv_peer_mem-modprobe.txt

File attached with output requested from similar issue.
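For anyone comparing logs across nodes, here is a minimal sketch that pulls the unresolved symbol names out of dmesg text like the above. It is not part of nv_peer_mem or OFED; the function name `unknown_symbols` is made up for illustration, and it assumes the message format shown in this report.

```python
import re

# Matches kernel log lines of the form seen in this report, e.g.
#   nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)
UNKNOWN_SYMBOL = re.compile(
    r"(?P<module>\w+): Unknown symbol (?P<symbol>\w+) \(err (?P<err>-?\d+)\)"
)


def unknown_symbols(dmesg_text):
    """Return {module name: sorted list of unique unresolved symbols}."""
    found = {}
    for match in UNKNOWN_SYMBOL.finditer(dmesg_text):
        found.setdefault(match.group("module"), set()).add(match.group("symbol"))
    return {module: sorted(symbols) for module, symbols in found.items()}


if __name__ == "__main__":
    sample = (
        "[Fri Feb 12 11:47:20 2021] nv_peer_mem: Unknown symbol "
        "ib_register_peer_memory_client (err 0)\n"
        "[Fri Feb 12 11:47:20 2021] nv_peer_mem: Unknown symbol "
        "ib_unregister_peer_memory_client (err 0)\n"
    )
    print(unknown_symbols(sample))
```

Repeated modprobe attempts log the same pair of symbols again, so the sketch collapses duplicates and keeps one entry per symbol.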

ferasd commented 3 years ago

Did you upgrade OFED after installing nv_peer_mem? Try removing nv_peer_mem and installing it again; this should fix the issue.

kramanella commented 3 years ago

I reinstalled OFED 5.0 after installing nv_peer_mem. It builds successfully but throws these warnings:
depmod: WARNING: /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/nv_peer_mem.ko needs unknown symbol ib_register_peer_memory_client
depmod: WARNING: /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/nv_peer_mem.ko needs unknown symbol ib_unregister_peer_memory_client

Removing nv_peer_mem and installing again doesn't fix it either.

Looking around, the entries exist in nv_peer_mem.ko, but only as undefined (U) references:
[root@n120 modules]# nm -a /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/nv_peer_mem.ko | grep -E 'ib_register_peer_memory_client|ib_unregister_peer_memory_client'
U ib_register_peer_memory_client
U ib_unregister_peer_memory_client

On my system, OFED 5.0 installs its modules under /lib/modules/3.10.0-1127.el7.x86_64, where the kernel release is truncated compared with the running kernel (not sure if this is normal behavior):
uname -a
Linux n120 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 11 19:12:04 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
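To see at a glance which kernel directories actually contain a given module, something like the sketch below may help. It is only an illustration: `find_module_dirs` is a hypothetical helper, and it assumes the usual /lib/modules/<release>/ layout. On RHEL-like systems, modules built for the base kernel are often exposed to later kernels through symlinks under weak-updates, so a hit in another release's directory is not necessarily a problem by itself.

```python
import os
import platform


def find_module_dirs(module_name, root="/lib/modules"):
    """Map each kernel release directory under root to the paths where
    module_name (for example "ib_core") was found."""
    hits = {}
    if not os.path.isdir(root):
        return hits
    for release in sorted(os.listdir(root)):
        for dirpath, _dirs, files in os.walk(os.path.join(root, release)):
            for name in files:
                # Some distributions ship compressed modules (.ko.xz, .ko.gz).
                if name == module_name + ".ko" or name.startswith(module_name + ".ko."):
                    hits.setdefault(release, []).append(os.path.join(dirpath, name))
    return hits


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r`
    hits = find_module_dirs("ib_core")
    for release, paths in hits.items():
        marker = "(running kernel)" if release == running else ""
        print(release, marker)
        for path in paths:
            print("   ", path)
    if running not in hits:
        print("no ib_core module found under the running kernel:", running)
```

If the OFED build of ib_core.ko only shows up under a different release than `uname -r` reports, and no weak-updates link exists for the running kernel, modprobe would most likely pick up the in-box ib_core instead, which does not export the peer memory symbols.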

The module entries and valid symbols are found in ib_core.ko:
[root@n120 modules]# nm -a /lib/modules/3.10.0-1127.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_core.ko | grep -E 'ib_register_peer_memory_client|ib_unregister_peer_memory_client'
00000000fec50ade A __crc_ib_register_peer_memory_client
00000000bde5c050 A __crc_ib_unregister_peer_memory_client
0000000000011450 T ib_register_peer_memory_client
00000000000113c0 T ib_unregister_peer_memory_client
00000000000003e0 r __kcrctab_ib_register_peer_memory_client
0000000000000538 r __kcrctab_ib_unregister_peer_memory_client
0000000000000cc3 r __kstrtab_ib_register_peer_memory_client
0000000000000ca2 r __kstrtab_ib_unregister_peer_memory_client
00000000000007c0 r __ksymtab_ib_register_peer_memory_client
0000000000000a70 r __ksymtab_ib_unregister_peer_memory_client
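The comparison done by hand above can be sketched in a few lines: take the `U` entries from the consumer module's nm output and check them against the `__ksymtab_` entries of the candidate provider. The helper names are hypothetical, and this only compares names; it says nothing about whether the CRCs match or whether modprobe will actually load that particular ib_core.ko.

```python
KSYMTAB_PREFIX = "__ksymtab_"


def undefined_symbols(nm_output):
    """Names that nm reports with type U (referenced but not defined)."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "U":
            names.add(fields[1])
    return names


def exported_symbols(nm_output):
    """Names that have a __ksymtab_ entry, i.e. are exported by the module."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if fields and fields[-1].startswith(KSYMTAB_PREFIX):
            names.add(fields[-1][len(KSYMTAB_PREFIX):])
    return names


def missing_symbols(consumer_nm, provider_nm):
    """Symbols the consumer needs that the provider does not export."""
    return undefined_symbols(consumer_nm) - exported_symbols(provider_nm)


if __name__ == "__main__":
    consumer = "U ib_register_peer_memory_client\nU ib_unregister_peer_memory_client\n"
    provider = (
        "00000000000007c0 r __ksymtab_ib_register_peer_memory_client\n"
        "0000000000000a70 r __ksymtab_ib_unregister_peer_memory_client\n"
    )
    print(sorted(missing_symbols(consumer, provider)))
```

With the two outputs from this comment, nothing is missing by name, which suggests the OFED ib_core.ko is fine and the question is rather which ib_core.ko the running kernel ends up loading.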

Getting lost in this, more guidance please!

yug0slav commented 2 years ago

Installed:
nvidia_peer_memory.x86_64 0:1.1-0

Complete!
Uploading Enabled Repositories Report
Loaded plugins: fastestmirror, langpacks, nvidia, product-id, subscription-manager

- dmesg errors

[ 4828.021813] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)
[ 4828.021883] nv_peer_mem: Unknown symbol ib_unregister_peer_memory_client (err 0)
[ 4850.225402] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)
[ 4850.225460] nv_peer_mem: Unknown symbol ib_unregister_peer_memory_client (err 0)
[ 5164.062703] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err 0)
[ 5164.062748] nv_peer_mem: Unknown symbol ib_unregister_peer_memory_client (err 0)

tzafrir-mellanox commented 2 years ago

Some Ubuntu kernels seem not to have CONFIG_MODVERSIONS set (while others do), and this breaks building nv_peer_mem.

I'm not sure it would be OK to build on a kernel with no MODVERSIONS. But if anybody wants to tackle this: I guess the fix is to remove the parameter KBUILD_EXTRA_SYMBOLS= from the build command. Assuming that this actually helps, you can start by adding to the Makefile something along the lines of:

-include $(KDIR)/.config
ifneq (y,$(CONFIG_MODVERSIONS))

...
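Before touching the Makefile, it may be worth checking whether the kernel you build against has CONFIG_MODVERSIONS at all. A rough sketch follows; `config_enabled` is a made-up helper, and the /boot/config-<release> path is an assumption that holds on many RHEL and Ubuntu installs but not everywhere.

```python
import platform


def config_enabled(config_text, option):
    """True if the kernel config text contains an active `option=y` line.

    Disabled options normally appear as "# CONFIG_FOO is not set",
    which this check deliberately does not match.
    """
    wanted = option + "=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


if __name__ == "__main__":
    # Assumed location of the running kernel's config; adjust if your
    # distribution keeps it elsewhere (e.g. under the kernel build tree).
    path = "/boot/config-" + platform.release()
    try:
        with open(path) as handle:
            text = handle.read()
    except OSError as error:
        print("could not read", path, "-", error)
    else:
        print("CONFIG_MODVERSIONS set:", config_enabled(text, "CONFIG_MODVERSIONS"))
```

This mirrors what the `ifneq (y,$(CONFIG_MODVERSIONS))` test in the Makefile fragment would decide at build time, just done up front from a shell.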

Micket commented 2 years ago

I see the same error with ib_register_peer_memory_client and ib_unregister_peer_memory_client on Rocky Linux 8.4 with OFED 5.4-1.0.3.0, so it doesn't seem like the issue has anything to do with CONFIG_MODVERSIONS (maybe?)

After finding the 1.2 release (which doesn't seem to be mentioned on the Mellanox homepage?) and newer OFED drivers (which weren't listed in the Mellanox repo at https://linux.mellanox.com/public/repo/mlnx_ofed/ for some reason), I managed to build this successfully.