Mellanox / nv_peer_memory


Error: nv_peer_mem: Unknown symbol ib_register_peer_memory_client #79

Closed NHZlX closed 3 years ago

NHZlX commented 3 years ago

Hi, I ran into the same problem as https://github.com/Mellanox/nv_peer_memory/issues/28

More Information:

# lspci | grep Mell
0000:b3:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0000:b3:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

# ofed_info -n
5.0-2.1.8

# ls -l /lib/modules
total 24
drwxr-xr-x  7 root root 4096 Nov 24 20:11 3.10.0-1062.18.1.el7.x86_64
drwxr-xr-x  3 root root 4096 Apr 22  2020 3.10.0-1062.9.1.el7.x86_64
drwxr-xr-x  3 root root 4096 Apr 20  2020 3.10.0-1062.el7.x86_64
drwxr-xr-x  3 root root 4096 Dec 25  2019 3.10.0-957.21.3.el7.x86_64
drwxr-xr-x. 3 root root 4096 Apr 20  2020 3.10.0-957.el7.x86_64
drwxr-xr-x  6 root root 4096 Nov 25 15:41 4.19.95-7

# ls -l /usr/src/ofa_kernel/
total 4
drwxr-xr-x 7 root root 4096 Aug  5 16:12 default

Help welcome!

NHZlX commented 3 years ago

Running the following command prints nothing:

sudo cat /proc/kallsyms | grep ib_register_peer_memory_client

NHZlX commented 3 years ago

Fixed it by reinstalling the Mellanox driver.
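For anyone repeating the /proc/kallsyms check from this thread, here is a minimal sketch that reports which loaded module, if any, exports a symbol. The helper name `exporter_of` is hypothetical and not part of any Mellanox or NVIDIA tool. It reads kallsyms-formatted text on stdin so it can also be run on saved output.

```shell
#!/bin/sh
# exporter_of SYMBOL: hypothetical helper, reads /proc/kallsyms-style lines on stdin.
# Prints the module that defines SYMBOL as a global text symbol ('T'),
# or "built-in" if the line carries no [module] column.
# Exits non-zero when no matching line is found.
exporter_of() {
    awk -v sym="$1" '
        # Columns: address, type, name, optional [module]
        $2 == "T" && $3 == sym {
            if ($4 == "")
                print "built-in"
            else
                print substr($4, 2, length($4) - 2)   # strip the [ ] brackets
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    '
}
```

Usage, assuming root access to read the live symbol table: `sudo cat /proc/kallsyms | exporter_of ib_register_peer_memory_client || echo "symbol not exported"`. An empty result matches the "prints nothing" case described above.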

fanfanaaaa commented 1 year ago

@NHZlX Hi, I ran into the same problem, but when I run sudo cat /proc/kallsyms | grep ib_register_peer_memory_client I get the following output:

ffffffffa0de25ac r __kstrtab_ib_register_peer_memory_client [ib_core]
ffffffffa0de25cb r __kstrtabns_ib_register_peer_memory_client   [ib_core]
ffffffffa0ddc54c r __ksymtab_ib_register_peer_memory_client [ib_core]
ffffffffa0ddb668 t ib_register_peer_memory_client.cold  [ib_core]
ffffffffa0dd8c60 T ib_register_peer_memory_client   [ib_core]

So I don't think I need to reinstall the Mellanox driver. Could you give me some help? More information:

uname -r

5.12.0-xrp-vhost-blk+

nvidia-smi

Wed May 17 06:33:53 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   44C    P0    68W / 300W |   1459MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   49C    P0    75W / 300W |    971MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3747496      C   python                           1457MiB |
|    1   N/A  N/A   3747496      C   python                            969MiB |
+-----------------------------------------------------------------------------+

ofed_info -s

OFED-internal-5.8-1.1.2:

ls -l /lib/modules

total 44
drwxr-xr-x 4 root root 4096 May  5 03:44 5.12.0-xrp+
drwxr-xr-x 4 root root 4096 May  5 03:44 5.12.0-xrp-vhost-blk+
drwxr-xr-x 3 root root 4096 Feb 10 06:44 5.4.0-132-generic
drwxr-xr-x 2 root root 4096 Jan  7 06:17 5.4.0-135-generic
drwxr-xr-x 2 root root 4096 Jan 13 06:14 5.4.0-136-generic
drwxr-xr-x 5 root root 4096 Jan 13 06:13 5.4.0-137-generic
drwxr-xr-x 5 root root 4096 Feb 10 06:43 5.4.0-139-generic
drwxr-xr-x 3 root root 4096 May  5 03:43 5.4.0-146-generic
drwxr-xr-x 2 root root 4096 May  5 02:24 5.4.0-148-generic
drwxr-xr-x 3 root root 4096 May  5 03:43 5.4.0-rc8+
drwxr-xr-x 4 root root 4096 May  5 02:27 6.1.0-KVM_EXIT_EFAULT_from_lwn_for_hyperdisk_dev-ga530af7b1987

ls -l /usr/src/ofa_kernel/

total 4
lrwxrwxrwx 1 root root   36 Mar 22 10:39 default -> /etc/alternatives/ofa_kernel_headers
drwxr-xr-x 8 root root 4096 May  5 03:39 x86_64

dmesg

[610939.210200] nvidia_peermem: disagrees about version of symbol ib_register_peer_memory_client
[610939.210206] nvidia_peermem: Unknown symbol ib_register_peer_memory_client (err -22)
[610985.271522] nvidia_peermem: disagrees about version of symbol ib_register_peer_memory_client
[610985.271530] nvidia_peermem: Unknown symbol ib_register_peer_memory_client (err -22)
[612950.903683] nvidia_peermem: disagrees about version of symbol ib_register_peer_memory_client
[612950.903694] nvidia_peermem: Unknown symbol ib_register_peer_memory_client (err -22)
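The dmesg lines above pair "Unknown symbol" with "disagrees about version of symbol". In general the kernel prints the second message when the symbol exists but its modversions CRC differs from the one the module was built against, which is a different failure from the symbol not being exported at all. A minimal sketch for telling the two cases apart in saved dmesg text follows; `classify_modload_errors` is a hypothetical helper name, and the labels it prints are my own wording, not kernel output.

```shell
#!/bin/sh
# classify_modload_errors: hypothetical helper, reads dmesg text on stdin.
# For every "Unknown symbol X" it prints whether a matching
# "disagrees about version of symbol X" line was also seen.
classify_modload_errors() {
    awk '
        /disagrees about version of symbol/ { mismatch[$NF] = 1 }
        /Unknown symbol/ {
            # The symbol name is the word right after "Unknown symbol"
            for (i = 1; i + 2 <= NF; i++)
                if ($i == "Unknown" && $(i + 1) == "symbol")
                    unknown[$(i + 2)] = 1
        }
        END {
            for (s in unknown) {
                if (s in mismatch)
                    print s ": version (CRC) mismatch"
                else
                    print s ": not exported by any loaded module"
            }
        }
    ' | sort
}
```

Usage: `dmesg | classify_modload_errors`. For the log in this comment the helper reports a version mismatch, which would be consistent with the symbol showing up in /proc/kallsyms while the module still fails to load.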

BEYHHH commented 1 year ago

I have the same problem as you. Did you find a way to fix it?

DaveiV commented 6 months ago

I have the same problem with OFED driver version 5.8.0.