Mellanox / nv_peer_memory


modprobe: ERROR: could not insert 'nv_peer_mem': Invalid argument #28

Closed MatthiasDING closed 6 years ago

MatthiasDING commented 6 years ago

My system settings: Ubuntu 16.04, CUDA 9.0, GPU driver 387.26.

I'm trying to install this module for GPUDirect RDMA, but an error occurs when I run sudo dpkg -i nvidia-peer-memory-dkms_1.0-5_all.deb:

(Reading database ... 162399 files and directories currently installed.)
Preparing to unpack nvidia-peer-memory-dkms_1.0-5_all.deb ...

-------- Uninstall Beginning --------
Module:  nvidia-peer-memory
Version: 1.0
Kernel:  4.4.0-104-generic (x86_64)
-------------------------------------

Status: Before uninstall, this module version was ACTIVE on this kernel.

nv_peer_mem.ko:
 - Uninstallation
   - Deleting from: /lib/modules/4.4.0-104-generic/updates/dkms/
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.

depmod....

DKMS: uninstall completed.

------------------------------
Deleting module version: 1.0
completely from the DKMS tree.
------------------------------
Done.
Unpacking nvidia-peer-memory-dkms (1.0-5) over (1.0-5) ...
Setting up nvidia-peer-memory-dkms (1.0-5) ...

Creating symlink /var/lib/dkms/nvidia-peer-memory/1.0/source ->
                 /usr/src/nvidia-peer-memory-1.0

DKMS: add completed.

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area....
make KERNELRELEASE=4.4.0-104-generic all KVER=4.4.0-104-generic KDIR=/lib/modules/4.4.0-104-generic/build....
cleaning build area....

DKMS: build completed.

nv_peer_mem:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.4.0-104-generic/updates/dkms/

depmod....

DKMS: install completed.
modprobe: ERROR: could not insert 'nv_peer_mem': Invalid argument

ferasd commented 6 years ago

Adding @alaahl. @MatthiasDING, can you please post the dmesg errors?

MatthiasDING commented 6 years ago

Here are the dmesg errors:

nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)

@ferasd @alaahl
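
The "disagrees about version of symbol" message means the CRC the module was built against differs from the CRC the running kernel exports for that symbol. As a rough sketch for comparing those values, the helper below prints the CRC recorded for a symbol in a Module.symvers file. The function name symvers_crc is made up for illustration, and it assumes the usual whitespace-separated layout in which the CRC is the first field and the symbol name the second.

```shell
#!/bin/sh
# Hypothetical helper: print the CRC recorded for an exported symbol
# in a Module.symvers file.
# Assumed layout per line: <crc> <symbol> <module> <export type>
symvers_crc() {
    # $1: path to a Module.symvers file
    # $2: symbol name to look up
    awk -v sym="$2" '$2 == sym { print $1; exit }' "$1"
}

# Example usage (the paths are placeholders; substitute the symvers
# files you want to compare):
#   symvers_crc /path/to/first/Module.symvers  ib_register_peer_memory_client
#   symvers_crc /path/to/second/Module.symvers ib_register_peer_memory_client
```

If the two files report different CRCs for ib_register_peer_memory_client, a module built against one of them will be rejected by a kernel whose ib_core was built against the other.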

alaahl commented 6 years ago

Hi @MatthiasDING, this looks like an issue with the MLNX_OFED Module.symvers file; the wrong file was probably used.

Which MLNX_OFED version is installed on the system? Also, please provide the outputs of these 2 commands:

ls -l /lib/modules

ls -l /usr/src/ofa_kernel/

MatthiasDING commented 6 years ago

MLNX_OFED version: MLNX_OFED_LINUX-4.2-1.2.0.0-ubuntu16.04-x86_64

Input: ls -l /lib/modules

total 8
drwxr-xr-x 6 root root 4096 Jan  7 19:27 4.4.0-104-generic
drwxr-xr-x 6 root root 4096 Dec 27 06:55 4.4.0-21-generic

ls -l /usr/src/ofa_kernel/

drwxr-xr-x 7 root root 4096 Dec 27 14:36 4.4.0-104-generic
drwxr-xr-x 7 root root 4096 Dec 27 06:53 4.4.0-21-generic
lrwxrwxrwx 1 root root   16 Dec 27 06:53 default -> 4.4.0-21-generic

@alaahl

alaahl commented 6 years ago

Thanks @MatthiasDING

This confirms what I suspected: the build used the Module.symvers from /usr/src/ofa_kernel/default, which points to headers built for 4.4.0-21-generic, but you are compiling against 4.4.0-104-generic.
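
The mismatch described above can be checked before building. The following is a minimal sketch, not part of the package: ofa_default_matches is a hypothetical helper that reports whether the default link under a given directory resolves to the directory named after a given kernel release.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if <dir>/default is a symlink
# whose target is named after the given kernel release, fail otherwise.
ofa_default_matches() {
    # $1: directory holding the per-kernel trees (e.g. /usr/src/ofa_kernel)
    # $2: kernel release to compare against (e.g. the output of uname -r)
    target=$(readlink "$1/default") || return 1
    # The link may be relative or absolute; compare its last path component.
    [ "$(basename "$target")" = "$2" ]
}

# Example usage:
#   ofa_default_matches /usr/src/ofa_kernel "$(uname -r)" \
#       || echo "default does not point at the running kernel"
```

With the listing posted in this thread, the check would fail for 4.4.0-104-generic, because default points to 4.4.0-21-generic.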

I will fix the Makefile to use the correct file. For now, you can work around it by changing the "default" link to point to the newer kernel. Run as root:

cd /usr/src/ofa_kernel
ln -snf 4.4.0-104-generic default

Now try to install nvidia-peer-memory-dkms again. This time it should use the correct symvers file and the module should load.
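
One way to confirm that the module actually loaded after reinstalling is to look for it in the kernel's module list. The sketch below is only an illustration; module_loaded is a hypothetical helper that reads /proc/modules by default, where the module name is the first field of each line.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the named module appears in
# the module list, fail otherwise.
module_loaded() {
    # $1: module name (e.g. nv_peer_mem)
    # $2: optional module list file; defaults to /proc/modules
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Example usage:
#   module_loaded nv_peer_mem && echo "nv_peer_mem is loaded"
```

If the check fails, the dmesg output is the place to look for the reason.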

MatthiasDING commented 6 years ago

Solved! Thanks, @alaahl.