Mellanox / nv_peer_memory


CentOS 7 problem: modprobe: ERROR: could not insert 'nv_peer_mem': Invalid argument #43

Closed. Chenxinjian closed this issue 6 years ago

Chenxinjian commented 6 years ago

[user@bogon ~]$ cd rpmbuild/
[user@bogon rpmbuild]$ cd RPMS/
[user@bogon RPMS]$ cd x86_64/
[user@bogon x86_64]$ ls
nvidia_peer_memory-1.0-7.x86_64.rpm
[user@bogon x86_64]$ rpm -ivh nvidia_peer_memory-1.0-7.x86_64.rpm
error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Permission denied)
[user@bogon x86_64]$ sudo rpm -ivh nvidia_peer_memory-1.0-7.x86_64.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:nvidia_peer_memory-1.0-7         ################################# [100%]
modprobe: ERROR: could not insert 'nv_peer_mem': Invalid argument

[user@bogon ~]$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
[user@bogon ~]$ uname -a
Linux gpu0 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[user@bogon ~]$ lspci |grep mellanox -i
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
[user@bogon ~]$ ofed_info|head -1
MLNX_OFED_LINUX-4.4-1.0.0.0 (OFED-4.4-1.0.0):
[user@bogon ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

EdwardZhang88 commented 6 years ago

Hi @Chenxinjian, have you figured out how to solve this problem? I have the same issue on my CentOS 7.4 server.

alaahl commented 6 years ago

Hi, what errors did you get in the "dmesg" output?

Anyway, I suggest recompiling the module (rebuild nvidia_peer_memory-1.0-7.x86_64.rpm). Probably something changed on the system and some symbol versions changed.
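
To answer the dmesg question quickly, the kernel log can be narrowed to lines that mention the module. This is only a minimal sketch: `module_messages` is a hypothetical helper name, not part of nv_peer_memory or MLNX_OFED, and it simply filters whatever log text it is given on stdin.

```shell
# Sketch: keep only the kernel log lines that mention a given module name.
# module_messages is a hypothetical helper; it reads log text on stdin.
module_messages() {
    # -F treats the module name as a fixed string, not a regular expression.
    grep -F -- "$1"
}

# Usage (reading the kernel log usually requires root):
#   sudo dmesg | module_messages nv_peer_mem | tail -n 20
```

Running it right after the failed `modprobe` keeps the relevant messages near the end of the log.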

Chenxinjian commented 6 years ago

@EdwardZhang88 I just reinstalled my system with CentOS 7.4 and successfully installed nv_peer_mem. Before installing nv_peer_mem, I rebuilt it. Maybe you can try that.

[user@gpu1 x86_64]$ sudo rpm -ivh nvidia_peer_memory-1.0-7.x86_64.rpm
[sudo] password for user:
Preparing...                          ################################# [100%]
Updating / installing...
   1:nvidia_peer_memory-1.0-7         ################################# [100%]
[user@gpu1 x86_64]$ sudo service nv_peer_mem start
starting... OK

EdwardZhang88 commented 6 years ago

@Chenxinjian Thanks for your reply. Hi @alaahl, is it true that nv_peer_mem is NOT supported on the MT27710 family [ConnectX-4 Lx], which is only 25Gb RoCE? dmesg shows that various nvidia_p2p_* symbols are unknown.

alaahl commented 6 years ago

Hi @EdwardZhang88, unknown symbol errors in dmesg are not related to which card is installed in the system. It's just a mismatch between the modules.

If the MLNX_OFED and NVIDIA packages are now properly installed on the system, try rebuilding the nv_peer_mem module (make sure to use the latest version). If the module still fails to load, please attach the full build log along with the dmesg output.
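
When preparing the dmesg output to attach, it can help to list exactly which symbols the kernel could not resolve. The sketch below assumes the usual kernel message form `<module>: Unknown symbol <name> (err N)`; `unknown_symbols` is a hypothetical helper name, and `nvidia_p2p_get_pages` in the usage note is only an illustrative symbol.

```shell
# Sketch: print the distinct symbol names from "Unknown symbol" kernel messages.
# unknown_symbols is a hypothetical helper; it reads log text on stdin.
unknown_symbols() {
    # Keep only the symbol name that follows "Unknown symbol", then de-duplicate.
    sed -n 's/.*Unknown symbol \([A-Za-z0-9_]*\).*/\1/p' | LC_ALL=C sort -u
}

# Usage:
#   sudo dmesg | unknown_symbols
# To see whether the running kernel currently provides one of them:
#   sudo grep -w nvidia_p2p_get_pages /proc/kallsyms
```

If a listed symbol is missing from `/proc/kallsyms`, the module that should export it is probably not loaded; if it is present, a version mismatch is more likely, which is what a rebuild addresses.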

EdwardZhang88 commented 6 years ago

Hi @alaahl, I noticed an error ("nvidia.ko.xz: file format not recognized") when I ran rpmbuild to rebuild the nv_peer_mem module. Do you have any idea what could be the cause?

[Screenshot: rpmbuild error, 2018-08-30 4:57 PM]

And this is the dmesg error log.

[Screenshot: dmesg error log, 2018-08-30 5:06 PM]

alaahl commented 6 years ago

So it looks similar to issue #40. We'll fix it soon.

If you need this urgently, you can try the workaround from issue #40 for now.
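
The rpmbuild error in this thread names `nvidia.ko.xz`, so one quick check is whether the installed NVIDIA kernel module is stored compressed. This is a rough sketch that only inspects the file name suffix; `module_compression` is a hypothetical helper, and it is not the workaround from issue #40.

```shell
# Sketch: classify a kernel module file by its suffix.
# module_compression is a hypothetical helper, not part of any package.
module_compression() {
    case "$1" in
        *.ko.xz) echo xz ;;
        *.ko.gz) echo gzip ;;
        *.ko)    echo none ;;
        *)       echo unknown ;;
    esac
}

# Usage (modinfo -n prints the path of the module file):
#   module_compression "$(modinfo -n nvidia)"
```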

EdwardZhang88 commented 6 years ago

@alaahl Great! It works. Thanks for your help.

alaahl commented 6 years ago

@EdwardZhang88, you're welcome.