Mellanox / nv_peer_memory


Ubuntu deb missing dependency on mlnx-ofed-kernel #10

Closed drossetti closed 6 years ago

drossetti commented 7 years ago

The problem we are seeing on Ubuntu is that after a kernel + MLNX OFED upgrade, DKMS may try to build nv_peer_mem before ofa_kernel, so the /var/lib/dkms/mlnx-ofed-kernel/3.4/build/Module.symvers file is not present yet:

$ cat /var/lib/dkms/nvidia-peer-memory/1.1/build/make.log
DKMS make.log for nvidia-peer-memory-1.1 for kernel 4.2.0-27-generic (x86_64)
Tue Nov 22 10:47:32 PST 2016
cp -rf /Module.symvers .
cp: cannot stat ‘/Module.symvers’: No such file or directory
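
The log shows cp being given the path /Module.symvers, which would be consistent with the OFED build directory resolving to an empty string at build time. That is an inference from the log, not something confirmed in this thread. A minimal sketch of a guard that fails early with a clearer message follows; check_symvers is a hypothetical helper, not part of nv_peer_memory:

```shell
#!/bin/sh
# Sketch only: check_symvers is a hypothetical helper.
# Succeeds only if DIR is non-empty and contains Module.symvers.
check_symvers() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "OFED build directory is unset or empty" >&2
        return 1
    fi
    if [ ! -f "$dir/Module.symvers" ]; then
        echo "missing $dir/Module.symvers" >&2
        return 1
    fi
    echo "found $dir/Module.symvers"
}
```

It could be called as check_symvers /var/lib/dkms/mlnx-ofed-kernel/3.4/build before running make, so that a missing ofa_kernel build is reported directly rather than as a failed cp.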

drossetti commented 7 years ago

The reason 1.1 is shown in the log above is that we are still using the fork at drossetti/nv_peer_memory.

haggaie commented 7 years ago

@alaahl, could you have a look? I think we need to:

  1. Add the ofa_kernel and nvidia kernel modules to the Depends: line in the debian control file.
  2. Add these modules to the dkms.conf's BUILD_DEPENDS line.
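
A sketch of what item 2 might look like in dkms.conf, assuming the array form of BUILD_DEPENDS described in the dkms man page. The name mlnx-ofed-kernel is taken from the DKMS path quoted earlier in this issue. The name nvidia is an assumption and would need to match the PACKAGE_NAME that the NVIDIA driver registers with DKMS on the target system:

```shell
# dkms.conf fragment (dkms sources this file with bash).
# Each element names another DKMS-managed package by its PACKAGE_NAME,
# without a version or architecture.
BUILD_DEPENDS[0]="mlnx-ofed-kernel"
BUILD_DEPENDS[1]="nvidia"   # assumption: the driver's DKMS package name may differ
```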

alaahl commented 7 years ago

Hi @haggaie, you are right, they should be added to the Depends tag.

Regarding the kernel upgrade: I faced the same issue in the past. DKMS seems to build everything in parallel. Back then I disabled AUTOINSTALL for the packages that needed ofa-kernel, and used POST_INSTALL in the ofa-kernel dkms.conf to run a script that builds and installs the dependent modules against the new kernel and ofa-kernel.
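
The workaround described above could be sketched roughly as follows. This is not the actual script, which is not shown in the thread; rebuild_cmds and its MODULE/VERSION argument format are hypothetical. The function only prints the dkms build and dkms install commands, so the output can be reviewed before it is piped to sh:

```shell
#!/bin/sh
# Sketch only: print the dkms commands needed to rebuild dependent modules.
# Usage: rebuild_cmds KERNEL_VERSION MODULE/VERSION [MODULE/VERSION ...]
rebuild_cmds() {
    kver="$1"
    shift
    for spec in "$@"; do
        mod="${spec%%/*}"   # text before the first slash
        ver="${spec#*/}"    # text after the first slash
        echo "dkms build -m $mod -v $ver -k $kver"
        echo "dkms install -m $mod -v $ver -k $kver"
    done
}

# Example (commented out): rebuild nvidia-peer-memory 1.1 for one kernel.
# rebuild_cmds 4.2.0-27-generic nvidia-peer-memory/1.1 | sh
```

In the ofa-kernel dkms.conf, a wrapper script around this function might be referenced from POST_INSTALL with $kernelver as its first argument; the wrapper's name and location are left open here.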

I never used BUILD_DEPENDS, but we can try it. However, I see that it's not supported on Ubuntu 14.04 (and probably not on older versions either), so I'm not sure that it is the way to go.

haggaie commented 7 years ago

@alaahl, is there any harm in adding BUILD_DEPENDS just for 16.04? Will it break 14.04 if it is there?

alaahl commented 7 years ago

@haggaie, AFAIK they take only supported variables from dkms.conf, so BUILD_DEPENDS will be silently ignored when it's not supported by the DKMS tools (e.g. on 14.04).

JohnSpillerNvidia commented 7 years ago

any progress on this?

alaahl commented 6 years ago

I'll handle this.

JohnSpillerNvidia commented 6 years ago

Why is there a dependence on cuda? The Nvidia DGX-1 series installs nvidia-peer-memory, but does not have cuda installed, so I fear this will break our build... I think the dependency is on nvidia (the driver) not cuda.

alaahl commented 6 years ago

I thought that cuda (the CUDA meta-package) always has to be installed. I will revert https://github.com/Mellanox/nv_peer_memory/commit/2e28f47364d3850e4c59f3f1001f61d2b2d9f79a

alaahl commented 6 years ago

@ferasd please review https://github.com/Mellanox/nv_peer_memory/pull/30

Suhoy95 commented 6 years ago

Hello,

I've got the same (or a similar-looking) problem on CentOS 7.4:

tar xf nvidia-peer-memory_1.0.5.tar.gz
cd nvidia-peer-memory-1.0
./build_module.sh
rpmbuild --rebuild /tmp/nvidia_peer_memory-1.0-5.src.rpm

got

// ...
rpmbuild --rebuild /tmp/nvidia_peer_memory-1.0-5.src.rpm
+ cd nvidia_peer_memory-1.0
+ export KVER=3.10.0-693.el7.x86_64
+ KVER=3.10.0-693.el7.x86_64
+ make KVER=3.10.0-693.el7.x86_64 all
/root/rpmbuild/BUILD/nvidia_peer_memory-1.0/create_nv.symvers.sh 3.10.0-693.el7.x86_64
Getting symbol versions from /lib/modules/3.10.0-693.el7.x86_64/extra/nvidia.ko ...
Created: nv.symvers
Found /usr/src/nvidia-387.26/nvidia/nv-p2p.h
/bin/cp -f /usr/src/nvidia-387.26/nvidia/nv-p2p.h /root/rpmbuild/BUILD/nvidia_peer_memory-1.0/nv-p2p.h
cp -rf /Module.symvers .
cp: cannot stat '/Module.symvers': No such file or directory
make: *** [all] Error 1

Adding BUILD_DEPENDS="ofa_kernel nvidia" to the dkms.conf doesn't help. Also, I can't find ofa_kernel (lsmod | grep ofa_kernel).

I installed CentOS with @infiniband in the anaconda kickstart file, so I suppose I don't have all the required packages.

alaahl commented 6 years ago

Hi @Suhoy95, you should install MLNX_OFED; this module does not support the inbox drivers. From https://github.com/Mellanox/nv_peer_memory/blob/master/README.md:

Pre-requisites:

  1. NVIDIA compatible driver is installed and up.
  2. MLNX_OFED 2.1 is installed and up.
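
A rough way to tell whether an OFED stack is installed at all is to look for the ofed_info tool that ships with MLNX_OFED. The sketch below assumes ofed_info -s prints a one-line version string; ofed_flavor is a hypothetical helper, and a missing ofed_info only suggests, rather than proves, that the inbox drivers are in use:

```shell
#!/bin/sh
# Sketch only: ofed_flavor is a hypothetical helper.
# Reports the OFED version string if ofed_info is on PATH.
ofed_flavor() {
    if command -v ofed_info >/dev/null 2>&1; then
        echo "ofed: $(ofed_info -s)"
    else
        echo "no ofed_info found (inbox drivers?)"
    fi
}
```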

laochonlam commented 5 years ago

any solution? thanks