Mellanox / nv_peer_memory

Help with installation: service file and .conf file generated, but no nv_peer_mem.ko #90

Closed bderksen20 closed 3 years ago

bderksen20 commented 3 years ago

Hello! I'm trying to install nv_peer_mem for use in an NVSHMEM container application and having some issues.

After following the install instructions (on a cluster that supposedly has Mellanox OFED installed), I get the output below when running dpkg -i nvidia-peer-memory-dkms_1.1-0_all.deb:

[image: output of the dpkg command]

The non-dkms nv_peer_mem .deb file installs just fine and the nv_peer_mem service file and .conf file are generated on my system, but the .ko file is missing. Does this simply indicate that there is something wrong with my Mellanox OFED installation?

bderksen20 commented 3 years ago

I reinstalled Mellanox OFED with DKMS support, and that seems to have resolved the missing .ko issue, but now I get a new error when I run dpkg -i ../nvidia-peer-memory-dkms_1.1-0_all.deb:

Terminal output:

(Reading database ... 258568 files and directories currently installed.)
Preparing to unpack .../nvidia-peer-memory_1.1-0_all.deb ...
Unpacking nvidia-peer-memory (1.1-0) over (1.1-0) ...
Selecting previously unselected package nvidia-peer-memory-dkms.
Preparing to unpack .../nvidia-peer-memory-dkms_1.1-0_all.deb ...
Unpacking nvidia-peer-memory-dkms (1.1-0) ...
Setting up nvidia-peer-memory (1.1-0) ...
Setting up nvidia-peer-memory-dkms (1.1-0) ...
Loading new nv_peer_mem-1.1 DKMS files...
Building for 5.4.0-74-generic 5.4.0-77-generic
Building initial module for 5.4.0-74-generic
ERROR (dkms apport): unable to determine source package for nvidia-peer-memory-dkms
Error! Bad return status for module build on kernel: 5.4.0-74-generic (x86_64)
Consult /var/lib/dkms/nv_peer_mem/1.1/build/make.log for more information.
Processing triggers for ureadahead (0.100.0-21) ...
Processing triggers for systemd (237-3ubuntu10.38) ...
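
The console output above names the kernels DKMS tried to build for, the kernel whose build failed, and the log file to read next. A minimal sketch of pulling those three facts out of such output follows. The helper name `parse_dkms_output` and the regular expressions are illustrative assumptions based only on the message wording quoted in this thread; other DKMS versions may phrase these lines differently.

```python
import re

# Console lines copied from the dpkg run quoted in this thread.
SAMPLE_OUTPUT = """\
Loading new nv_peer_mem-1.1 DKMS files...
Building for 5.4.0-74-generic 5.4.0-77-generic
Building initial module for 5.4.0-74-generic
Error! Bad return status for module build on kernel: 5.4.0-74-generic (x86_64)
Consult /var/lib/dkms/nv_peer_mem/1.1/build/make.log for more information.
"""

# Patterns are assumptions derived from the wording above, not a DKMS spec.
FAILED_BUILD = re.compile(r"Bad return status for module build on kernel: (\S+)")
MAKE_LOG = re.compile(r"Consult (\S+) for more information")


def parse_dkms_output(text):
    """Hypothetical helper: collect target kernels, failed kernels and log paths."""
    targeted, failed, logs = [], [], []
    for raw in text.splitlines():
        line = raw.strip()
        # "Building for <kernel> <kernel> ..." lists every kernel DKMS targets.
        if line.startswith("Building for "):
            targeted = line[len("Building for "):].split()
        match = FAILED_BUILD.search(line)
        if match:
            failed.append(match.group(1))
        match = MAKE_LOG.search(line)
        if match:
            logs.append(match.group(1))
    return {"targeted": targeted, "failed": failed, "logs": logs}


if __name__ == "__main__":
    print(parse_dkms_output(SAMPLE_OUTPUT))
```

Comparing the `failed` list against the `targeted` list shows which kernels still lack a built module, and `logs` points at the make.log worth reading next.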

make.log file contents:

DKMS make.log for nv_peer_mem-1.1 for kernel 5.4.0-74-generic (x86_64)
Tue Jul 27 14:12:11 EDT 2021
INFO: Building with MLNX_OFED from: /usr/src/ofa_kernel/5.4.0-74-generic
awk: cannot open nvidia_peer_memory.spec (No such file or directory)
/var/lib/dkms/nv_peer_mem/1.1/build/create_nv.symvers.sh 5.4.0-74-generic
-E- Cannot locate nvidia modules!
CUDA driver must be installed before installing this package!
Makefile:109: recipe for target 'gen_nv_symvers' failed
make: *** [gen_nv_symvers] Error 1

bderksen20 commented 3 years ago

I've seemingly resolved my issue by reinstalling and updating my CUDA drivers to 11.4 in addition to the Mellanox OFED reinstall.
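
For anyone hitting the same build failure, here is a minimal sketch of scanning a DKMS make.log for the messages quoted in this thread and mapping them to the fix that worked here. The function name `summarize_make_log`, the patterns, and the hint text are assumptions made for illustration; they reflect only this thread's log and outcome, not official guidance from the nv_peer_memory maintainers.

```python
import re

# Log contents copied from the make.log quoted in this thread.
SAMPLE_LOG = """\
DKMS make.log for nv_peer_mem-1.1 for kernel 5.4.0-74-generic (x86_64)
Tue Jul 27 14:12:11 EDT 2021
INFO: Building with MLNX_OFED from: /usr/src/ofa_kernel/5.4.0-74-generic
awk: cannot open nvidia_peer_memory.spec (No such file or directory)
/var/lib/dkms/nv_peer_mem/1.1/build/create_nv.symvers.sh 5.4.0-74-generic
-E- Cannot locate nvidia modules!
CUDA driver must be installed before installing this package!
Makefile:109: recipe for target 'gen_nv_symvers' failed
make: *** [gen_nv_symvers] Error 1
"""

HEADER = re.compile(r"DKMS make\.log for (\S+) for kernel (\S+) \((\S+)\)")
OFED_SRC = re.compile(r"Building with MLNX_OFED from: (\S+)")

# Hint wording is this sketch's own summary of how the thread was resolved.
DRIVER_HINT = (
    "NVIDIA kernel modules were not found for this kernel; in this thread, "
    "reinstalling/updating the CUDA driver resolved the build."
)
KNOWN_FAILURES = [
    ("Cannot locate nvidia modules", DRIVER_HINT),
    ("CUDA driver must be installed", DRIVER_HINT),
]


def summarize_make_log(text):
    """Hypothetical helper: extract build context and known failure hints."""
    summary = {"module": None, "kernel": None, "arch": None,
               "ofed_source": None, "hints": []}
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            summary["module"], summary["kernel"], summary["arch"] = header.groups()
        ofed = OFED_SRC.search(line)
        if ofed:
            summary["ofed_source"] = ofed.group(1)
        for needle, hint in KNOWN_FAILURES:
            # Several log lines can point at the same cause; report it once.
            if needle in line and hint not in summary["hints"]:
                summary["hints"].append(hint)
    return summary


if __name__ == "__main__":
    for key, value in summarize_make_log(SAMPLE_LOG).items():
        print(f"{key}: {value}")
```

The sketch deliberately ignores the awk warning about nvidia_peer_memory.spec, since the thread does not establish whether that message contributes to the failure.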