Azure / azhpc-images

Azure HPC/AI VM Images
MIT License

install `nv_peer_mem` kernel module for GPUDirect RDMA #316

Closed ltalirz closed 4 months ago

ltalirz commented 4 months ago

A customer on the Alma 8.7 image has run into

src/comm/transports/common/transport_ib_common.cpp:58: NULL value mem registration failed
src/comm/transports/ibrc/ibrc.cpp:486: non-zero status: 2 Unable to register memory handle.
src/mem/mem.cpp:349: non-zero status: 7 transport get memhandle failed
src/mem/mem.cpp:604: non-zero status: 7 register heap handle failed
src/mem/mem.cpp:622: non-zero status: 7 add physical memory failed
src/comm/device/proxy_device.cu:701: NULL value failed allocating proxy_channel_g_bufchannel creation failed

with NVSHMEM and noticed that the nv_peer_mem kernel module is missing.

$ lsmod | grep nv
nvidia_uvm           1376256  0
nvidia_drm             69632  0
nvidia_modeset       1245184  1 nvidia_drm
nvidia              56467456  106 nvidia_uvm,gdrdrv,nvidia_modeset
drm_kms_helper        176128  1 nvidia_drm
libnvdimm             200704  1 nfit
drm                   565248  4 drm_kms_helper,nvidia,nvidia_drm
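Since `grep nv` matches every NVIDIA-related module, a narrower check makes the result easier to read. The following is a minimal sketch, not part of the azhpc-images tooling: the function name `peermem_status` is hypothetical, and it only looks for the two module names that come up in this thread (`nv_peer_mem` from the Mellanox repository and `nvidia_peermem`). It reads `lsmod`-style output from stdin so it can also be run against saved output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which GPUDirect RDMA peer-memory module,
# if any, appears in lsmod-style output read from stdin.
# Only the two module names mentioned in this thread are checked.
peermem_status() {
    awk '
        # The first column of lsmod output is the module name.
        $1 == "nvidia_peermem" || $1 == "nv_peer_mem" { found = $1 }
        END {
            if (found) { print "loaded: " found; exit 0 }
            print "missing"
            exit 1
        }
    '
}

# Usage on a live system:
#   lsmod | peermem_status
```

Run against the listing above, this prints `missing` and returns a non-zero exit status, which matches the observation that neither module is loaded.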

See the docs https://docs.nvidia.com/nvshmem/api/faq.html

Would it make sense to add nv_peer_memory to the azhpc image? https://github.com/Mellanox/nv_peer_memory

Anything we would need to watch out for?

mentioning also @xpillons

jithinjosepkl commented 4 months ago

Yes, nv_peer_mem should be added to the Alma image as well. This can be fixed in the March release.

LiquidPT commented 4 months ago

I've tried this with the latest HPC AlmaLinux 8.7 image on an NDv4 VM, and nvidia_peermem is installed with the Mellanox driver and enabled.

[attached image not preserved]

What version of the HPC AlmaLinux 8.7 image are you using and on which SKU?

ltalirz commented 4 months ago

Interesting, thanks for checking!

This was using the azhop image 2023.0705.1612, which derives from almalinux:almalinux-hpc:8_7-hpc-gen2:8.7.2023060101

@xpillons According to the azhop docs, this is the latest base image for which azhop images are available. Perhaps worth looking into upgrading this?

Mentioning @matt-chan, @adam-grofe for info

xpillons commented 4 months ago

@ltalirz building new azhop image right now

LiquidPT commented 4 months ago

Closing as we're installing nvidia_peermem with the NVIDIA drivers on the current images