MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International

NVIDIA CUDA drivers and Azure RDMA drivers cannot be used simultaneously as described by the documentation. #39066

Closed dops0 closed 5 years ago

dops0 commented 5 years ago

The documentation here describes how to install CUDA drivers on NC Azure instances. This works fine for installing the NVIDIA drivers; however, the documentation goes on to describe how to enable RDMA on compatible NCr instances, and that section doesn't seem to work as expected.

Installing the NVIDIA CUDA drivers as described by this documentation on an NC24r instance will break RDMA capability, since some of the libraries are upgraded by the way these drivers are installed. Does this mean one can either use RDMA or use the NVIDIA GPUs, even though the NCr instances come at a much higher cost?

Please let me know how to go about installing the NVIDIA CUDA drivers on a CentOS 7.4 HPC image and have RDMA working as well.

The following errors are shown in dmesg after rebooting the instance following the NVIDIA driver installation.

[ 385.669815] hv_network_direct: disagrees about version of symbol vmbus_driver_unregister
[ 385.680666] hv_network_direct: Unknown symbol vmbus_driver_unregister (err -22)
[ 385.689372] hv_network_direct: disagrees about version of symbol vmbus_sendpacket
[ 385.697852] hv_network_direct: Unknown symbol vmbus_sendpacket (err -22)
[ 385.704883] hv_network_direct: disagrees about version of symbol vmbus_close
[ 385.712162] hv_network_direct: Unknown symbol vmbus_close (err -22)
[ 385.718509] hv_network_direct: disagrees about version of symbol vmbus_recvpacket_raw
[ 385.726548] hv_network_direct: Unknown symbol vmbus_recvpacket_raw (err -22)
[ 385.733457] hv_network_direct: disagrees about version of symbol vmbus_open
[ 385.740203] hv_network_direct: Unknown symbol vmbus_open (err -22)
[ 385.746272] hv_network_direct: disagrees about version of symbol vmbus_driver_register
[ 385.753976] hv_network_direct: Unknown symbol vmbus_driver_register (err -22)
[ 385.761256] hv_network_direct: disagrees about version of symbol vmbus_sendpacket_mpb_desc
[ 385.769116] hv_network_direct: Unknown symbol vmbus_sendpacket_mpb_desc (err -22)
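The symbol-version errors above typically mean the module was built against a different kernel than the one currently running. As a rough sketch of how to confirm that (the `check_vermagic` helper is illustrative, not an existing tool; module and command names assume a standard CentOS 7.x image):

```shell
# Sketch: compare the kernel version a module was built for against the
# running kernel. A mismatch is what produces the "disagrees about
# version of symbol" errors above.
check_vermagic() {
  # $1 = module vermagic, $2 = running kernel release
  if [ "$1" != "$2" ]; then
    echo "mismatch: module=$1 kernel=$2"
  else
    echo "ok"
  fi
}

# On the VM itself (modinfo prints nothing if the module is absent):
check_vermagic \
  "$(modinfo -F vermagic hv_network_direct 2>/dev/null | awk '{print $1}')" \
  "$(uname -r)"
```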

Thanks.

/D



mimckitt commented 5 years ago

Thanks for the feedback! We are currently investigating and will update you shortly.

mimckitt commented 5 years ago

In the doc, we do have a tip that calls out the following:

As an alternative to manual CUDA driver installation on a Linux VM, you can deploy an Azure Data Science Virtual Machine image. The DSVM editions for Ubuntu 16.04 LTS or CentOS 7.4 pre-install NVIDIA CUDA drivers, the CUDA Deep Neural Network Library, and other tools.

Would you be able to try this out and see if you can get it working?

dops0 commented 5 years ago

Not sure, because we would like to try using multiple GPU instances over an RDMA connection and we'd like to use the latest version of CUDA as well. I'll be trying to install the CUDA drivers with the run file instead of the dkms method and see if it behaves differently. Something causes the RDMA drivers to break down when CUDA is installed by the DKMS method.
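When experimenting with the run-file install versus the DKMS method, a quick way to tell whether a given attempt left both stacks usable is to check that the NVIDIA module and the RDMA module are loaded after reboot. A minimal sketch (the `report_modules` helper is illustrative, not an existing command):

```shell
# Sketch: after trying a driver install, report whether each expected
# module appears in the lsmod output.
report_modules() {
  lsmod_out="$1"; shift
  for m in "$@"; do
    # lsmod lines look like "name  size  used-by"
    if printf '%s\n' "$lsmod_out" | grep -q "^$m "; then
      echo "$m loaded"
    else
      echo "$m NOT loaded"
    fi
  done
}

report_modules "$(lsmod 2>/dev/null)" nvidia hv_network_direct
```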

mimckitt commented 5 years ago

Got it. That makes sense :)

I will play around and see what I can repro as well.

mimckitt commented 5 years ago

@cynthn do you know the contact for the N series VMs? I was going to try a repro of this but my subscription won't support the NCr instances.

dops0 commented 5 years ago

I wanted to update this issue with additional information.

I can confirm that the extension method documented at https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux behaves the same way. The latest NVIDIA GPU driver and CUDA get installed, but RDMA is broken afterwards. It looks like NCr machines are unusable with the latest CUDA drivers from NVIDIA on the CentOS HPC image. I will probably have to try different driver versions to figure out which one works, if any version ever worked!

2019/09/19 16:38:56.103424 ERROR Command: 'modprobe hv_network_direct'
2019/09/19 16:38:56.103777 ERROR Return code: 1
2019/09/19 16:38:56.104513 ERROR Result: modprobe: ERROR: could not insert 'hv_network_direct': Invalid argument
2019/09/19 16:38:56.104914 ERROR RDMA: failed to load module hv_network_direct
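To correlate the modprobe failure above with the earlier symbol-version errors, one option is to count how many such errors the module logged. A small sketch (the `count_symbol_errors` helper is illustrative):

```shell
# Sketch: count the symbol-version errors hv_network_direct logged, to
# confirm the "Invalid argument" from modprobe comes from a
# module/kernel mismatch rather than some other failure.
count_symbol_errors() {
  printf '%s\n' "$1" | grep -c 'disagrees about version'
}

count_symbol_errors "$(dmesg 2>/dev/null | grep hv_network_direct)" || true
```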

mimckitt commented 5 years ago

Thanks for this info @dops0 ! This will be very valuable for the engineering teams. I am working offline to figure out what is going on and any recommended solutions/workarounds. We will update this post once we have more.

mimckitt commented 5 years ago

Just an FYI, we are still working on this.

As a temporary workaround, you can reinstall the LIS RDMA package.

The hv_vmbus.ko that comes with the LIS RDMA package is reverted to the built-in hv_vmbus.ko during the NVIDIA installation.
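The workaround above can be sketched roughly as follows. The tarball name and extracted directory are placeholders; use whichever LIS release matches your image, and note that a reboot is needed for the reinstalled modules to take effect:

```shell
# Hedged sketch of the workaround: reinstall the LIS package so its
# hv_vmbus / hv_network_direct modules replace the in-tree ones that
# the NVIDIA installation switched back in. File and directory names
# below are placeholders.
tar xzf lis-rpms-<version>.tar.gz
cd LISISO            # extracted directory name may differ by release
sudo ./install.sh    # reinstalls the LIS kernel modules
sudo reboot          # boot with the LIS hv_* modules again
```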

I am working to determine if this is something that will be patched or we need to add it to the documentation.

mimckitt commented 5 years ago

@dops0 I added a note to the troubleshooting section of this document stating that if RDMA connectivity is lost after updating to the latest NVIDIA drivers, reinstalling the RDMA drivers will reestablish that connectivity. This should be corrected in a future NVIDIA update, but for now the warning should be enough.

Once the PR merges the changes will go live after a few hours.