v-burak commented 1 year ago

The instructions are not working for an Azure VM running Ubuntu 20.04 with size Standard NC16as T4 v3.

Specifically, the instructions for installing the GPU drivers with Secure Boot on Ubuntu 20.04 are not working. I was able to get the extension installed successfully when Secure Boot is disabled.

Document Details

⚠ Do not edit this section. It is required for learn.microsoft.com ➟ GitHub issue linking.

ID: 374c8361-9b31-2488-f04b-766e6305c91b
Version Independent ID: aebb9dd3-d1e6-4319-9ab1-0088bd8416e3
Content: NVIDIA GPU Driver Extension - Azure Linux VMs - Azure Virtual Machines
Content Source: articles/virtual-machines/extensions/hpccompute-gpu-linux.md
Service: virtual-machines
Sub-service: hpc
GitHub Login: @mamccrea
Microsoft Alias: mamccrea

VikasPullagura-MSFT commented 1 year ago

@v-burak Thanks for the feedback! I have assigned the issue to content author to check and update the document as appropriate.

VikasPullagura-MSFT commented 1 year ago

@mamccrea Can you please check and add your comments on this doc update request as applicable.

Padmalathas commented 1 year ago

Thanks for the feedback. I am assuming that you were trying to install the drivers on Ubuntu with Secure Boot enabled? I also want to confirm that the driver was signed by a key trusted by the system.

Please note that "with Secure Boot enabled, all OS boot components (boot loader, kernel, kernel drivers) must be signed by trusted publishers [key trusted by the system]. Both Windows and select Linux distributions support Secure Boot."

Could you confirm the same? Thanks!

v-burak commented 1 year ago

Yes, I was trying to install the drivers on Ubuntu with Secure Boot enabled. In the wiki, there are no instructions about how to sign the OS boot components.

Padmalathas commented 1 year ago

@v-burak, I would like you to take a look at this article which outlines the instructions to install GRID driver on Ubuntu with Secure Boot enabled

Padmalathas commented 1 year ago

@v-burak, Not sure if you have had a chance to look into the shared article which outlines the instructions to install GRID driver on Ubuntu with Secure Boot ON. Kindly confirm so that we can go ahead and close this issue if there is no doc update required.

v-burak commented 1 year ago

@Padmalathas In the shared article, it says: "Install GRID driver on Ubuntu with Secure Boot enabled The GRID driver installation process does not offer any options to skip kernel module build and installation, so secure boot has to be disabled in Linux VMs in order to use them with GRID, after installing signed kernel modules." What I understand from this is that I can't install GRID drivers on Ubuntu with Secure Boot enabled.

In any case, in the original article that I opened, it would be good to mention that those steps work only when secure boot is disabled.

darkwhite29 commented 1 year ago

@v-burak Yes, you cannot install the GRID driver with secure boot on; only the CUDA driver can be installed with secure boot on. This imbalance comes from the GRID driver installer itself, which has neither an option to skip loading the installed kernel modules nor support for choosing a different source of kernel modules. Signed kernel modules therefore cannot be used, and GRID installation will fail in this case. NVIDIA implemented it this way and is responsible for the imbalance.

darkwhite29 commented 1 year ago

The doc does specifically mention that, with secure boot on, only CUDA driver installation works (detailed installation steps are given), not GRID driver installation. If you want to use the GRID driver, you have to disable secure boot. Again, this is caused by how NVIDIA implements their GRID driver installation.

To be more clear, we will add a reference sentence to the doc you are referring to. Thanks for bringing this issue up.

Padmalathas commented 1 year ago

Thank you for clarifying it. @v-burak, additionally, I will try to have a sentence included in the referenced article to avoid future confusion.

v-burak commented 1 year ago

@darkwhite29, @Padmalathas I should note that the instructions are still not working for CUDA driver installation when secure boot is ON. They work if I turn secure boot OFF. In my original post I was referring to the CUDA driver, not the GRID driver.

darkwhite29 commented 1 year ago

Could you please share the error message you got and the step in the instructions at which it occurred? Thanks.

v-burak commented 1 year ago

It was a couple of weeks ago, but the Azure Portal was just showing the extension deployment as failed.

darkwhite29 commented 1 year ago

VMs with secure boot on require all kernel modules to be signed by a key trusted by the system. Installing through the Azure extension may leave the kernel modules unsigned, which is probably why it failed.

Could you please try out the manual installation of CUDA driver with secure boot on (Trusted Launch VM)?

https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup#install-cuda-driver-on-ubuntu-with-secure-boot-enabled

v-burak commented 1 year ago

I also tried this link a couple of weeks ago and it didn't work. I left a GitHub issue on that page as well:

https://github.com/MicrosoftDocs/azure-docs/issues/111495#issuecomment-1621013266

As I said previously, it'd be good to update the doc saying that instructions only work when secure boot is disabled.

Best

darkwhite29 commented 1 year ago

Thanks for pointing it out. I have updated the instructions of installing CUDA driver with secure boot enabled: https://review.learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup?branch=pr-en-us-245529

Padmalathas commented 1 year ago

This issue has been fixed in both documents to reduce the confusion. Let us know if this helps. Should there be any other issue, please open a new one. Thank you!

please-close

lovettchris commented 1 year ago

@darkwhite29, thanks for the updated docs. I just tried those steps, but they don't work; I still get this during sudo apt-get install cuda:

And clicking OK here breaks the virtual machine since the BIOS will be prompting for a password. If I reboot the machine what happens is:

VMExtensionProvisioningError: VM has reported a failure when processing extension 'NvidiaGpuDriverLinux'. Error message: DPKG frontend (/var/lib/dpkg/lock-frontend) is locked by another process, please try reinstalling after sometime More information on troubleshooting is available at https://aka.ms/VMExtensionNvidiaGpuDriverLinuxTroubleshoot

This is on Standard_NV12s_v3 with canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2.

The only way I've been able to get CUDA drivers installed is to disable secure boot. Perhaps the docs should say that for now until a dev team can really dive into this and get something that works.

darkwhite29 commented 1 year ago

Did you provide a password and proceed? It works for me after I provide a password.

darkwhite29 commented 1 year ago

As the error message says, the system has another program/application/process running that holds the lock, so you need to wait for those to finish before using the extension to install the drivers.
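A minimal sketch of waiting for the lock to clear before retrying, assuming `fuser` (from the psmisc package) is available on the VM. The function names `dpkg_lock_held` and `wait_for_unlock` are hypothetical helpers introduced here for illustration, not part of the extension or the official instructions.

```shell
# dpkg_lock_held: exits 0 while some process still has the dpkg frontend lock open.
# fuser returns 0 when at least one process is accessing the file.
dpkg_lock_held() {
    sudo fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1
}

# wait_for_unlock <checker> <timeout_seconds> [interval_seconds]
# Polls <checker> until it reports the lock is free or the timeout is reached.
wait_for_unlock() {
    local checker="$1" timeout="$2" interval="${3:-5}" waited=0
    while "$checker"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "still locked after ${waited}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "lock free after ${waited}s"
}
```

For example, `wait_for_unlock dpkg_lock_held 900 10` would poll every 10 seconds for up to 15 minutes before giving up.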

lovettchris commented 1 year ago

Ah, OK, I'll try again. I waited a really long time. How long did you have to wait for the driver to finish? Is there something in the journal I can look for that tells me when it's done?

lovettchris commented 1 year ago

by the way I also get a warning on sudo apt-key del 7fa2af80 saying apt-key is deprecated...

darkwhite29 commented 1 year ago

Since you have rebooted the VM, I don't see any live logs you can access without going into the VM. In my case, I waited for around 10 minutes and it was good.

lovettchris commented 1 year ago

and sudo apt-get update needs to run by itself - if you try and copy this whole block and run it it stops after the update.

sudo apt-get update
sudo apt-get install cuda
sudo apt-get install nvidia-gds

Most real Linux folks will know this but novice users may not... Perhaps the sudo apt-get update should be the first step by itself even before the apt-get install linux-headers.
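One likely reason a pasted block stalls (an assumption, not something confirmed in this thread) is that apt-get reads its confirmation prompt from the terminal and takes the remaining pasted lines as the answer. A minimal sketch of a workaround is to put the steps in a small script so each one runs in order and the run stops at the first failure. `run_steps` and `DRY_RUN` are hypothetical names introduced here for illustration.

```shell
# run_steps "<command line>" ...
# Runs each argument as one command line, in order, and stops at the first
# failure. With DRY_RUN=1 the commands are only printed, not executed.
run_steps() {
    local step
    for step in "$@"; do
        echo "+ $step"
        if [ "${DRY_RUN:-0}" != "1" ]; then
            # stdin stays attached to the terminal, so interactive prompts
            # (such as the Secure Boot password dialog) can still be answered.
            sh -c "$step" || { echo "failed: $step" >&2; return 1; }
        fi
    done
}

# Example (commented out so that sourcing this file changes nothing):
# run_steps "sudo apt-get update" \
#           "sudo apt-get install cuda" \
#           "sudo apt-get install nvidia-gds"
```

Running with `DRY_RUN=1` first is a cheap way to check the order of the steps before touching the system.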

darkwhite29 commented 1 year ago

Yes, apt-key is deprecated. I took the installation process from the official NVIDIA website. If you feel it is inappropriate, I think you can omit this step.

NVIDIA doc: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu

darkwhite29 commented 1 year ago

Good point. I will separate these steps.

lovettchris commented 1 year ago

oh, wait another update also has to happen after installing the keyring... and yeah, probably should keep the sudo apt-key del 7fa2af80 since it is on the NVIDIA site... I'm looking here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

lovettchris commented 1 year ago

Oh, the $distro thing is also important, I'm on 2204, so I had to change the keyring path from ubuntu2004 to ubuntu2204...
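A small sketch of deriving the `$distro` value from the running system instead of hard-coding it, assuming the NVIDIA repository naming follows the `ubuntu2004` / `ubuntu2204` pattern mentioned above. `cuda_repo_tag` is a hypothetical helper name, not something from the NVIDIA or Azure docs.

```shell
# cuda_repo_tag <ID> <VERSION_ID>
# Joins the distribution ID and the version with the dot removed,
# e.g. "ubuntu" and "22.04" become "ubuntu2204".
cuda_repo_tag() {
    local id="$1" version="$2"
    printf '%s%s\n' "$id" "$(printf '%s' "$version" | tr -d '.')"
}

# On a real VM the two inputs can be read from /etc/os-release:
#   . /etc/os-release
#   distro=$(cuda_repo_tag "$ID" "$VERSION_ID")
#   echo "$distro"
```

This avoids editing the keyring path by hand when moving between Ubuntu releases.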

darkwhite29 commented 1 year ago

All updated. Thanks.

https://review.learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup?branch=pr-en-us-245529

lovettchris commented 1 year ago

Are you entering a password here or leaving it blank?

darkwhite29 commented 1 year ago

I entered a password (twice, once for confirmation) of my choice.

lovettchris commented 1 year ago

Perhaps having these screen shots in the doc will help users be more comfortable doing this... I did this once, it didn't seem to work, and then I just threw up my hands, but if the doc shows these steps I would have tried harder...

darkwhite29 commented 1 year ago

I incorporated suggested changes and compiled a final version of instructions, which work for me. Could you please take a look if they work for you?

https://review.learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup?branch=pr-en-us-245529

lovettchris commented 1 year ago

The steps worked, the VM is running but something is still missing, I get this:

Python 3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False

I'm investigating... Oh, I see from your latest updates that a lot has changed, so I'm trying those new steps.

darkwhite29 commented 1 year ago

What do you get when you run nvcc --version?

You can also run dpkg -l | grep -i cuda to view what CUDA packages have been installed.

darkwhite29 commented 1 year ago

If secure boot is not a hard requirement, you can always use our HPC images in marketplace, which have commonly used HPC packages and libraries already installed for you, including NVIDIA/CUDA drivers. Currently we have Ubuntu and Alma HPC images available. Just FYI.

lovettchris commented 1 year ago

I think the nvidia driver is still not loading because of secure boot, I get this:

(sr) smartreplayuser@srtrainerv2:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
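A rough sketch of how one might check whether Secure Boot is what keeps the module from loading, assuming `mokutil`, `modinfo`, and `lsmod` are installed on the VM. `secure_boot_state` and `diagnose_nvidia_module` are hypothetical helper names introduced here for illustration.

```shell
# secure_boot_state "<text>"
# Classifies the text printed by `mokutil --sb-state`.
secure_boot_state() {
    case "$1" in
        *"SecureBoot enabled"*)  echo enabled ;;
        *"SecureBoot disabled"*) echo disabled ;;
        *)                       echo unknown ;;
    esac
}

# diagnose_nvidia_module: prints the Secure Boot state, the signer recorded in
# the nvidia kernel module (empty if unsigned), and whether it is loaded.
diagnose_nvidia_module() {
    echo "Secure Boot: $(secure_boot_state "$(mokutil --sb-state 2>&1)")"
    echo "nvidia module signer: $(modinfo -F signer nvidia 2>/dev/null || echo '<module not found>')"
    if lsmod | grep -q '^nvidia'; then
        echo "nvidia module: loaded"
    else
        echo "nvidia module: not loaded"
    fi
}
```

If Secure Boot is enabled and the signer is empty or not a key enrolled on the machine, that would be consistent with the nvidia-smi failure shown above.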

lovettchris commented 1 year ago

as for nvcc I get Command 'nvcc' not found and when I try and install it I get these errors:

sudo apt install nvidia-cuda-toolkit
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libcuinj64-11.5 : Depends: libnvidia-compute-495 (>= 495) but it is not going to be installed or
                            libnvidia-compute-495-server (>= 495) but it is not installable or
                            libcuda.so.1 (>= 495) or
                            libcuda-11.5-1
 libnvidia-ml-dev : Depends: libnvidia-compute-495 (>= 495) but it is not going to be installed or
                             libnvidia-compute-495-server (>= 495) but it is not installable or
                             libnvidia-ml.so.1 (>= 495)
 nvidia-cuda-dev : Depends: libnvidia-compute-495 (>= 495) but it is not going to be installed or
                            libnvidia-compute-495-server (>= 495) but it is not installable or
                            libcuda.so.1 (>= 495) or
                            libcuda-11.5-1
                   Recommends: libnvcuvid1 but it is not installable
E: Unable to correct problems, you have held broken packages.

lovettchris commented 1 year ago

The reason I'm pushing on this is that I see a lot of chatter online about it, and if we can't get a reliable setup on secure boot we should tell people not to try rather than give them a bunch of instructions that don't work.

darkwhite29 commented 1 year ago

Could you please do a fresh installation by deleting the VM and launching a new one? It seems like your current VM has a mix of conflicting installations.

sudo apt install nvidia-cuda-toolkit is not recommended since it will downgrade your CUDA version to 10.

I saw the errors you reported during my own debugging, and they have been addressed in the latest instructions. Could you please try them out in a brand new VM?

Thanks!

lovettchris commented 1 year ago

Yes, I'll start again from scratch and let you know how it goes.

lovettchris commented 1 year ago

Here's what I found:

1. Can you add "Install CUDA driver on Ubuntu with Secure Boot enabled" to the TOC at the top of the page?
2. sudo apt-get update in the first block also needs to be separated, or else NVIDIA_DRIVER_VERSION never gets set.
3. It is picking NVIDIA_DRIVER_VERSION 535.

4. This block also needs to be separated; I think my problem before was that it never ran the second step:

sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/ /"

5. What do I do about this question that pops up on the keyring install?

Configuration file '/etc/apt/preferences.d/cuda-repository-pin-600'
==> File on system created by you or by a script.
==> File also in package provided by package maintainer.
 What would you like to do about it ?  Your options are:
  Y or I  : install the package maintainer's version
  N or O  : keep your currently-installed version
    D     : show the differences between the versions
    Z     : start a shell to examine the situation
The default action is to keep your current version.
*** cuda-repository-pin-600 (Y/I/N/O/D/Z) [default=N] ?

6. When I type D I get this, so I think I want to type Y?

*** cuda-repository-pin-600 (Y/I/N/O/D/Z) [default=N] ? --- /etc/apt/preferences.d/cuda-repository-pin-600      2023-07-21 01:44:35.520199269 +0000
+++ /etc/apt/preferences.d/cuda-repository-pin-600.dpkg-new     2023-04-20 00:36:02.000000000 +0000
@@ -1,15 +1,11 @@
Package: nsight-compute
Pin: origin *ubuntu.com*
Pin-Priority: -1
+
Package: nsight-systems
Pin: origin *ubuntu.com*
Pin-Priority: -1
-Package: nvidia-modprobe
-Pin: release l=NVIDIA CUDA
-Pin-Priority: 600
-Package: nvidia-settings
-Pin: release l=NVIDIA CUDA
-Pin-Priority: 600
+
Package: *
Pin: release l=NVIDIA CUDA
-Pin-Priority: 100
+Pin-Priority: 600

7. sudo apt-get install cuda prompts for a password, so one has to be careful not to miss the sudo apt-get install nvidia-gds. Can sudo apt-get install nvidia-gds be done before reboot? Since it is doing kernel preparation, I'm guessing it has to be done before reboot...
8. The reboot takes quite a lot of time (several minutes), which is perhaps worth mentioning. Previously I thought it was horked, but it must be doing something really slow.

9. After reboot nvidia-smi says:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

10. "Verify CUDA driver is installed and loaded" the second time probably should say "Verify CUDA toolkit is installed"

darkwhite29 commented 1 year ago

1. This cannot be done, since the TOC lists only the top-level main section titles. "Install CUDA driver on Ubuntu with Secure Boot enabled" is a subsection under the "Install CUDA drivers on N-series VMs" main section.
2. Done as suggested.
3. Correct.
4. Done as suggested.
5. Type "Y" and press Enter. This has been covered in the instructions. I just made it more clear.
6. See 5.
7. Yes, both CUDA and GDS are installed before reboot.
8. I am not sure what the reboot time is for users in different locations, so it is hard to give a consistent figure.
9. This was because of the mis-selection in 5.
10. Done as suggested.

lovettchris commented 1 year ago

I did end up typing "Y", so are you sure? I mean you have tested all this on an Azure VM with secure boot and it is working for you?

darkwhite29 commented 1 year ago

Yes, it works for me if I select "Y". I can redo it once to confirm.

lovettchris commented 1 year ago

Ok, I'll have to try one more time from scratch then. I'd really love to get it working... did you also verify pytorch works?

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
python
>>> import torch
>>> torch.cuda.is_available() 
True

darkwhite29 commented 1 year ago

litan2@litan2ubuntuSBtest:~$ nvidia-smi
Fri Jul 21 02:49:21 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           Off | 00000001:00:00.0 Off |                    0 |
| N/A   35C    P0              25W / 250W |     99MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       853      G   /usr/lib/xorg/Xorg                           82MiB |
|    0   N/A  N/A      1016      G   /usr/bin/gnome-shell                         16MiB |
+---------------------------------------------------------------------------------------+

No, I did not install Python so did not verify CUDA in Python. nvcc --version confirms the correct installation.

darkwhite29 commented 1 year ago

FYI, I am using Ubuntu Server 20.04 LTS on the Standard_NC6s_v3 VM size in the South Central US region.

lovettchris commented 1 year ago

Yay, it worked this time!

Fri Jul 21 02:54:25 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla M60                      Off | 00000001:00:00.0 Off |                  Off |
| N/A   22C    P8              13W / 150W |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

lovettchris commented 1 year ago

Thanks so much for all your help and for improving the docs also.

MicrosoftDocs / azure-docs

Extension is not successfully installed when secure boot is ON. #111536
