@v-burak Thanks for the feedback! I have assigned the issue to the content author to check and update the document as appropriate.
@mamccrea Can you please check and add your comments on this doc update request as applicable.
Thanks for the feedback. I am assuming that you were trying to install the drivers on Ubuntu with Secure Boot enabled, and I want to confirm whether the driver was signed by a key trusted by the system.
Please note that "with Secure Boot enabled, all OS boot components (boot loader, kernel, kernel drivers) must be signed by trusted publishers [key trusted by the system]. Both Windows and select Linux distributions support Secure Boot."
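For reference, a quick way to check both of those things on a given VM (just a sketch, assuming the `mokutil` utility is available and an `nvidia` kernel module is installed) is:

```bash
# Is Secure Boot enabled for the current boot?
mokutil --sb-state

# Who signed the installed nvidia kernel module (empty output means unsigned)?
modinfo -F signer nvidia
```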
Could you confirm the same? Thanks!
Yes, I was trying to install the drivers on Ubuntu with Secure Boot enabled. In the wiki, there are no instructions about how to sign the OS boot components.
@v-burak, I would like you to take a look at this article, which outlines the instructions to install the GRID driver on Ubuntu with Secure Boot enabled.
@v-burak, Not sure if you have had a chance to look into the shared article, which outlines the instructions to install the GRID driver on Ubuntu with Secure Boot on. Kindly confirm so that we can go ahead and close this issue if there is no doc update required.
@Padmalathas In the shared article, it says: "Install GRID driver on Ubuntu with Secure Boot enabled: The GRID driver installation process does not offer any options to skip kernel module build and installation, so secure boot has to be disabled in Linux VMs in order to use them with GRID, after installing signed kernel modules." What I understand from this is that I can't install GRID drivers on Ubuntu.
In any case, in the original article that I opened this issue on, it would be good to mention that those steps work only when Secure Boot is disabled.
@v-burak Yes, you cannot install the GRID driver with Secure Boot on. You can only install the CUDA driver with Secure Boot on. This limitation comes from the GRID driver installer itself: it offers no option to skip loading the installed kernel modules and no way to choose a different source of kernel modules. Signed kernel modules therefore cannot be used, and the GRID installation will fail in this case. NVIDIA implements the installer and is responsible for this limitation.
The doc does specifically mention that with Secure Boot on, only the CUDA driver installation works (detailed installation steps are given), but not the GRID driver installation. If you want to use the GRID driver, you have to disable Secure Boot. Again, this is caused by how NVIDIA implements their GRID driver installer.
To make this clearer, we will add a reference sentence to the doc you are referring to. Thanks for bringing this issue up.
Thank you for clarifying it. @v-burak, additionally, I will try to have a sentence included in the referenced article to avoid future confusion.
@darkwhite29, @Padmalathas I should note that the instructions are still not working for CUDA driver installation when Secure Boot is on. They work if I turn Secure Boot off. In my original post I was referring to the CUDA driver, not the GRID driver.
Could you please share the error message you got, and at which step in the instructions it occurred? Thanks.
It was a couple of weeks ago, but the Azure portal was just showing the extension deployment as failed.
VMs with Secure Boot on require all kernel modules to be signed by a key trusted by the system. The Azure extension may not sign the kernel modules, and the deployment probably failed because of this.
Could you please try the manual installation of the CUDA driver with Secure Boot on (Trusted Launch VM)?
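If it helps with debugging, one way (just a sketch; exact log locations can vary by agent and extension version) to check from inside the VM whether an unsigned module was rejected, and to look at the agent log for the extension failure, is:

```bash
# Look for signature / verification messages from the kernel
sudo dmesg | grep -iE 'nvidia|module verification|signature'

# Review recent VM agent activity, including extension errors
sudo tail -n 100 /var/log/waagent.log
```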
I also tried this link a couple of weeks ago and it didn't work. I left a GitHub issue on that page as well:
https://github.com/MicrosoftDocs/azure-docs/issues/111495#issuecomment-1621013266
As I said previously, it would be good to update the doc to say that the instructions only work when Secure Boot is disabled.
Best
Thanks for pointing it out. I have updated the instructions of installing CUDA driver with secure boot enabled: https://review.learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup?branch=pr-en-us-245529
This issue has been addressed in both documents to clear up the confusion. Let us know if this helps; should there be any other issue, please open a new one. Thank you!
@darkwhite29, thanks for the updated docs. I just tried those steps, but it doesn't work; I still get this during `sudo apt-get install cuda` (screenshot of the password prompt not shown):
And clicking OK here breaks the virtual machine since the BIOS will be prompting for a password. If I reboot the machine, what happens is:

```
VMExtensionProvisioningError: VM has reported a failure when processing extension 'NvidiaGpuDriverLinux'. Error message: DPKG frontend (/var/lib/dpkg/lock-frontend) is locked by another process, please try reinstalling after sometime More information on troubleshooting is available at https://aka.ms/VMExtensionNvidiaGpuDriverLinuxTroubleshoot
```
This is on Standard_NV12s_v3 with the canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 image.
The only way I've been able to get CUDA drivers installed is to disable secure boot. Perhaps the docs should say that for now until a dev team can really dive into this and get something that works.
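For reference, a rough sketch of creating a Trusted Launch VM with Secure Boot turned off from the Azure CLI (resource names, size, and image here are placeholders; check that your chosen size supports Trusted Launch):

```bash
# Sketch only: Gen2 Trusted Launch VM with Secure Boot disabled
az vm create \
  --resource-group my-rg \
  --name my-gpu-vm \
  --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts-gen2:latest \
  --size Standard_NV12s_v3 \
  --security-type TrustedLaunch \
  --enable-secure-boot false \
  --enable-vtpm true \
  --admin-username azureuser \
  --generate-ssh-keys
```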
Did you provide a password and proceed? It works for me after I provide a password.
As the error message says, the system has another program/application/process running that holds the lock, so you need to wait for those to finish before using the extension to install the drivers.
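If it happens again, a quick way to see what is holding the lock (assuming `lsof` is installed; unattended upgrades running right after boot are a common culprit) is:

```bash
# Show the process that currently has the dpkg frontend lock open
sudo lsof /var/lib/dpkg/lock-frontend

# List any apt/dpkg processes still running
ps aux | grep -E 'apt|dpkg' | grep -v grep
```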
Ah, ok, I'll try again. I waited a really long time; how long did you have to wait for the driver to finish? Is there something in the journal I can look for that tells me when it's done?
By the way, I also get a warning on `sudo apt-key del 7fa2af80` saying `apt-key` is deprecated...
Since you have rebooted the VM, I don't see any live logs you can access without going into the VM. In my case, I waited around 10 minutes and it was fine.
Also, `sudo apt-get update` needs to run by itself; if you try to copy this whole block and run it, it stops after the update:

```bash
sudo apt-get update
sudo apt-get install cuda
sudo apt-get install nvidia-gds
```

Most real Linux folks will know this, but novice users may not... Perhaps the `sudo apt-get update` should be the first step by itself, even before the `apt-get install linux-headers` step.
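Alternatively, just as a sketch, chaining the commands with `&&` and `-y` keeps a copy-paste from stalling after the update, since each step only starts once the previous one succeeds and no interactive prompt is shown:

```bash
sudo apt-get update && \
  sudo apt-get install -y cuda && \
  sudo apt-get install -y nvidia-gds
```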
Yes, `apt-key` is deprecated. I grabbed the installation process from the official NVIDIA website. If you feel it is inappropriate, I think you can omit this step.
NVIDIA doc: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu
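As an aside, NVIDIA's newer network-repo instructions avoid `apt-key` by installing a `cuda-keyring` package instead; a sketch for Ubuntu 22.04 on x86_64 (the exact keyring version and URL may differ) looks like:

```bash
# Install NVIDIA's repository keyring package instead of using apt-key
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
```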
Good point. I will separate these steps.
Oh wait, another `apt-get update` also has to happen after installing the keyring... and yeah, we should probably keep the `sudo apt-key del 7fa2af80` step since it is on the NVIDIA site... I'm looking here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
Oh, the `$distro` variable is also important. I'm on 22.04, so I had to change the keyring path from `ubuntu2004` to `ubuntu2204`...
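To make the 22.04 case explicit, one way to set those variables before running the repository commands (the values here are assumptions for this particular VM) is:

```bash
# Ubuntu 22.04 on x86_64; adjust for your release and architecture
distro=ubuntu2204
arch=x86_64
echo "https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/"
```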
Are you entering a password here or leaving it blank?
I entered a password (twice, once for confirmation) of my choice.
Perhaps having these screenshots in the doc will help users be more comfortable doing this... I did this once, it didn't seem to work, and then I just threw up my hands, but if the doc shows these steps I would have tried harder...
I incorporated the suggested changes and compiled a final version of the instructions, which works for me. Could you please take a look and see if they work for you?
The steps worked and the VM is running, but something is still missing; I get this:

```
Python 3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
```
I'm investigating... Oh, I see your latest updates; a lot has changed, so I'm trying those new steps.
What do you get when you run `nvcc --version`? You can also run `dpkg -l | grep -i cuda` to view what CUDA packages have been installed.
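One more thing worth checking if the packages are installed but `nvcc` is still not found: the toolkit usually installs under `/usr/local/cuda`, which is not on the PATH by default. A possible fix, assuming a default install location:

```bash
# Make the CUDA toolkit visible in the current shell
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
nvcc --version
```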
If Secure Boot is not a hard requirement, you can always use our HPC images in the marketplace, which have commonly used HPC packages and libraries already installed for you, including the NVIDIA/CUDA drivers. Currently we have Ubuntu and Alma HPC images available. Just FYI.
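If you want to browse what's available, one way to search the marketplace from the CLI (offer names are an assumption and may differ over time) is:

```bash
# List Ubuntu-based HPC images published in the Azure Marketplace
az vm image list --all --offer ubuntu-hpc --output table
```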
I think the NVIDIA driver is still not loading because of Secure Boot; I get this:

```
(sr) smartreplayuser@srtrainerv2:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
```

As for `nvcc`, I get `Command 'nvcc' not found`, and when I try to install it I get these errors:
```
sudo apt install nvidia-cuda-toolkit
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libcuinj64-11.5 : Depends: libnvidia-compute-495 (>= 495) but it is not going to be installed or
                            libnvidia-compute-495-server (>= 495) but it is not installable or
                            libcuda.so.1 (>= 495) or
                            libcuda-11.5-1
 libnvidia-ml-dev : Depends: libnvidia-compute-495 (>= 495) but it is not going to be installed or
                             libnvidia-compute-495-server (>= 495) but it is not installable or
                             libnvidia-ml.so.1 (>= 495)
 nvidia-cuda-dev : Depends: libnvidia-compute-495 (>= 495) but it is not going to be installed or
                            libnvidia-compute-495-server (>= 495) but it is not installable or
                            libcuda.so.1 (>= 495) or
                            libcuda-11.5-1
                   Recommends: libnvcuvid1 but it is not installable
E: Unable to correct problems, you have held broken packages.
```
The reason I'm pushing on this is that I see a lot of chatter online about this, and if we can't get a reliable setup with Secure Boot, we should tell people not to try rather than give a bunch of instructions that don't work.
Could you please do a fresh installation by deleting the VM and launching a new one? It seems like your current VM has a mix of conflicting installations. `sudo apt install nvidia-cuda-toolkit` is not recommended since it will downgrade your CUDA version to 10. The errors you reported came up in my debugging process as well and have been addressed in the latest instructions. Could you please try them out in a brand-new VM?
Thanks!
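To see ahead of time which repository a package would come from and at what version (and so avoid the accidental downgrade), something like this can help:

```bash
# Compare candidate versions from the Ubuntu and NVIDIA repositories
apt-cache policy nvidia-cuda-toolkit cuda
```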
Yes, I'll start again from scratch and let you know how it goes.
Here's what I found:

The section "Install CUDA driver on Ubuntu with Secure Boot enabled" should be added to the toc at the top of the page.

```bash
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/ /"
```
What do I do about this question that pops up on the keyring install?
```
Configuration file '/etc/apt/preferences.d/cuda-repository-pin-600'
 ==> File on system created by you or by a script.
 ==> File also in package provided by package maintainer.
   What would you like to do about it ? Your options are:
    Y or I : install the package maintainer's version
    N or O : keep your currently-installed version
      D    : show the differences between the versions
      Z    : start a shell to examine the situation
 The default action is to keep your current version.
*** cuda-repository-pin-600 (Y/I/N/O/D/Z) [default=N] ?
*** cuda-repository-pin-600 (Y/I/N/O/D/Z) [default=N] ?
--- /etc/apt/preferences.d/cuda-repository-pin-600 2023-07-21 01:44:35.520199269 +0000
+++ /etc/apt/preferences.d/cuda-repository-pin-600.dpkg-new 2023-04-20 00:36:02.000000000 +0000
@@ -1,15 +1,11 @@
 Package: nsight-compute
 Pin: origin *ubuntu.com*
 Pin-Priority: -1
+
 Package: nsight-systems
 Pin: origin *ubuntu.com*
 Pin-Priority: -1
-Package: nvidia-modprobe
-Pin: release l=NVIDIA CUDA
-Pin-Priority: 600
-Package: nvidia-settings
-Pin: release l=NVIDIA CUDA
-Pin-Priority: 600
+
 Package: *
 Pin: release l=NVIDIA CUDA
-Pin-Priority: 100
+Pin-Priority: 600
```
Also, can `sudo apt-get install nvidia-gds` be done before the reboot? Since it is doing kernel preparation, I'm guessing it has to be done before the reboot... `nvidia-smi` says:

```
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
```
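A quick way to check whether the kernel module was actually built and loaded after the reboot (this assumes the driver was installed through the DKMS packages) would be:

```bash
# Is the nvidia module currently loaded?
lsmod | grep nvidia

# Was a module built (and signed) for the running kernel?
dkms status
sudo dmesg | grep -iE 'nvidia|module verification'
```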
Yes, it works for me if I select "Y". I can redo it once to confirm.
Ok, I'll have to try one more time from scratch then. I'd really love to get it working... did you also verify that PyTorch works?

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
python
>>> import torch
>>> torch.cuda.is_available()
True
```
```
litan2@litan2ubuntuSBtest:~$ nvidia-smi
Fri Jul 21 02:49:21 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-16GB Off | 00000001:00:00.0 Off | 0 |
| N/A 35C P0 25W / 250W | 99MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 853 G /usr/lib/xorg/Xorg 82MiB |
| 0 N/A N/A 1016 G /usr/bin/gnome-shell 16MiB |
+---------------------------------------------------------------------------------------+
```
No, I did not install Python, so I did not verify CUDA in Python. `nvcc --version` confirms the correct installation.
FYI, I am using Ubuntu Server 20.04 LTS on a Standard_NC6s_v3 VM size in the South Central US region.
Yay, it worked this time!
```
Fri Jul 21 02:54:25 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla M60 Off | 00000001:00:00.0 Off | Off |
| N/A 22C P8 13W / 150W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
```
Thanks so much for all your help and for improving the docs also.
The instructions are not working for an Azure VM running Ubuntu 20.04 with size Standard_NC16as_T4_v3.
Specifically, the instructions for installing the GPU drivers with Secure Boot on Ubuntu 20.04 are not working. I was able to get the extension installed successfully when Secure Boot is disabled.