Closed v-burak closed 1 year ago
My pleasure. Any time!
And as expected:
Python 3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
From: Chris Lovett Sent: Friday, July 21, 2023 2:55:20 AM Subject: Re: [MicrosoftDocs/azure-docs] Extension is not successfully installed when secure boot is ON. (Issue #111536)
Fri Jul 21 02:54:25 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla M60                      Off | 00000001:00:00.0 Off |                  Off |
| N/A   22C    P8              13W / 150W |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
https://review.learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup?branch=pr-en-us-245529 Trying to access these docs takes me to the Microsoft sign-in page, and when I sign in with my account I get the message "Need admin approval" and am unable to view the docs.

I'm here because I had the NVIDIA driver installed on my Ubuntu 22.04 VM and CUDA was working (as was the nvidia-smi command), but it stopped after I ran "sudo apt full-upgrade", even though the Azure docs suggest running it every so often. I have secure boot enabled, and the "enroll new MOK" keys window popped up (the same one that @lovettchris showed) after I ran "sudo apt full-upgrade". I entered a password twice and continued, but I was unable to enter my password during reboot because I cannot see the boot screens when connecting to the VM via SSH or serial console. Thus my VM was not able to finish enrolling the new MOK key and third-party drivers were disabled, which apparently included my NVIDIA driver.

@darkwhite29 what can I do to fix this? Are there any options besides creating a new VM and installing from scratch? That doesn't seem like an ideal resolution, as we should be able to run a simple "apt upgrade" without breaking our VM's NVIDIA driver capabilities. Am I missing something here?

FYI, when I run "nvidia-smi" I get the error message: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Are you able to access this public page: https://review.learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup?branch=main#ubuntu
Let me try to reproduce your case before making any suggestions of solutions.
@darkwhite29 Thanks for the hasty response! And no I'm not able to access that page. I get the same "Need admin approval - unverified" message as before. I can't open it in an incognito tab or on a separate device either.
I wasn't able to reproduce your issue. The steps in the doc work fine for me. I did run "sudo apt update" and "sudo apt full-upgrade" after enrolling the key and installing CUDA, and "nvidia-smi" works fine (a reboot is needed). I'm not sure which step went wrong on your side.
Note when you reboot your VM, although you cannot see the reboot process from your terminal, you should be able to ssh to the VM after 3-5 minutes -- the reboot is running in the background.
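Since the boot screens are not visible over SSH, one way to know when the reboot has finished is to poll the SSH port until it accepts connections again. The following is only a minimal sketch of that idea: the function name, the timeout values, and the placeholder hostname in the comment are assumptions, not anything from the Azure docs.

```python
import socket
import time


def wait_for_port(host: str, port: int = 22, timeout_s: float = 600.0,
                  interval_s: float = 10.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something (normally sshd) is listening again.
            with socket.create_connection((host, port), timeout=5.0):
                return True
        except OSError:
            # Refused, unreachable, or timed out: the VM is probably still booting.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example call (hypothetical hostname):
#   wait_for_port("my-gpu-vm.example.com", timeout_s=600)
```

An open port only shows that the VM came back up. It says nothing about whether the NVIDIA kernel module was loaded, so "nvidia-smi" still has to be run afterwards.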
I really don't know why you need access to the doc -- it's a public website.
For your convenience, the core steps are attached.
@darkwhite29 Ok so I just followed those instructions exactly as they were provided. I didn't get any error messages or anything while running those commands, but when I run "nvidia-smi", unfortunately I get the same error as before: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I did get the MOK enrollment screen same as before, after I ran "sudo apt update && sudo apt install -y ubuntu-drivers-common". I entered a password twice as prompted. Then proceeded with the rest of the commands. My guess is that my problem stems from the new MOK being generated, and my inability to finish its enrollment by re-entering the password during a reboot because I cannot see the VM's boot screens via SSH or the serial console in Azure. I'm guessing this disables 3rd party drivers as the message states, and that includes the Nvidia drivers.
You wouldn't happen to know of any way I can see the VM's boot screens would you? If I could see the boot screen and re-enter my password to finish enrolling the new MOK, I think the nvidia drivers may start working. The only other solution I can think of is disabling secure boot. Here's a screenshot of the MOK screen that popped up after I ran "sudo apt update && sudo apt install -y ubuntu-drivers-common":
When you followed the installation steps, ran "nvidia-smi", and got the error message "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.", did you then reboot?
After installing everything, and before trying "nvidia-smi" to see whether the GPU is recognized, you need to reboot so that your changes take effect. If you don't, the commands you run later may interfere with the driver installation and stop it from working.
In this case, your VM may already have incorrectly installed driver versions. I suggest you start from scratch by creating a new VM. I did it from scratch and had no issues.
@darkwhite29 I bet that would work, but I have a lot of stuff on this VM that would make creating a new one quite a hassle.
If you're correct and I have a driver version issue, there might not be another simple option. But for now I'm going to have to keep trying to find some solution. I just don't think creating a new VM from scratch every time I run "apt upgrade" is acceptable for the work I'm trying to do on these VMs. My next move will probably be to disable secure boot and see if I can get the drivers working afterwards.
Regardless, thanks for your help. I appreciate it.
No, you don't need to create a VM every time you run "apt upgrade"; you just need to reboot your VM. After I ran it, I rebooted, and "nvidia-smi" still worked after the reboot.
Considering your case now, you may create a new VM from scratch now (without deleting your existing VM), just to simply verify if my suggested approach works for you or not.
@darkwhite29 I just rebooted my VM, and when I run "nvidia-smi" afterwards I still get the message: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Like I keep saying, I believe part of the issue is I am unable to see the boot screen that will prompt me to re-enter the password I provided to finish enrolling the new MOK after the message I screenshotted earlier. If I could somehow see the boot screens, whether via Azure's serial console or some other means, I think I could re-enter my password to enroll the new MOK and "nvidia-smi" would start working.
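The theory above (Secure Boot is on, a new MOK is queued but never confirmed, so the NVIDIA module is not loaded) can be checked from an SSH session without seeing the boot screen. The sketch below classifies the output of "mokutil --sb-state", "mokutil --list-new" and "lsmod". The function names and result labels are made up for illustration, and the way a pending request is detected (looking for a fingerprint line in the "--list-new" output) is an assumption about mokutil's output format.

```python
import subprocess


def run(cmd: list[str]) -> str:
    """Return combined stdout and stderr of cmd, or "" if the tool is not installed."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return ""
    return proc.stdout + proc.stderr


def diagnose(sb_state: str, pending_moks: str, lsmod: str) -> str:
    """Classify the machine state from raw command output."""
    secure_boot_on = "secureboot enabled" in sb_state.lower()
    # Assumption: a queued enrollment request is printed as certificate details
    # that include a fingerprint line; an empty queue is not.
    enrollment_pending = "fingerprint" in pending_moks.lower()
    # The first column of lsmod output is the module name.
    nvidia_loaded = any(line.split()[:1] == ["nvidia"] for line in lsmod.splitlines())

    if nvidia_loaded:
        return "ok"
    if secure_boot_on and enrollment_pending:
        return "mok-pending"
    if secure_boot_on:
        return "module-rejected-or-missing"
    return "driver-problem"


def diagnose_this_machine() -> str:
    """Collect the three inputs on the current VM (mokutil may need sudo)."""
    return diagnose(
        run(["mokutil", "--sb-state"]),
        run(["mokutil", "--list-new"]),
        run(["lsmod"]),
    )
```

A result of "mok-pending" would be consistent with the unfinished enrollment described in this thread. It does not by itself provide a way to complete the enrollment without access to the boot screen.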
@darkwhite29 by the way, I don't think I'm the only person experiencing this issue. It seems like the people in this reddit thread are experiencing the exact same one: https://www.reddit.com/r/AZURE/comments/undlna/mokmanager_and_serial_console/
Considering your case now, you may create a new VM from scratch now (without deleting your existing VM), just to simply verify if my suggested approach works for you or not.
That's fair. To be honest, I'm pretty sure your fix would work if I did make a new VM, because I followed those steps when I initially set up this VM and everything worked fine. I think the issue I'm dealing with occurs when you run "apt upgrade" after you've already installed the drivers and gotten "nvidia-smi" working. Perhaps it has something to do with a new kernel version being installed and is specific to Ubuntu 22.04.4?
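One way to test the "new kernel" hypothesis is to look for an NVIDIA kernel module under each installed kernel's module directory. This is a rough sketch that assumes the standard /lib/modules/<release> layout and searches it recursively for nvidia.ko (plain or compressed); the function names are hypothetical.

```python
import os
from pathlib import Path


def has_nvidia_module(kernel_release: str, modules_root: str = "/lib/modules") -> bool:
    """True if an nvidia.ko file (plain or compressed) exists for this kernel."""
    base = Path(modules_root) / kernel_release
    if not base.is_dir():
        return False
    # "nvidia.ko*" also matches compressed variants such as nvidia.ko.zst.
    return next(base.rglob("nvidia.ko*"), None) is not None


def kernels_without_nvidia(modules_root: str = "/lib/modules") -> list[str]:
    """Installed kernel releases for which no nvidia.ko was found."""
    root = Path(modules_root)
    if not root.is_dir():
        return []
    return sorted(d.name for d in root.iterdir()
                  if d.is_dir() and not has_nvidia_module(d.name, modules_root))


def running_kernel_has_nvidia() -> bool:
    """Check the kernel that is currently booted (same value as `uname -r`)."""
    return has_nvidia_module(os.uname().release)
```

If the running kernel shows up in kernels_without_nvidia(), the upgrade installed a kernel for which no driver module was built or installed. If the module is present but still not loaded, a signing or MOK problem is the more likely explanation.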
I went through the link you shared (thanks) -- it mentioned 3 options for a solution:
We are technically the support team, since we are experts in using NVIDIA drivers on Azure. I authored the MOK installation doc shared with you.
Unfortunately, Ubuntu-HPC images (also our team's product) don't work with Trusted Launch. We are working on supporting our HPC images with Trusted Launch; it's a long-standing three-way collaboration with Canonical and NVIDIA. Solutions are being set up, but the ETA is not clear for now.
I'm not sure about your case, but Trusted Launch (which enables secure boot) is now the default option for Azure VMs. You may disable it if that won't affect your use case, but this is not a long-term solution.
I understand that your error occurred after running "apt upgrade": "nvidia-smi" stopped working. The new MOK setup process somehow overwrote the previous working key and thus broke "nvidia-smi". In my attempt, running "apt upgrade" did give me a window, but not the one you showed; I simply selected "OK" in that screen to proceed. No issues afterwards.
Given that the MOK settings on your VM are already in a bad state, are you able to disable Trusted Launch for your VM to see if it works?
You can unselect "Enable secure boot" as highlighted and click "Apply".
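For anyone who prefers a script to the portal, the same toggle can probably be expressed through the Azure CLI. The sketch below only builds and prints the command line; it assumes that "az vm update" accepts "--enable-secure-boot" for an existing Trusted Launch VM, and the resource group and VM names are placeholders. The VM may need to be stopped before the setting can be changed.

```python
import shlex


def secure_boot_command(resource_group: str, vm_name: str, enabled: bool) -> list[str]:
    """Build (but do not run) an Azure CLI command that sets the Secure Boot flag."""
    return [
        "az", "vm", "update",
        "--resource-group", resource_group,
        "--name", vm_name,
        "--enable-secure-boot", "true" if enabled else "false",
    ]


# Placeholder names; replace them with your own resource group and VM.
print(shlex.join(secure_boot_command("my-rg", "my-gpu-vm", enabled=False)))
```

Printing the command instead of running it makes it easy to review before anything changes on the VM.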
I also modified the installation doc to clarify things a bit.
@darkwhite29 thanks for the suggestion, I think trying to get nvidia-smi working by disabling secure boot is a good idea. I'll give it a try soon and let you know how it goes.
@darkwhite29 I did nothing but disable secure boot in Azure, which automatically restarted the VM. I SSHed back into the VM after it was done restarting, ran "nvidia-smi" and boom, it worked!
As expected, congrats.
@darkwhite29 yeah, it's not an ideal solution, but I'm still happy to get it working for now. Thanks for your help and responsiveness!
The instructions are not working for Azure VM with Ubuntu 20.04 with size Standard NC16as T4 v3
Specifically, instructions for installing the GPU drivers with Secure Boot on Ubuntu 20.04 are not working. I was able to get the extension installed successfully when the secure boot is disabled.