Closed davecore82 closed 1 year ago
I installed the nvidia drivers and utils just to run nvidia-smi to get more information:
ubuntu@blanka:~$ nvidia-smi
Tue Mar 23 13:06:13 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39 Driver Version: 460.39 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:07:00.0 Off | 0 |
| N/A 26C P0 43W / 400W | 4MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB Off | 00000000:0F:00.0 Off | 0 |
| N/A 25C P0 44W / 400W | 4MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB Off | 00000000:47:00.0 Off | 0 |
| N/A 27C P0 44W / 400W | 4MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB Off | 00000000:4E:00.0 Off | 0 |
| N/A 26C P0 40W / 400W | 4MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 A100-SXM4-40GB Off | 00000000:87:00.0 Off | 0 |
| N/A 30C P0 42W / 400W | 4MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 A100-SXM4-40GB Off | 00000000:90:00.0 Off | 0 |
| N/A 29C P0 45W / 400W | 4MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 A100-SXM4-40GB Off | 00000000:B7:00.0 Off | 0 |
| N/A 29C P0 42W / 400W | 4MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 A100-SXM4-40GB Off | 00000000:BD:00.0 Off | 0 |
| N/A 30C P0 45W / 400W | 4MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4916 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 4916 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 4916 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 4916 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 4916 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 4916 G /usr/lib/xorg/Xorg 4MiB |
| 6 N/A N/A 4916 G /usr/lib/xorg/Xorg 4MiB |
| 7 N/A N/A 4916 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
I found a recipe that works for me. I'm using microk8s on Ubuntu 20.04. Here's what I had to do to make this work on both an Nvidia DGX A100 and a ProLiant DL380 Gen10 machine with a T4 GPU:
1. Start without any nvidia drivers installed (or apt purge the nvidia packages if you can't)
2. Blacklist nouveau (add modprobe.blacklist=nouveau nouveau.modeset=0 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, run sudo update-grub, and reboot)
3. Install nvidia-fabricmanager-460 from the cuda repos (you won't be able to enable the systemd service until the K8s GPU operator has loaded the drivers)
4. microk8s enable dns (tip: make sure your DNS is working by launching a test pod and resolving internal and external hostnames)
5. microk8s enable gpu
6. Enable the fabric manager service (sudo systemctl --now enable nvidia-fabricmanager)

FYI, I did a write up on my adventures with microk8s and MIG on the A100: https://discuss.kubernetes.io/t/my-adventures-with-microk8s-to-enable-gpu-and-use-mig-on-a-dgx-a100/15366
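For the nouveau blacklisting step in the recipe above, here is a minimal sketch of a helper that builds the new GRUB_CMDLINE_LINUX_DEFAULT value without duplicating arguments that are already there. The function name append_kernel_args is made up for illustration; it only prints the resulting string and does not touch /etc/default/grub.

```shell
# Hypothetical helper (not part of microk8s or GRUB): append kernel arguments
# to an existing GRUB_CMDLINE_LINUX_DEFAULT value, skipping any argument that
# is already present. It prints the result and modifies no files.
append_kernel_args() {
    current="$1"
    shift
    for arg in "$@"; do
        case " $current " in
            *" $arg "*) ;;                              # already present
            *) current="${current:+$current }$arg" ;;   # append with a space
        esac
    done
    printf '%s\n' "$current"
}

# Example (prints the value to paste into /etc/default/grub):
# append_kernel_args "quiet splash" modprobe.blacklist=nouveau nouveau.modeset=0
# After editing the file by hand: sudo update-grub, then reboot.
```

Keeping the helper read-only means the actual edit of /etc/default/grub stays a deliberate manual step.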
I followed the recipe and ran into an issue when enabling nvidia-fabricmanager:
fabric manager NVIDIA GPU driver interface version 460.91.03 don't match with driver version 460.73.01. Please update with matching NVIDIA driver package.
It looks like microk8s enable gpu always loads nvidia drivers 460.73.01. Do you guys know how to harmonize the version of fabric manager with the driver version?
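One way to spot this kind of mismatch before enabling the service is to compare the installed fabric manager package version with the loaded driver version. The sketch below is only an illustration: the function names are made up, and it assumes the package version is the driver version followed by a Debian revision suffix such as "-1".

```shell
# Hypothetical helpers, assuming package versions look like "460.91.03-1".

# Strip everything from the first hyphen, e.g. "460.91.03-1" -> "460.91.03".
upstream_version() {
    printf '%s\n' "${1%%-*}"
}

# Succeeds only when both arguments have the same upstream version.
fm_matches_driver() {
    [ "$(upstream_version "$1")" = "$(upstream_version "$2")" ]
}

# Usage on a real host (commented out; needs dpkg and a loaded NVIDIA driver):
# fm=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-460)
# drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
# fm_matches_driver "$fm" "$drv" || apt-cache madison nvidia-fabricmanager-460
```

If the versions differ, apt-cache madison lists the package versions the configured repos offer. Whether one matching 460.73.01 is available there is not something this sketch can confirm.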
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is a follow-up to the issue in https://github.com/ubuntu/microk8s/issues/2115
As described in that issue, version 1.21/beta of microk8s seems to work better for enabling gpu. However, the instructions that work on Ubuntu 20.04 on a g3.4xlarge instance on AWS don't work on an Nvidia DGX A100 machine.
sudo snap install microk8s --channel=1.21/beta --classic
microk8s enable gpu
I get the following pod in Init:CrashLoopBackOff:
I haven't been able to find useful information yet. Here's a kubectl describe and kubectl logs (with no logs):
There are no nvidia drivers or cuda packages installed on the machine and never were (fresh MAAS deployment):
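For a pod stuck in Init:CrashLoopBackOff like the one described above, a small filter over the pod list can help find which pods to inspect. This is a sketch under assumptions: failing_pods is a made-up name, and it expects the column layout of kubectl get pods -A --no-headers (namespace, name, ready, status, restarts, age). As a guess rather than a diagnosis, empty kubectl logs output may mean the main container was queried while the crashing one is an init container; kubectl logs accepts -c to select a container by name.

```shell
# Hypothetical helper: read "kubectl get pods -A --no-headers" output on stdin
# and print "namespace/name status" for every pod whose STATUS column is
# neither Running nor Completed.
failing_pods() {
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage on a real cluster (commented out; placeholders are not real names):
# microk8s kubectl get pods -A --no-headers | failing_pods
# microk8s kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.initContainers[*].name}'
# microk8s kubectl logs <pod> -n <namespace> -c <init-container>
```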