asm582 opened this issue 9 months ago
Hmm, this is unexpected. We just pull the MIG state directly from NVML (the underlying library that nvidia-smi
uses as well).
Is it possible that the plugin came online when only one was enabled and the other wasn't? The plugin doesn't do any real-time reconciliation of the GPU state -- the only way to get it to update is to restart the plugin.
So can you try restarting the plugin?
And if that doesn't work, can you try deleting the NAS object and then restarting the plugin? This shouldn't be necessary, but I'm curious whether it resolves the issue or not.
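As a concrete sketch of those two steps (the namespace and pod names are taken from the listings in this thread; the kubelet plugin is assumed to be managed by a DaemonSet, so deleting its pod triggers a fresh start that re-reads GPU state from NVML):

```shell
# Restart the kubelet plugin so it re-reads MIG state from NVML on startup.
# The pod name suffix (-fkz4d here) changes on every restart, so list first:
kubectl get pods -n nvidia-dra-driver
kubectl delete pod -n nvidia-dra-driver nvidia-k8s-dra-driver-kubelet-plugin-fkz4d

# If the stale state persists, delete the NAS object as well, then restart
# the plugin again so it republishes its allocatable devices from scratch:
kubectl delete nas/k8s-dra-driver-cluster-worker -n nvidia-dra-driver
```

After the restart, `kubectl describe nas/k8s-dra-driver-cluster-worker -n nvidia-dra-driver` should show the freshly discovered state.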
Thanks, I will delete the cluster, recreate it, and reinstall the DRA driver.
update: recreated KinD cluster and re-deployed the previously built driver image, but still no luck:
[root@nvd-srv-02 k8s-dra-driver]# kubectl get pods -n nvidia-dra-driver
NAME READY STATUS RESTARTS AGE
nvidia-k8s-dra-driver-controller-6d6b45756-47khb 1/1 Running 0 2m
nvidia-k8s-dra-driver-kubelet-plugin-fkz4d 1/1 Running 0 2m
Spec:
Allocatable Devices:
Gpu:
Architecture: Ampere
Brand: Nvidia
Cuda Compute Capability: 8.0
Index: 1
Memory Bytes: 85899345920
Mig Enabled: false
Product Name: NVIDIA A100 80GB PCIe
Uuid: GPU-713eebac-08df-c534-6c98-8d5055ca97a9
Gpu:
Architecture: Ampere
Brand: Nvidia
Cuda Compute Capability: 8.0
Index: 0
Memory Bytes: 85899345920
Mig Enabled: true
Product Name: NVIDIA A100 80GB PCIe
Uuid: GPU-1a9afbae-5932-54f8-c2c4-a863888d45bb
Could you confirm that running nvidia-smi in the kind worker node shows MIG as enabled?
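For a more targeted check, nvidia-smi's query interface can report the MIG mode per GPU directly (a sketch, run against the containerized worker node; the query fields are standard nvidia-smi options):

```shell
# Report current and pending MIG mode for every GPU on the node.
# "pending" differs from "current" until the mode change takes effect.
docker exec -ti k8s-dra-driver-cluster-worker \
  nvidia-smi --query-gpu=index,uuid,mig.mode.current,mig.mode.pending --format=csv
```

If both GPUs report `Enabled` here but the NAS object still shows one as `Mig Enabled: false`, the plugin's cached state is stale.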
Thanks, could you please recommend the container image that I should use to run the command?
Running:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0141a7534ebf kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1 "/usr/local/bin/entr…" 3 weeks ago Up 3 weeks 127.0.0.1:44521->6443/tcp k8s-dra-driver-cluster-control-plane
255a4db134af kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1 "/usr/local/bin/entr…" 3 weeks ago Up 3 weeks k8s-dra-driver-cluster-worker
shows the kind nodes created by the demo.
Running:
$ docker exec -ti k8s-dra-driver-cluster-worker nvidia-smi
is equivalent to running nvidia-smi on a Kubernetes node (in this case, the containerized kind worker node).
Thank you for sharing the command, below is the command output:
[root@nvd-srv-02 k8s-dra-driver]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b5609ebd1675 kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1 "/usr/local/bin/entr…" About a minute ago Up About a minute 127.0.0.1:34917->6443/tcp k8s-dra-driver-cluster-control-plane
5ef8b180a289 kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1 "/usr/local/bin/entr…" About a minute ago Up About a minute k8s-dra-driver-cluster-worker
[root@nvd-srv-02 k8s-dra-driver]# docker exec -ti k8s-dra-driver-cluster-worker nvidia-smi
Mon Dec 4 14:29:52 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:17:00.0 Off | On |
| N/A 36C P0 45W / 300W | 0MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe On | 00000000:65:00.0 Off | On |
| N/A 35C P0 46W / 300W | 0MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| No MIG devices found |
+---------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
[root@nvd-srv-02 k8s-dra-driver]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-dra-driver-cluster-control-plane Ready control-plane 2m36s v1.27.1
k8s-dra-driver-cluster-worker Ready <none> 2m12s v1.27.1
[root@nvd-srv-02 k8s-dra-driver]# kubectl describe nas/k8s-dra-driver-cluster-worker -n nvidia-dra-driver
Name: k8s-dra-driver-cluster-worker
Namespace: nvidia-dra-driver
Labels: <none>
Annotations: <none>
API Version: nas.gpu.resource.nvidia.com/v1alpha1
Kind: NodeAllocationState
Metadata:
Creation Timestamp: 2023-12-04T14:29:09Z
Generation: 4
Owner References:
API Version: v1
Kind: Node
Name: k8s-dra-driver-cluster-worker
UID: ddb095d1-a608-4f70-a7b2-bc55ad81ed4c
Resource Version: 587
UID: 863efe97-f965-4f42-9816-88e5fc3bb860
Spec:
Allocatable Devices:
Gpu:
Architecture: Ampere
Brand: Nvidia
Cuda Compute Capability: 8.0
Index: 0
Memory Bytes: 85899345920
Mig Enabled: true
Product Name: NVIDIA A100 80GB PCIe
Uuid: GPU-1a9afbae-5932-54f8-c2c4-a863888d45bb
Gpu:
Architecture: Ampere
Brand: Nvidia
Cuda Compute Capability: 8.0
Index: 1
Memory Bytes: 85899345920
Mig Enabled: false
Product Name: NVIDIA A100 80GB PCIe
Uuid: GPU-713eebac-08df-c534-6c98-8d5055ca97a9
Mig:
Parent Product Name: NVIDIA A100 80GB PCIe
Placements:
Size: 1
Start: 0
Size: 1
Start: 1
Size: 1
Start: 2
Size: 1
Start: 3
Size: 1
Start: 4
Size: 1
Start: 5
Size: 1
Start: 6
Profile: 1g.10gb+me
Mig:
Parent Product Name: NVIDIA A100 80GB PCIe
Placements:
Size: 2
Start: 0
Size: 2
Start: 2
Size: 2
Start: 4
Size: 2
Start: 6
Profile: 1g.20gb
Mig:
Parent Product Name: NVIDIA A100 80GB PCIe
Placements:
Size: 1
Start: 0
Size: 1
Start: 1
Size: 1
Start: 2
Size: 1
Start: 3
Size: 1
Start: 4
Size: 1
Start: 5
Size: 1
Start: 6
Profile: 1g.10gb
Mig:
Parent Product Name: NVIDIA A100 80GB PCIe
Placements:
Size: 2
Start: 0
Size: 2
Start: 2
Size: 2
Start: 4
Profile: 2g.20gb
Mig:
Parent Product Name: NVIDIA A100 80GB PCIe
Placements:
Size: 4
Start: 0
Size: 4
Start: 4
Profile: 3g.40gb
Mig:
Parent Product Name: NVIDIA A100 80GB PCIe
Placements:
Size: 4
Start: 0
Profile: 4g.40gb
Mig:
Parent Product Name: NVIDIA A100 80GB PCIe
Placements:
Size: 8
Start: 0
Profile: 7g.80gb
Status: Ready
Events: <none>
As seen above, nvidia-smi and the NAS object do not agree.
One thing to note is that docker exec -ti k8s-dra-driver-cluster-worker nvidia-smi takes a long time to execute, about 12 seconds.
Can this be closed?
I have enabled MIG mode on both GPUs on a single node, but the NAS object shows that one of the GPUs is not MIG-enabled:
output of nvidia-smi:
Can you please share how the NAS object can be updated correctly?