### 2. Issue or feature description
Hi, I'm trying to deploy the gpu-operator on my k8s cluster, whose vGPU node comes from vSphere (VMware ESXi 8). I want to use my vCS license (I have a DLS instance), so I'm following this document: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html

However, after the deployment I only see `Unlicensed` when running `nvidia-smi -q`, both in the workload pods and on the nodes (on the node itself, `nvidia-smi` is not even installed).

### 3. Steps to reproduce the issue
3. I have a DLS instance that serves vCS licenses. I generated the client_configuration_token.tok and confirmed that this token works fine with `FeatureType=4` set in gridd.conf in another, legacy k8s cluster.
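For completeness, the `licensing-config` ConfigMap was created along the lines of the linked document; a minimal sketch (file paths and the gridd.conf content are examples of the pattern, not my exact files):

```sh
# gridd.conf for the vCS attempt: FeatureType=4 selects NVIDIA Virtual Compute Server
cat > gridd.conf <<'EOF'
FeatureType=4
EOF

# Create the ConfigMap consumed by the driver container
# (two keys, matching the "configmap/licensing-config ... 2" entry in the listing below)
kubectl create configmap licensing-config \
  -n gpu-operator \
  --from-file=gridd.conf \
  --from-file=client_configuration_token.tok
```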
4. I deployed the gpu-operator following the document https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html (when building the driver container, I used the driver run file NVIDIA-Linux-x86_64-535.54.03-grid.run). The deployment looks fine (the Helm command is sketched after the listing below):
$ k -n gpu-operator get cm,deploy,statefulset,daemonset,pods
NAME                                                         DATA   AGE
configmap/default-gpu-clients                                1      92d
configmap/default-mig-parted-config                          1      92d
configmap/gpu-operator-node-feature-discovery-worker-conf    1      92d
configmap/kube-root-ca.crt                                   1      92d
configmap/licensing-config                                   2      78m

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpu-operator                                 1/1     1            1           92d
deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           92d

NAME                                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
daemonset.apps/gpu-feature-discovery                         1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   92d
daemonset.apps/gpu-operator-node-feature-discovery-worker    4         4         4       4            4                                                              92d
daemonset.apps/nvidia-container-toolkit-daemonset            1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       92d
daemonset.apps/nvidia-dcgm-exporter                          1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           92d
daemonset.apps/nvidia-device-plugin-daemonset                1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           92d
daemonset.apps/nvidia-driver-daemonset                       1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                  92d
daemonset.apps/nvidia-mig-manager                            0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             92d
daemonset.apps/nvidia-operator-validator                     1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true      92d

NAME                                                               READY   STATUS        RESTARTS   AGE
pod/gpu-feature-discovery-nvphm                                    1/1     Running       0          71m
pod/gpu-operator-6ddf8d789d-szqmq                                  1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-master-59b4b67f4f-r4fqw    1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-bk7tm               1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-d95vb               1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-j74sr               1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-s65pw               1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-x67jn               1/1     Terminating   0          92d
pod/nvidia-container-toolkit-daemonset-2wc8f                       1/1     Running       0          71m
pod/nvidia-cuda-validator-glwxb                                    0/1     Completed     0          63m
pod/nvidia-dcgm-exporter-z9txp                                     1/1     Running       0          71m
pod/nvidia-device-plugin-daemonset-n8z4r                           1/1     Running       0          71m
pod/nvidia-device-plugin-validator-sd6p7                           0/1     Completed     0          62m
pod/nvidia-driver-daemonset-hmcvq                                  1/1     Running       0          71m
pod/nvidia-operator-validator-g6msn                                1/1     Running       0          71m
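The install itself was a plain Helm deployment with the licensing ConfigMap and the pre-built vGPU driver image wired in; roughly like this (the registry is a placeholder and the exact values may differ slightly from what I ran):

```sh
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.repository=<my-private-registry>/nvidia \
  --set driver.version=535.54.03 \
  --set driver.licensingConfig.configMapName=licensing-config \
  --set driver.licensingConfig.nlsEnabled=true
```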
6. Then I started a workload pod that just runs `nvidia-smi -q` (manifest sketched further below), but in the pod logs I can only see `Unlicensed`:
vGPU Software Licensed Product
Product Name : NVIDIA Virtual Compute Server
License Status : Unlicensed (Restricted)
I also SSHed into the GPU node and tried to run `nvidia-smi -q` there, but the shell reported that `nvidia-smi` is not installed.
By the way, this is the output of `nvidia-smi` in the pod:
root@pod:/# nvidia-smi
Fri Dec 8 07:16:35 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 GRID A100D-40C On | 00000000:02:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
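The workload pod from step 6 is nothing special; a minimal sketch of what I applied (pod name and image tag are examples, not my exact values):

```sh
# Minimal test pod that requests one GPU, runs `nvidia-smi -q` once, and exits
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # example image
    command: ["nvidia-smi", "-q"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Then check the logs
kubectl logs nvidia-smi-test
```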
7. I realized that vGPU 16 may not support the vCS license system, so I re-created the `licensing-config` ConfigMap with a client_configuration_token.tok from an NVAIE DLS instance, set `FeatureType=1` in gridd.conf (these work in another k8s cluster), and then restarted all the gpu-operator pods. I still see the same output as in the previous step, which is pretty weird.
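The swap in step 7 was done roughly like this (assuming the same ConfigMap name; restarting the driver daemonset is what should make it pick up the new gridd.conf and token):

```sh
# Replace the licensing ConfigMap with the NVAIE token / FeatureType=1 variant
kubectl -n gpu-operator delete configmap licensing-config
kubectl create configmap licensing-config \
  -n gpu-operator \
  --from-file=gridd.conf \
  --from-file=client_configuration_token.tok

# Restart the driver daemonset so the new licensing files are used
kubectl -n gpu-operator rollout restart daemonset nvidia-driver-daemonset
```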
### 4. Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/) (optional if deemed irrelevant)
- [x] kubernetes pods status: `kubectl get pods -n OPERATOR_NAMESPACE`
  Provided above.
- [x] kubernetes daemonset status: `kubectl get ds -n OPERATOR_NAMESPACE`
  Provided above.
- [x] If a pod/ds is in an error state or pending state `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`
  N/A: no pod or daemonset is in an error or pending state.
- [x] If a pod/ds is in an error state or pending state `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`
  N/A: no pod or daemonset is in an error or pending state.
- [x] Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
➭ k -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- nvidia-smi
Fri Dec 8 08:03:50 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 GRID A100D-40C On | 00000000:02:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
➭ k -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- nvidia-smi -q
vGPU Software Licensed Product
Product Name : NVIDIA Virtual Compute Server
License Status : Unlicensed (Restricted)
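One more check I can run if useful: inspect what the driver container actually rendered. I assume the licensing files are mounted under /drivers inside nvidia-driver-ctr (that path is an assumption on my part, not something I verified against the docs):

```sh
# Inspect the gridd.conf and token as seen by the driver container
# (/drivers/... is an assumed mount path inside nvidia-driver-ctr)
kubectl -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- \
  cat /drivers/gridd.conf
kubectl -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- \
  ls /drivers/ClientConfigToken/
```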
**NOTE**: please refer to the [must-gather](https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh) script for the debug data collected.
This bundle can be submitted to us via email: **operator_feedback@nvidia.com**

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh