NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

How does vGPU get licensed when using gpu-operator? #628

Open yuzs2 opened 10 months ago

yuzs2 commented 10 months ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

Hi, I'm trying to deploy the gpu-operator on my k8s cluster, whose vGPU node comes from vSphere (VMware ESXi 8). I want to use my vCS license (I have a DLS instance), so I'm following this document: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html. However, after the deployment, the license status shows as Unlicensed when I run `nvidia-smi -q` in the workload pods and on the nodes (on the node, `nvidia-smi` is not even installed).
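
For context, the licensing setup in that document essentially packages gridd.conf and the client configuration token into a ConfigMap and points the operator's driver at it. A minimal sketch of that step, assuming the files are in the current directory, the NVIDIA Helm repo is added as `nvidia`, and the chart values `driver.licensingConfig.configMapName` / `driver.licensingConfig.nlsEnabled` as described in the operator documentation (adjust to the chart version you actually deployed):

    # package the license client files into the ConfigMap the driver daemonset reads
    kubectl create configmap licensing-config -n gpu-operator \
        --from-file=gridd.conf \
        --from-file=client_configuration_token.tok

    # reference it when installing/upgrading the chart
    # (settings for the private vGPU driver image are omitted here)
    helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
        --set driver.licensingConfig.configMapName=licensing-config \
        --set driver.licensingConfig.nlsEnabled=true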

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

  1. I have the 535.54.06 driver on my ESXi host:
    
    [root@exsi:~] nvidia-smi
    Fri Dec  8 07:35:02 2023       
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.54.06              Driver Version: 535.54.06    CUDA Version: N/A      |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A100X                   On  | 00000000:B5:00.0 Off |                    0 |
    | N/A   38C    P0              70W / 300W |  80896MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100X                   On  | 00000000:DE:00.0 Off |                    0 |
    | N/A   35C    P0              69W / 300W |  40448MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
  2. I have a DLS instance that serves a vCS license. I generated the client_configuration_token.tok and confirmed that this token works fine with `FeatureType=4` set in gridd.conf in another, legacy k8s cluster.
  3. I deployed the gpu-operator following the document https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html (when building the driver container, I used the driver NVIDIA-Linux-x86_64-535.54.03-grid.run). The deployment looks fine:

    $ k -n gpu-operator get cm,deploy,statefulset,daemonset,pods
    NAME                                                         DATA   AGE
    configmap/default-gpu-clients                                1      92d
    configmap/default-mig-parted-config                          1      92d
    configmap/gpu-operator-node-feature-discovery-worker-conf    1      92d
    configmap/kube-root-ca.crt                                   1      92d
    configmap/licensing-config                                   2      78m

    NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/gpu-operator                                 1/1     1            1           92d
    deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           92d

    NAME                                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
    daemonset.apps/gpu-feature-discovery                         1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   92d
    daemonset.apps/gpu-operator-node-feature-discovery-worker    4         4         4       4            4                                                              92d
    daemonset.apps/nvidia-container-toolkit-daemonset            1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       92d
    daemonset.apps/nvidia-dcgm-exporter                          1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           92d
    daemonset.apps/nvidia-device-plugin-daemonset                1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           92d
    daemonset.apps/nvidia-driver-daemonset                       1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                  92d
    daemonset.apps/nvidia-mig-manager                            0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             92d
    daemonset.apps/nvidia-operator-validator                     1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true      92d

    NAME                                                               READY   STATUS        RESTARTS   AGE
    pod/gpu-feature-discovery-nvphm                                    1/1     Running       0          71m
    pod/gpu-operator-6ddf8d789d-szqmq                                  1/1     Running       0          72m
    pod/gpu-operator-node-feature-discovery-master-59b4b67f4f-r4fqw    1/1     Running       0          72m
    pod/gpu-operator-node-feature-discovery-worker-bk7tm               1/1     Running       0          72m
    pod/gpu-operator-node-feature-discovery-worker-d95vb               1/1     Running       0          72m
    pod/gpu-operator-node-feature-discovery-worker-j74sr               1/1     Running       0          72m
    pod/gpu-operator-node-feature-discovery-worker-s65pw               1/1     Running       0          72m
    pod/gpu-operator-node-feature-discovery-worker-x67jn               1/1     Terminating   0          92d
    pod/nvidia-container-toolkit-daemonset-2wc8f                       1/1     Running       0          71m
    pod/nvidia-cuda-validator-glwxb                                    0/1     Completed     0          63m
    pod/nvidia-dcgm-exporter-z9txp                                     1/1     Running       0          71m
    pod/nvidia-device-plugin-daemonset-n8z4r                           1/1     Running       0          71m
    pod/nvidia-device-plugin-validator-sd6p7                           0/1     Completed     0          62m
    pod/nvidia-driver-daemonset-hmcvq                                  1/1     Running       0          71m
    pod/nvidia-operator-validator-g6msn                                1/1     Running       0          71m
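
Since the licensing behavior hinges on the contents of the `licensing-config` ConfigMap (DATA 2 above), a quick sanity check is to confirm it carries both keys the install document describes, i.e. `gridd.conf` and `client_configuration_token.tok` (ConfigMap name and namespace taken from the listing above):

    # should show two data keys: gridd.conf and client_configuration_token.tok
    kubectl -n gpu-operator describe configmap licensing-config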


  4. Then I started a workload pod that just runs `nvidia-smi -q`, but in the pod logs I can only see `Unlicensed`:

    vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Compute Server
        License Status                    : Unlicensed (Restricted)

I also SSHed into the GPU node and tried to run `nvidia-smi -q`, but it reported that `nvidia-smi` is not installed.
By the way, this is the output of `nvidia-smi` in the pod:

    root@pod:/# nvidia-smi
    Fri Dec  8 07:16:35 2023
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  GRID A100D-40C                 On  | 00000000:02:00.0 Off |                    0 |
    | N/A   N/A    P0              N/A / N/A  |      0MiB / 40960MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+


  5. I realized that vGPU 16 may not support the vCS license type, so I re-created the `licensing-config` ConfigMap with a client_configuration_token.tok from an NVAIE DLS instance, set `FeatureType=1` in gridd.conf (these work in another k8s cluster), and then restarted all the gpu-operator pods. But I can still see the same output as in the previous step, which is pretty weird. (A sketch of the re-created licensing config is shown after this list.)
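
For reference, a sketch of the re-created licensing config from the step above, assuming the new token file sits in the current directory (the daemonset name is taken from the listing earlier; deleting the driver pod instead of a rollout restart works as well):

    # minimal gridd.conf: FeatureType=1 selects the vGPU/NVAIE feature, FeatureType=4 selects vCS;
    # no ServerAddress is needed when a client configuration token is used
    cat > gridd.conf <<'EOF'
    FeatureType=1
    EOF

    # re-create the ConfigMap and restart the driver pods so nvidia-gridd re-reads it
    kubectl -n gpu-operator delete configmap licensing-config
    kubectl -n gpu-operator create configmap licensing-config \
        --from-file=gridd.conf \
        --from-file=client_configuration_token.tok
    kubectl -n gpu-operator rollout restart daemonset/nvidia-driver-daemonset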

4. Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/) (optional if deemed irrelevant)

 - [x] kubernetes pods status: `kubectl get pods -n OPERATOR_NAMESPACE` 
 provided above
 - [x] kubernetes daemonset status: `kubectl get ds -n OPERATOR_NAMESPACE`
 provided above
 - [x] If a pod/ds is in an error state or pending state `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`
 No
 - [x] If a pod/ds is in an error state or pending state `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`
 No
 - [x] Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`

    ➭ k -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- nvidia-smi
    Fri Dec  8 08:03:50 2023
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  GRID A100D-40C                 On  | 00000000:02:00.0 Off |                    0 |
    | N/A   N/A    P0              N/A / N/A  |      0MiB / 40960MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                             |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+

    ➭ k -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- nvidia-smi -q
    vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Compute Server
        License Status                    : Unlicensed (Restricted)

 - [x] containerd logs `journalctl -u containerd > containerd.log`

Collecting full debug bundle (optional):

    curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
    chmod +x must-gather.sh
    ./must-gather.sh

containerd.log


**NOTE**: please refer to the [must-gather](https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh) script for debug data collected.

This bundle can be submitted to us via email: **operator_feedback@nvidia.com**

shivamerla commented 10 months ago

@yuzs2 can you check for any errors from nvidia-gridd in dmesg: `dmesg | grep -i gridd`
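
A sketch of that check, assuming the driver daemonset pod name from the listing above and that `dmesg` is available inside the driver container image (otherwise run it directly on the GPU node):

    # look for nvidia-gridd licensing errors in the kernel log
    kubectl -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- \
        sh -c 'dmesg | grep -i gridd'

    # it can also help to confirm the gridd.conf that nvidia-gridd actually sees;
    # /etc/nvidia/gridd.conf is the standard vGPU guest location, though the
    # operator-managed driver container may stage it elsewhere
    kubectl -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- \
        cat /etc/nvidia/gridd.conf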