NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

Failed to acquire license from license server #1787

Closed: wanghm closed this issue 8 months ago

wanghm commented 9 months ago

Hello, we have three worker nodes with vGPU configured in the k8s cluster.

We configured the same client token under /etc/nvidia/ClientConfigToken and the same /etc/nvidia/gridd.conf on all three nodes. Two worker nodes successfully acquired a license from the DLS license server; one worker node failed to acquire a license.

The error message looks like this:

```
Oct 12 17:22:37 xxxxxxxxxx nvidia-gridd: Failed to update local trusted store - Maximum buffer size exceeded
Oct 12 17:22:37 xxxxxxxxxx nvidia-gridd: Failed to register client (2)
Oct 12 17:22:37 xxxxxxxxxx nvidia-gridd: Failed to acquire license from XXX.XXX.XXX.XXX
```

The config file /etc/nvidia/gridd.conf:

```
ServerAddress=xxx.xxx.xxx.xxx
ServerPort=443
BackupServerAddress=yyy.yyy.yyy.yyy
BackupServerPort=443
FeatureType=1
EnableUI=FALSE
LicenseInterval=1440
```

A valid SSL certificate is configured on the license servers.

What does the error message "Failed to update local trusted store - Maximum buffer size exceeded" mean, and how can we resolve it?

Environment details:

```
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
+-----------------------------------------------------------------------------+

$ nvidia-container-cli -V
cli-version: 1.14.2
lib-version: 1.14.2

$ nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.14.2
```
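For reference, per-node license state can be checked with a query like the one below (a sketch; the exact section and field names can vary by driver branch):

```
# On a vGPU guest, "nvidia-smi -q" includes a licensing section; a licensed
# node reports "Licensed" with an expiry, while the failing node does not.
nvidia-smi -q | grep -i -A 2 "license"
```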

Thanks a lot,

elezar commented 9 months ago

Are there any differences between the nodes?

Could you also describe how you are deploying the NVIDIA GPU driver and configuring the nodes for GPU access? Are you using the NVIDIA GPU Operator, for example?

cc @cdesiniotis @shivamerla since this is vGPU related.

shivamerla commented 9 months ago

After checking with the vGPU team, it looks like this is due to the 2 KB limit on the buffer used to store host network information in the vGPU 15.2 drivers. Can you confirm whether the worker node hitting the issue has a large number of network interfaces? (A quick way to count them is sketched below.) The team suggests using vGPU 15.3, which avoids this issue because the buffer has been increased to 4 KB.
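A sketch of the check; note this counts virtual interfaces too, such as the veth pairs that CNI plugins create, which is exactly the kind of thing that can inflate the host network information:

```
# Count every network interface the kernel reports on the worker node;
# a large number of veth/CNI interfaces can overflow a small fixed buffer.
ip -o link show | wc -l
```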

wanghm commented 8 months ago

There is no difference between the three nodes.

We installed the NVIDIA GPU driver manually with `sudo sh NVIDIA-Linux-x86_64-525.105.17-grid.run`.

We installed the NVIDIA Container Toolkit according to this guide (roughly the steps sketched below): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.14.1/index.html
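A sketch of the install and configure steps from that guide, assuming a Debian-based node with containerd as the runtime (the package repository setup described in the guide is omitted here):

```
# Install the toolkit packages, then configure containerd to use the NVIDIA
# runtime and restart it so the change takes effect.
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
```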

We are not using the NVIDIA GPU Operator in this cluster because of environment limitations: there is no private container registry, and proxy control is very strict (all outbound traffic needs to be whitelisted).

There is only one network interface on each worker node.

We cannot use the 15.3 driver, since that version is not yet certified by the host vendor. The host driver is 15.2.

Thank you very much,

wanghm commented 8 months ago

We successfully acquired the license after rebooting the node.
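For anyone hitting the same issue: it may be enough to restart just the licensing service and watch its log instead of rebooting the whole node (a minimal sketch, assuming the guest driver installs nvidia-gridd as a systemd unit):

```
# Restart the vGPU licensing daemon and follow its log to see whether the
# client registers and acquires a license from the DLS server.
sudo systemctl restart nvidia-gridd
sudo journalctl -u nvidia-gridd -f
```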

ztelliot commented 8 months ago

I have the same problem with the latest 16.1 driver, and the host is also a k8s node. After turning off kubelet auto-start to prevent CNI initialization (see the sketch below), the license can be acquired successfully. I think you should probably increase the buffer size further.
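The workaround, roughly (a sketch; drain the node first if it is serving workloads):

```
# Stop kubelet so no CNI interfaces are created, let nvidia-gridd acquire the
# license, then bring kubelet back up.
sudo systemctl disable --now kubelet
sudo systemctl restart nvidia-gridd
# ...wait for the license to be acquired in the nvidia-gridd log, then:
sudo systemctl enable --now kubelet
```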