NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Updating the node label for config-manager makes k8s-device-plugin print an error message continuously without stopping #669

Open aphrodite1028 opened 3 weeks ago

aphrodite1028 commented 3 weeks ago

If I use the nvidia.com/device-plugin.config node label to select a config, setting it to config0 and then changing it to config1 a few minutes later, k8s-device-plugin keeps printing the following message and does not stop:

health.go:142] Error waiting for event: ERROR_UNKNOWN; Marking all devices as unhealthy
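
The label change described above can be sketched roughly as follows. This is only an illustration of the reproduction steps, assuming `kubectl` is on the PATH and already pointed at the cluster; the node name `gpu-node-1` and the five-minute wait are hypothetical placeholders, not values from this report.

```python
import subprocess
import time

# Node label read by the device plugin's config-manager (named in this issue).
LABEL_KEY = "nvidia.com/device-plugin.config"


def label_command(node, config_name):
    """Build the kubectl argv that sets the config label on a node."""
    return [
        "kubectl", "label", "node", node,
        f"{LABEL_KEY}={config_name}",
        "--overwrite",  # allow replacing an existing label value
    ]


def apply_label(node, config_name):
    """Run kubectl and raise if the command fails."""
    subprocess.run(label_command(node, config_name), check=True)


if __name__ == "__main__":
    node = "gpu-node-1"  # hypothetical node name
    apply_label(node, "config0")
    time.sleep(300)  # assumed wait; the report only says "after minutes"
    apply_label(node, "config1")
```

After the second label change, the device plugin pod's logs are where the repeated `health.go` message would be expected to show up.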

I also found that GPU driver 470.129.06 does not support the set_default_device_pinned_mem_limit command. Is there a minimum GPU driver version required for the GPU memory limit? Also, is it possible to monitor the GPU utilization of each MPS client independently?

aphrodite1028 commented 2 weeks ago

Also, if we update the k8s-device-plugin version, for example from 0.15.0 to 0.16.0.rc, while some CUDA processes are already running on the host machine and in Docker, then after the nvidia-cuda-mps-control container restarts, nvidia-cuda-mps-server does not start. When I then try to run a CUDA process, I get an error like the one below:

CUDA failure 806: unrecognized error code ; GPU=0

After I remove all MPS client processes and deploy a new MPS client pod, nvidia-cuda-mps-server starts successfully.

@elezar, could you help me figure out what I need to do? Thanks.