Open aphrodite1028 opened 3 weeks ago
Also, if we upgrade the k8s-device-plugin version, for example from 0.15.0 to 0.16.0.rc, while some CUDA processes are already running on the host machine and in Docker, then after the nvidia-cuda-mps-control container restarts, nvidia-cuda-mps-server does not start. When I then try to run a CUDA workload, I get the error below:
CUDA failure 806: unrecognized error code ; GPU=0
After I remove all MPS client processes and deploy a new MPS client pod, nvidia-cuda-mps-server starts successfully.
@elezar, could you help me figure out what I need to do? Thanks.
If I use nvidia.com/device-plugin.config to set the config, setting config0 first and then switching to config1 a few minutes later,
k8s-device-plugin keeps printing the following message and does not stop:
health.go:142] Error waiting for event: ERROR_UNKNOWN; Marking all devices as unhealthy
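For anyone trying to reproduce this, here is a minimal sketch of the config switch described above. It assumes the config is selected through the `nvidia.com/device-plugin.config` node label and applied with `kubectl label`; the helper names and the node name are hypothetical, and the exact wait time between the two switches is not something I measured.

```python
import subprocess
from typing import List

# Node label that selects a named device-plugin config.
LABEL_KEY = "nvidia.com/device-plugin.config"


def label_command(node: str, config: str) -> List[str]:
    """Build the kubectl command that points `node` at the named config."""
    return ["kubectl", "label", "node", node, f"{LABEL_KEY}={config}", "--overwrite"]


def switch_config(node: str, config: str) -> None:
    """Apply the label; raises CalledProcessError if kubectl fails."""
    subprocess.run(label_command(node, config), check=True)
```

Calling `switch_config("gpu-node-1", "config0")`, waiting a few minutes, and then calling `switch_config("gpu-node-1", "config1")` should correspond to the sequence that triggers the repeated health-check message (`gpu-node-1` is a placeholder node name).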
I also found that GPU driver 470.129.06 does not support the set_default_device_pinned_mem_limit command. Is there a minimum GPU driver version required for limiting GPU memory?
And is it possible to monitor the GPU utilization of each MPS client independently?
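On the monitoring question, the closest thing I know of is `nvidia-smi pmon`, which prints per-process utilization columns. Below is a rough sketch that parses one sample of its output into dictionaries keyed by the header names. The function names are hypothetical, and I have not verified that the `sm` values are attributed to each MPS client separately when MPS is enabled, so treat this as a starting point rather than an answer.

```python
import subprocess
from typing import Dict, List


def parse_pmon(text: str) -> List[Dict[str, str]]:
    """Parse `nvidia-smi pmon` output into one dict per process row.

    The first line starting with '#' is taken as the column names;
    later '#' lines (units) are skipped.
    """
    columns = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        if columns is None:
            continue  # data before any header: ignore
        # Split into at most len(columns) fields so the last column keeps the rest.
        values = line.split(None, len(columns) - 1)
        if len(values) == len(columns):
            rows.append(dict(zip(columns, values)))
    return rows


def sample_once() -> List[Dict[str, str]]:
    """Take a single pmon sample (requires nvidia-smi on PATH)."""
    result = subprocess.run(
        ["nvidia-smi", "pmon", "-c", "1"],
        capture_output=True, text=True, check=True,
    )
    return parse_pmon(result.stdout)
```

Each row then exposes fields such as `pid`, `sm`, and `mem` as strings, with `-` where the driver reports no value.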