NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0
1.53k stars 264 forks

Adding volume mount to sandbox DP to support GPU healthcheck #727

Closed: visheshtanksale closed this 3 weeks ago

visheshtanksale commented 1 month ago

This PR updates the Sandbox Device Plugin DaemonSet with a volume mount to support the update to the GPU health check added here

Details of the health check update from the above-mentioned PR follow.

For a GPU configured as passthrough, the device plugin does not update the GPU count on the node when a GPU falls off the bus.

To reproduce, follow these steps:

Remove the GPU from the bus: echo "1" > /sys/bus/pci/devices//remove

Validate that the GPU is no longer visible from the host using lspci: lspci -nnk -d 10de:

Observe that the number of GPUs exposed on the k8s node does not change.

Watching the IOMMU groups under /dev/vfio generates an fsnotify event when the GPU falls off the bus, which the health check can use to detect the removal.