Closed: visheshtanksale closed this 3 weeks ago
This PR updates the Sandbox Device Plugin DaemonSet with a volume mount to support the update to the GPU health check added here.
Details of the health check update from the above-mentioned PR are below.
For a GPU configured as passthrough, the device plugin does not update the GPU count on the node when a GPU falls off the bus.
To reproduce, follow these steps:
Remove the GPU from the bus: echo "1" > /sys/bus/pci/devices//remove
Validate that the GPU is no longer visible from the host using lspci: lspci -nnk -d 10de:
The number of GPUs exposed on the k8s node doesn't change.
Watching the IOMMU groups under /dev/vfio produces an fsnotify event when the GPU falls off the bus, which the health check can use to detect the missing GPU.
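As a rough illustration of the idea, the sketch below detects a vanished IOMMU group by comparing two listings of /dev/vfio. It is not the code from the PR: it uses only the Go standard library and polls instead of using fsnotify, and the names listGroups and removedGroups are hypothetical. It also assumes that group nodes appear as numeric entries next to the vfio control node.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
	"time"
)

// listGroups returns the numeric entries found in dir. For /dev/vfio these
// are assumed to be the IOMMU group nodes; the "vfio" control node and any
// other non-numeric entry are skipped.
func listGroups(dir string) (map[string]bool, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	groups := make(map[string]bool)
	for _, e := range entries {
		if _, err := strconv.Atoi(e.Name()); err == nil {
			groups[e.Name()] = true
		}
	}
	return groups, nil
}

// removedGroups returns, in sorted order, the groups present in before
// but missing from after.
func removedGroups(before, after map[string]bool) []string {
	var gone []string
	for g := range before {
		if !after[g] {
			gone = append(gone, g)
		}
	}
	sort.Strings(gone)
	return gone
}

func main() {
	const dir = "/dev/vfio" // must be mounted into the container to be visible

	before, err := listGroups(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read", dir+":", err)
		return
	}

	// A single poll interval for illustration; a real health check would
	// loop, or react to fsnotify events instead of polling.
	time.Sleep(2 * time.Second)

	after, err := listGroups(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read", dir+":", err)
		return
	}

	for _, g := range removedGroups(before, after) {
		fmt.Println("IOMMU group disappeared:", g)
	}
}
```

A health check built along these lines could mark the devices in a vanished group as unhealthy so that the node's GPU count is updated. The directory has to be visible inside the plugin container for this to work, which is what the volume mount in this PR provides.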