NVIDIA / kubevirt-gpu-device-plugin

NVIDIA k8s device plugin for Kubevirt
BSD 3-Clause "New" or "Revised" License
233 stars, 67 forks

Updating healthcheck for passthrough GPUs #105

Closed: visheshtanksale closed this 5 months ago

visheshtanksale commented 6 months ago

For a GPU configured for passthrough, the device plugin does not update the GPU count on the node when the GPU falls off the bus.

To reproduce, follow these steps:

1. Remove the GPU from the bus: `echo "1" > /sys/bus/pci/devices/<gpu_pci_id>/remove`
2. Verify with lspci that the GPU is no longer visible from the host: `lspci -nnk -d 10de:`
3. Observe that the number of GPUs exposed on the Kubernetes node does not change.

This change watches the IOMMU groups under /dev/vfio: when the GPU falls off the bus, an fsnotify event is generated, so the health check can detect the removal.
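The detection idea can be sketched as follows. This is a minimal Go sketch, not the plugin's actual code: the `Device` type, `markRemovedGroup`, and `healthyCount` are hypothetical names, and it assumes each passthrough GPU is tracked together with its IOMMU group number. In practice the remove event would come from a watcher on /dev/vfio (for example the github.com/fsnotify/fsnotify library); here the removed path is passed in directly so the sketch needs only the standard library.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Health strings mirror those used by the Kubernetes device plugin API;
// they are redefined here so the sketch has no external dependencies.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a hypothetical, simplified record for one passthrough GPU.
type Device struct {
	ID         string // PCI address, e.g. "0000:3b:00.0"
	IOMMUGroup string // IOMMU group number, i.e. the N in /dev/vfio/N
	Health     string
}

// markRemovedGroup handles a remove event for the path /dev/vfio/<group>:
// every device in that IOMMU group is marked Unhealthy. It returns the IDs
// whose health changed, so the caller knows whether to re-advertise.
func markRemovedGroup(devices []*Device, removedPath string) []string {
	group := filepath.Base(removedPath)
	var changed []string
	for _, d := range devices {
		if d.IOMMUGroup == group && d.Health != Unhealthy {
			d.Health = Unhealthy
			changed = append(changed, d.ID)
		}
	}
	return changed
}

// healthyCount returns how many devices are still Healthy.
func healthyCount(devices []*Device) int {
	n := 0
	for _, d := range devices {
		if d.Health == Healthy {
			n++
		}
	}
	return n
}

func main() {
	devices := []*Device{
		{ID: "0000:3b:00.0", IOMMUGroup: "42", Health: Healthy},
		{ID: "0000:af:00.0", IOMMUGroup: "87", Health: Healthy},
	}
	// Simulate the event seen when /dev/vfio/42 disappears after the
	// GPU in that group is removed from the bus.
	changed := markRemovedGroup(devices, "/dev/vfio/42")
	fmt.Println("now unhealthy:", changed)
	fmt.Println("healthy GPUs:", healthyCount(devices))
}
```

Marking the device Unhealthy instead of dropping it keeps its ID stable. Under the Kubernetes device plugin API the kubelet counts only healthy devices as allocatable, so re-sending the updated device list is what would lower the node's allocatable GPU count.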

visheshtanksale commented 6 months ago

@rthallisey Please let me know how it looks

rthallisey commented 5 months ago

Looks fine. Thanks