This PR adds the ability to take faulty GPUs out of service by setting the variable `nvidia_drain_devices` at the host level. If the variable is defined, our `nvidia` role creates a boot-time service that passes the device to `nvidia-smi drain`. As a result, the device is no longer advertised as a CUDA device but remains visible to `lspci`: it is hidden from end-user programs, while an administrator can still run validation routines on it.
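
As a rough illustration of the host-level setting, a host variable file might look like the sketch below. The file name, the host name, and the value format (a list of PCI bus IDs) are assumptions made for this example and are not confirmed by the PR; the role's defaults are the authority on the expected format.

```yaml
# host_vars/gpu-node-01.yml  (hypothetical host name and file path)
# Assumption: the variable takes a list of PCI bus IDs. Check the nvidia
# role's defaults for the format it actually expects.
nvidia_drain_devices:
  - "0000:3b:00.0"  # illustrative PCI address of the faulty GPU
```

After the next boot, an administrator could check the result by comparing the output of `lspci` (the device should still be listed) with `nvidia-smi -L` (the device should no longer appear).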