IKIM-Essen / EMCP-config

IaC configuration of the Essen Medical Computing Platform (EMCP)
BSD 2-Clause "Simplified" License

Feature: drain selected GPUs on boot #140

Closed enasca closed 1 year ago

enasca commented 1 year ago

This PR adds the ability to take faulty GPUs out of service by setting the variable nvidia_drain_devices at the host level. If the variable is defined, our nvidia role creates a boot-time service which passes the selected devices to nvidia-smi drain. As a result, a drained device is no longer advertised as a CUDA device but is still visible to lspci: it is hidden from end-user programs, while an administrator can still run validation routines on it.
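
For illustration, the sketch below shows how a list of devices could be turned into nvidia-smi drain invocations. It is not the role's actual implementation, which is an Ansible role that creates a boot-time service. The function name build_drain_commands is hypothetical, and the sketch assumes that nvidia_drain_devices holds PCI addresses such as 0000:3b:00.0, which the description above does not confirm. It relies on the nvidia-smi drain options -p (PCI ID of the device) and -m 1 (enable the drain state), and it only builds the commands without running them.

```python
import re

# Assumption: devices are given as PCI addresses "domain:bus:device.function".
# The domain may be 4 hex digits (lspci style) or 8 (nvidia-smi style).
PCI_ADDRESS = re.compile(r"[0-9a-fA-F]{4,8}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]")


def build_drain_commands(devices):
    """Return one nvidia-smi argv list per device that should be drained.

    `devices` stands in for the value of nvidia_drain_devices; its real
    format in the role is an assumption of this sketch.
    """
    commands = []
    for device in devices:
        if not PCI_ADDRESS.fullmatch(device):
            raise ValueError(f"not a PCI address: {device!r}")
        # -p selects the GPU by PCI ID, -m 1 puts it into the drain state.
        commands.append(["nvidia-smi", "drain", "-p", device, "-m", "1"])
    return commands


if __name__ == "__main__":
    # Print the commands a boot-time service could run (root is required
    # to actually execute them).
    for argv in build_drain_commands(["0000:3b:00.0"]):
        print(" ".join(argv))
```

Validating each address before building the command keeps a typo in the host-level variable from reaching nvidia-smi; whether the role performs a similar check is not stated in this PR.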