== Motivation ==
Enable finer-grained tracking of GPU policy violations.
== Details ==
The current go-dcgm library exposes a way to listen for policy violations across all GPUs. This is useful, but it does not identify which GPUs are experiencing issues. Ideally, the policy violation itself would carry identifying GPU information, but today it does not appear to (struct definitions). As an alternative, users could listen for policy violations on groups they create for specific GPUs, which would tell them when those specific GPUs are experiencing issues.
This change exposes a new function, ListenForPolicyViolationsForGroup, which takes a GroupHandle passed by the user and listens for policy violations on that group. It also modifies ListenForPolicyViolations to call this new function with the group that contains all GPUs, so there is no net change in behavior.
Signed-off-by: sanjams2 sanjams2@users.noreply.github.com