NVIDIA / go-dcgm

Golang bindings for Nvidia Datacenter GPU Manager (DCGM)
Apache License 2.0

feat: expose function for listening to policy violations on a specific GPU group #73

Closed: sanjams2 closed this 2 months ago

sanjams2 commented 2 months ago

== Motivation ==

Enable finer-grained tracking of GPU policy violations.

== Details ==

The go-dcgm library currently exposes a way to listen for policy violations across all GPUs. While this is useful, it does not identify which GPUs are experiencing issues. Ideally, the policy violation itself would carry identifying GPU information, but today it does not appear to (see the struct definitions). Instead, it would be useful if users could listen for policy violations on groups they create for specific GPUs, so that they know when a specific GPU is experiencing issues.

This change exposes a new function, ListenForPolicyViolationsForGroup, which takes a GroupHandle supplied by the user and listens for policy violations on that group. It also changes ListenForPolicyViolations to call the new function with the group containing all GPUs, so there is no net change in behavior.

Signed-off-by: sanjams2 <sanjams2@users.noreply.github.com>

nvvfedorov commented 2 months ago

@sanjams2, thank you for the PR. Please sign your PR as described in https://github.com/NVIDIA/go-dcgm/blob/main/CONTRIBUTING.md.