NVIDIA / go-dcgm

Golang bindings for Nvidia Datacenter GPU Manager (DCGM)
Apache License 2.0

Add new API ListenForPolicyViolations to replace Policy #54

Status: Closed (dran-dev closed 10 months ago)

dran-dev commented 10 months ago

Summary of Changes

This pull request reworks the Policy API in go-dcgm, the Go bindings for the NVIDIA Data Center GPU Manager (DCGM) library. It deprecates the existing Policy API and introduces the ListenForPolicyViolations API. With the new API, a single call sets policies for all GPUs, so callers no longer configure each GPU separately. A single registration also delivers policy violations from all GPUs concurrently, which addresses the usability constraints of the old API and makes registration more efficient.

  1. Deprecation of Policy API:

    The existing Policy API is deprecated because it manages policies one GPU at a time, which is cumbersome on systems with multiple GPUs.

  2. Introduction of ListenForPolicyViolations API:

    The new API sets policies across all GPUs with a single call, streamlining configuration. Policy callbacks need to be registered only once during the program's lifetime, which simplifies integrating policy violation monitoring into applications.
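The "register once, receive from every GPU" shape described above can be sketched in plain Go. This is a minimal illustration only: it does not import go-dcgm, and `Violation`, `listenAll`, `countByGPU`, and `demoCounts` are hypothetical names invented for this sketch. The actual signature and event type of ListenForPolicyViolations may differ, so check the package documentation before relying on any detail here.

```go
package main

import (
	"context"
	"fmt"
)

// Violation is a hypothetical stand-in for the event type delivered by the
// bindings; the real struct in go-dcgm may carry different fields.
type Violation struct {
	GPU       uint
	Condition string
}

// listenAll is a hypothetical stand-in for an all-GPU listener: one
// registration, one channel, events from every GPU. Here the events are
// replayed from a slice instead of coming from DCGM.
func listenAll(ctx context.Context, events []Violation) <-chan Violation {
	out := make(chan Violation)
	go func() {
		defer close(out)
		for _, ev := range events {
			select {
			case out <- ev:
			case <-ctx.Done():
				return // caller cancelled; stop delivering
			}
		}
	}()
	return out
}

// countByGPU drains the channel and tallies violations per GPU.
func countByGPU(ch <-chan Violation) map[uint]int {
	counts := make(map[uint]int)
	for ev := range ch {
		counts[ev.GPU]++
	}
	return counts
}

// demoCounts wires the pieces together with three made-up events.
func demoCounts() map[uint]int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	events := []Violation{
		{GPU: 0, Condition: "XID"},
		{GPU: 1, Condition: "Thermal"},
		{GPU: 0, Condition: "DBE"},
	}
	return countByGPU(listenAll(ctx, events))
}

func main() {
	counts := demoCounts()
	fmt.Println(counts[0], counts[1]) // prints "2 1"
}
```

The consumer reads one channel regardless of how many GPUs are present, and cancelling the context is the only teardown step in this sketch.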

Context and Rationale

The decision to deprecate the Policy API and introduce ListenForPolicyViolations stems from usability constraints and from the recognition that monitoring policy violations one GPU at a time is not useful in most scenarios. The changes aim to make policy callback registration with the DCGM library easier to use and more efficient.

Deprecation Notice

Developers are advised to migrate from the deprecated Policy API to the new ListenForPolicyViolations API for improved functionality and to ensure compatibility with future releases.
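To make the migration concrete, the sketch below shows the kind of fan-in glue a caller might have written around a per-GPU API, which an all-GPU listener makes unnecessary. It is an assumption about typical caller code, not code from this repository: `Event`, `watchGPU`, `mergeWatches`, and `countMerged` are hypothetical, and `watchGPU` only simulates a per-GPU subscription. The `// Deprecated:` comment follows the standard Go doc-comment convention for marking deprecated identifiers.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical violation record used only for this sketch.
type Event struct {
	GPU  uint
	Kind string
}

// watchGPU simulates a per-GPU subscription: it emits one event per kind
// for a single GPU and then closes its channel.
//
// Deprecated: shown here only to illustrate the per-GPU pattern that an
// all-GPU listener replaces.
func watchGPU(gpu uint, kinds []string) <-chan Event {
	ch := make(chan Event)
	go func() {
		defer close(ch)
		for _, k := range kinds {
			ch <- Event{GPU: gpu, Kind: k}
		}
	}()
	return ch
}

// mergeWatches is the fan-in glue needed with a per-GPU API: one
// subscription per GPU, all forwarded into a single output channel.
func mergeWatches(gpus []uint, kinds []string) <-chan Event {
	out := make(chan Event)
	var wg sync.WaitGroup
	for _, gpu := range gpus {
		wg.Add(1)
		go func(ch <-chan Event) {
			defer wg.Done()
			for ev := range ch {
				out <- ev
			}
		}(watchGPU(gpu, kinds))
	}
	// Close the merged channel once every per-GPU forwarder has finished.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// countMerged drains the merged channel and returns the number of events.
func countMerged(gpus []uint, kinds []string) int {
	n := 0
	for range mergeWatches(gpus, kinds) {
		n++
	}
	return n
}

func main() {
	fmt.Println(countMerged([]uint{0, 1, 2}, []string{"XID", "DBE"})) // prints 6
}
```

After migrating, the loop over GPU IDs, the `sync.WaitGroup`, and the merge goroutines would be replaced by a single registration whose channel already carries events from all GPUs.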