NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

A question about DCGM policy listening for XID #162

Open BetaZYN opened 5 months ago

BetaZYN commented 5 months ago

If I register for XID events through a DCGM policy and listen, and a certain XID (for example, 79) occurs, will the policy keep reporting that XID until it recovers, or will it report it only once? I look forward to your reply.

nikkon-dev commented 5 months ago

@BetaZYN,

It depends on how you read the XIDs. Each XID event is stored with its timestamp, and there is an API to get either the latest value in the TSDB or all values since a specific timestamp. The dcgmi CLI tool uses only the last value in the TSDB, so an XID may look "sticky" until another XID is reported. If you use the API directly, you can get, for example, all XIDs that happened within the last minute.
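
The difference between reading only the latest value and reading everything since a timestamp can be sketched with a toy in-memory store. This is a minimal Python illustration, not DCGM code: `XidStore`, `record`, `latest`, and `since` are hypothetical names, the timestamps and XID numbers are made-up sample data, and the real DCGM API calls are not shown here.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class XidStore:
    """Toy stand-in for a per-GPU time series of XID events (hypothetical, not DCGM)."""
    timestamps: list = field(default_factory=list)
    xids: list = field(default_factory=list)

    def record(self, ts, xid):
        # Assumes events are recorded in increasing timestamp order.
        self.timestamps.append(ts)
        self.xids.append(xid)

    def latest(self):
        # Returns the most recent (timestamp, xid) pair, or None if nothing was recorded.
        if not self.xids:
            return None
        return (self.timestamps[-1], self.xids[-1])

    def since(self, ts):
        # Returns every (timestamp, xid) pair recorded strictly after ts.
        start = bisect_right(self.timestamps, ts)
        return list(zip(self.timestamps[start:], self.xids[start:]))


store = XidStore()
store.record(100, 31)   # sample event: XID 31 at t=100
store.record(160, 43)   # sample event: XID 43 at t=160

# A reader that polls latest() sees XID 43 on every poll, however much
# time passes, which is why the value looks "sticky".
last_seen = store.latest()

# A reader that remembers the last timestamp it processed sees each event
# once; polling again from t=160 yields no new events.
new_events = store.since(100)
nothing_new = store.since(160)
```

The design point is that "stickiness" is a property of the reader, not of the stored data: a consumer that tracks its own last-seen timestamp gets each event exactly once.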

The current DCGM version can't report XIDs 79, 119, and 120 due to limitations in the NVML library. Our team is working on a fix.

BetaZYN commented 5 months ago

@nikkon-dev, thank you for your reply.

  1. Which specific API are you referring to that returns the XID together with its timestamp?
  2. Our current use case is as follows:

       // set group 2 policy condition with XID errors
       dcgmi policy -g 2 --set 0,0 -x
       // register group 2 for policy updates
       dcgmi policy -g 2 --reg

     If a GPU generates an XID while we are listening, will this XID be reported repeatedly until a new XID appears or until this XID disappears?