NVIDIA / go-dcgm

Golang bindings for Nvidia Datacenter GPU Manager (DCGM)
Apache License 2.0
96 stars, 27 forks

DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 #72

Closed: haardm closed this issue 3 months ago

haardm commented 3 months ago

Hi team,

We are observing PCIe violations on P5 instances consistently from the time the instance is launched. We act on each violation by terminating and replacing the instance, but that is an expensive operation, both in time and because the instance type is an EC2 P5. Also, the threshold is a default value and cannot be configured by the client when subscribing to the policy.

A few asks:

  1. Is there an upstream fix from Nvidia that is planned?
  2. Is there any repercussion of temporarily not subscribing to this policy?
  3. What would go wrong if we let the PCIe errors keep happening silently?

glowkey commented 3 months ago

I believe this should be reposted in the DCGM project, as all the logic for this lives in the DCGM library and not in the Go bindings.