DCGM Policy Violation Notification channel reporting too many PCIe violations on P5

Hi team,

We are seeing PCIe policy violations on P5 instances consistently from the moment the instance is launched. We handle each violation by terminating and replacing the instance, but that is an expensive operation, both in time and in cost, since these are EC2 P5 instances. Also, the threshold is a default value that the client cannot configure when subscribing to the policy.

A few asks:

Is there an upstream fix planned by NVIDIA?
Are there any repercussions of temporarily not subscribing to this policy?
What would go wrong if we let the PCIe errors keep happening silently?

NVIDIA / go-dcgm

DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 #72