NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

Questions about EventType, EventData, and Xid #150

Closed by ruiwen-zhao 3 years ago

ruiwen-zhao commented 3 years ago

Hi,

I have some questions about the relations between NVML's event type, event data, and the Xid error codes. I am posting them here to see if someone might have the answers.

  1. When we register events for a device using RegisterEventForDevice, which Xid errors are covered by a given event type? Does the event type cover all Xid errors that share the bit mask? For example, when registering nvmlEventTypeXidCriticalError, which is 8, will we be listening to Xid codes 8, 9, 11, 12, 13, 24-31, etc.?

If so, then why does event type nvmlEventTypeDoubleBitEccError (0x0000000000000002LL) not cover Xid 48? And if not, is there any doc showing which Xid errors are covered by each event type?

  2. When we get an event from WaitForEvents, is the eventData the Xid code for the error? Will the eventData always be present, no matter which eventType it belongs to? I am asking because the API doc says eventData "Stores XID error for the device in the event of nvmlEventTypeXidCriticalError,."

guptaNswati commented 3 years ago

You can find more information about Xid errors here: https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4. You can register for multiple events for a given device by combining event types with the | operator. And yes, the event data is the Xid code for that event: https://docs.nvidia.com/deploy/nvml-api/structnvmlEventData__t.html#structnvmlEventData__t
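
As a rough illustration of the two points in this reply, here is a minimal Python sketch that does not call NVML at all. The flag values mirror the event-type constants in nvml.h; the Python constant names and the describe_event helper are hypothetical names made up for this example. It shows that each event type is a single bit flag (so the value 8 is not a range of Xid codes), that flags combine with |, and that the event data is read as an Xid code only when the Xid flag is set.

```python
# Event-type flags, with values as defined in nvml.h.
# Each one is a single bit in a mask, unrelated to Xid code numbering.
EVENT_SINGLE_BIT_ECC_ERROR = 0x1   # nvmlEventTypeSingleBitEccError
EVENT_DOUBLE_BIT_ECC_ERROR = 0x2   # nvmlEventTypeDoubleBitEccError
EVENT_PSTATE = 0x4                 # nvmlEventTypePState
EVENT_XID_CRITICAL_ERROR = 0x8     # nvmlEventTypeXidCriticalError
EVENT_CLOCK = 0x10                 # nvmlEventTypeClock


def describe_event(event_type, event_data):
    """Hypothetical helper: interpret an (eventType, eventData) pair.

    Assumes, per the API doc quoted above, that eventData carries the
    Xid code only for nvmlEventTypeXidCriticalError events.
    """
    if event_type & EVENT_XID_CRITICAL_ERROR:
        return f"Xid {event_data}"
    return f"non-Xid event 0x{event_type:x}"


# Register for several event types at once by OR-ing their flags.
mask = EVENT_XID_CRITICAL_ERROR | EVENT_DOUBLE_BIT_ECC_ERROR

print(hex(mask))
print(describe_event(EVENT_XID_CRITICAL_ERROR, 48))
print(describe_event(EVENT_DOUBLE_BIT_ECC_ERROR, 0))
```

In the real bindings, a combined mask like this would be the event-type argument passed at registration time; the exact call signature depends on the binding in use.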

ruiwen-zhao commented 3 years ago

Thanks @guptaNswati! I guess my question is more this: if I call RegisterEventForDevice with event type nvmlEventTypeXidCriticalError, will I get an event for any Xid error, or just for some of them? If it is just some of them, then which Xid errors are considered an XidCriticalError?

guptaNswati commented 3 years ago

Yes, any xid error will trigger this event.
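
If every Xid arrives through the same event type, as this reply states, then narrowing to particular codes would have to happen on the caller's side. The following is a minimal sketch of such a filter, assuming the caller already has the event type and event data as integers. The should_alert helper and the chosen set of codes are illustrative assumptions, not something NVML or this repository provides.

```python
XID_EVENT_FLAG = 0x8  # value of nvmlEventTypeXidCriticalError in nvml.h

# Illustrative choice only: the Xid codes this hypothetical caller cares about.
XIDS_OF_INTEREST = frozenset({48, 79})


def should_alert(event_type, event_data, xids=XIDS_OF_INTEREST):
    """Hypothetical filter: True only for Xid events whose code is in `xids`."""
    is_xid_event = bool(event_type & XID_EVENT_FLAG)
    return is_xid_event and event_data in xids


print(should_alert(XID_EVENT_FLAG, 48))
print(should_alert(XID_EVENT_FLAG, 13))
print(should_alert(0x2, 48))
```

The last call returns False because the event type is not an Xid event, so its event data is not treated as an Xid code.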

ruiwen-zhao commented 3 years ago

@guptaNswati Thanks for the clarification!