NVIDIA / go-dcgm

Golang bindings for Nvidia Datacenter GPU Manager (DCGM)
Apache License 2.0

Test injection events are not caught in embedded mode. #67

Closed by bingiflash 6 months ago

bingiflash commented 6 months ago

I'm trying to run this sample code on a GPU machine.

dcgm.Init(dcgm.Standalone, "localhost", "0")

With this initialization, when I run the injection with $ dcgmi test --inject --gpuid 0 -f 230 -v 64, the violations are caught:

$ ./health 
2024/05/10 23:57:38 Policy successfully set.
2024/05/10 23:57:38 Listening for violations...
2024/05/10 23:57:40 PolicyViolation      : XID Error
Timestamp  : 2024-05-10 23:57:41 +0000 UTC
Data       : {65}
2024/05/10 23:57:45 PolicyViolation      : XID Error
Timestamp  : 2024-05-10 23:57:46 +0000 UTC
Data       : {64}

dcgm.Init(dcgm.Embedded)

With this initialization, when I run the same injection with $ dcgmi test --inject --gpuid 0 -f 230 -v 64, no violations are reported:

$ ./health 
2024/05/10 23:57:38 Policy successfully set.
2024/05/10 23:57:38 Listening for violations...

I want to know whether this is an issue with the test tool, or whether embedded mode will also miss these events during real incidents.

Side note: I cannot use Standalone mode because I want to run this in a Docker container, and Standalone mode gives me some systemctl errors there.

nikkon-dev commented 6 months ago

@bingiflash,

It is not possible to inject errors through dcgmi when using the embedded hostengine, as dcgmi requires a connection to the hostengine, which is not exposed in embedded mode.

With the embedded hostengine, you must call the dcgmInjectEntityFieldValue DCGM API directly, but go-dcgm does not provide it.

bingiflash commented 6 months ago

Understood, thanks.