NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

How is NVLink information obtained? #103

Open irvingans opened 1 year ago

irvingans commented 1 year ago

https://github.com/NVIDIA/DCGM/blob/7e1012302679e4bb7496483b32dcffb56e528c92/dcgmlib/src/DcgmApi.cpp#L3515

Hi, regarding the function tsapiGetNvLinkLinkStatus: at a deeper level, how is the NVLink status obtained? Is it read from the GPU via the NVIDIA driver?

nikkon-dev commented 1 year ago

DCGM uses the NSCQ library, which must be installed separately from DCGM and is bound to the driver version.
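
Because NSCQ ships separately from DCGM, a quick first check is whether the dynamic linker can find it at all. The following is a minimal sketch, not part of DCGM: it assumes the shared library is installed under the base name `nvidia-nscq` (for example `libnvidia-nscq.so.*`), and the helper names `find_nscq` and `describe_nscq` are hypothetical. It does not verify that the installed NSCQ version matches the driver.

```python
import ctypes.util

# Assumption: the NSCQ shared library uses the base name "nvidia-nscq"
# (e.g. libnvidia-nscq.so.*). Adjust this for your distribution if needed.
NSCQ_BASENAME = "nvidia-nscq"


def find_nscq(basename=NSCQ_BASENAME):
    """Return the library name the system resolves for basename, or None."""
    return ctypes.util.find_library(basename)


def describe_nscq(found):
    """Turn the lookup result into a short human-readable message."""
    if found is None:
        return "NSCQ not found: install the package matching your driver version"
    return f"NSCQ found: {found}"


if __name__ == "__main__":
    print(describe_nscq(find_nscq()))
```

If the lookup returns nothing, installing the NSCQ package that corresponds to the installed driver version is the likely next step.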

irvingans commented 1 year ago

Thanks @nikkon-dev, one more question. How about the pulse test plugin? https://github.com/NVIDIA/DCGM/blob/7e1012302679e4bb7496483b32dcffb56e528c92/dcgmi/Diag.cpp#L642

Which library does the pulse test actually depend on?