NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Question about: reference implementation of DCGM + NCCL multi-node testing. #109

Closed dmonakhov closed 11 months ago

dmonakhov commented 11 months ago

Hi, the release notes for 5.2.3 (https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html#id3) contain the following note:

"Added a reference implementation of DCGM + NCCL multi-node testing."

However, I could not find it. Can you please point me to where it is located?

nikkon-dev commented 11 months ago

@dmonakhov,

The multi-node reference was added in https://github.com/NVIDIA/DCGM/pull/110

dmonakhov commented 11 months ago

Reopen this issue because the merged code is broken.