NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Multi host fixes #111

Closed dmonakhov closed 10 months ago

dmonakhov commented 11 months ago

PR https://github.com/NVIDIA/DCGM/pull/110 introduced the multi-host/health_check.py sample, but the merged code is broken in many places. Original issue: https://github.com/NVIDIA/DCGM/issues/109

dmonakhov commented 11 months ago

@dbeer ping, what do you think about my statement above? IMHO we should use large buffers by default in order to measure actual bandwidth rather than a mixed case, so that --min-alg-bandwidth and --min-bandwidth will have reasonable limits.
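To illustrate the point about buffer size, here is a minimal sketch assuming a simple "fixed latency plus size / peak bandwidth" transfer model. The function name `algorithm_bandwidth` and the constants `LATENCY_S` and `PEAK_BW` are hypothetical values chosen for illustration; they are not taken from health_check.py or from any real measurement.

```python
def algorithm_bandwidth(size_bytes, latency_s, peak_bw_bytes_per_s):
    """Effective bandwidth when each transfer pays a fixed latency cost.

    Assumed model: elapsed = latency + size / peak bandwidth.
    """
    elapsed_s = latency_s + size_bytes / peak_bw_bytes_per_s
    return size_bytes / elapsed_s


# Assumed, illustrative numbers (not measured on any real system).
LATENCY_S = 20e-6   # fixed per-transfer overhead, in seconds
PEAK_BW = 100e9     # peak link bandwidth, in bytes per second

if __name__ == "__main__":
    for size in (8 * 1024, 1024**2, 1024**3):
        bw = algorithm_bandwidth(size, LATENCY_S, PEAK_BW)
        print(f"{size:>12d} B -> {bw / 1e9:8.2f} GB/s")
```

Under this model, small buffers are dominated by the fixed latency term, so the reported bandwidth sits far below the peak, while large buffers approach it. A threshold tuned against a mix of sizes would therefore be hard to set meaningfully.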

dbeer commented 10 months ago

Thank you for the MR! Everything looks good now, but we just need you to sign your work. Can you do that for us? https://github.com/NVIDIA/DCGM/blob/master/docs/contributing.md#signing-your-work

dmonakhov commented 10 months ago

@dbeer ping, I've added the sign-off as you requested. Please merge if there are no questions left.

dbeer commented 10 months ago

Sorry, I didn't get a notification when you signed the MR. Thank you!