NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

How to monitor specific GPU #87

Closed LukeLIN-web closed 1 year ago

LukeLIN-web commented 1 year ago

The only command I know that works is:

dcgmi dmon -e 1002,1003,1005,1009,1010 -c 10 -d 10000 

But it shows the status of every GPU. How can I monitor a specific GPU?

By the way, what is the unit of the PCIe bandwidth value? Does 864034668 mean 864 MB/s of PCIe bandwidth?

nikkon-dev commented 1 year ago

@LukeLIN-web,

To select specific GPUs or MIG instances, pass the -i argument to the dcgmi command.
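As a minimal sketch of that suggestion, the command from the question can be restricted to a single GPU by adding -i. The GPU ID 0 here is only an assumed example; `dcgmi discovery -l` lists the IDs actually present on a host. The variable names are illustrative and not part of DCGM.

```shell
# Assumption: GPU ID 0 is the GPU of interest; check `dcgmi discovery -l` for real IDs.
gpu_ids="0"
# Field IDs and timing flags are copied unchanged from the original command.
fields="1002,1003,1005,1009,1010"
cmd="dcgmi dmon -i ${gpu_ids} -e ${fields} -c 10 -d 10000"

# Show the command that will be run.
echo "$cmd"

# Run it only on hosts where dcgmi is installed.
if command -v dcgmi >/dev/null 2>&1; then
    $cmd
fi
```

To watch several GPUs, -i should also accept a comma-separated list such as 0,1.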

The PCIe bandwidth is reported in bytes per second, so 864034668 is roughly 864 MB/s, as you guessed.
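For reference, the conversion can be sketched as below, using the sample value from the question. It assumes decimal megabytes (1 MB = 1,000,000 bytes); the binary MiB figure is included only for comparison.

```shell
# Sample value from the question, in bytes per second.
bytes_per_sec=864034668

# Decimal megabytes per second (1 MB = 1,000,000 bytes).
mb_per_sec=$(awk -v b="$bytes_per_sec" 'BEGIN { printf "%.1f", b / 1000000 }')
# Binary mebibytes per second (1 MiB = 1,048,576 bytes), for comparison.
mib_per_sec=$(awk -v b="$bytes_per_sec" 'BEGIN { printf "%.1f", b / 1048576 }')

echo "${mb_per_sec} MB/s (${mib_per_sec} MiB/s)"
```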

LukeLIN-web commented 1 year ago

Thank you for your reply!