NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

NVSwitch power #114

Open Mutinifni opened 10 months ago

Mutinifni commented 10 months ago

I'm attempting to measure NVSwitch power usage using DCGM on a DGX-A100 machine:

❯ dcgmi group -l
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 2 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 0              |                                                          |
|    -> Group ID    | 0                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_GPUS                                  |
|    -> Entities    | GPU 0, GPU 1, GPU 2, GPU 3, GPU 4, GPU 5, GPU 6, GPU 7   |
| -> 1              |                                                          |
|    -> Group ID    | 1                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_NVSWITCHES                            |
|    -> Entities    | Switch 12, Switch 10, Switch 9, Switch 11, Switch 8, Switch 13 |
+-------------------+----------------------------------------------------------+

❯ dcgmi dmon -g 1 -e 701,702,703,704
#Entity   SWVOLT                      SWCUR                       SCIDDQ                      SCDVDD
ID
Switch 13 N/A                         0                           3                           0
Switch 8  N/A                         0                           3                           0
Switch 11 N/A                         0                           3                           0
Switch 9  N/A                         0                           3                           0
Switch 10 N/A                         0                           3                           0

I have a couple of questions:

  1. The SWVOLT field always displays N/A, and the current fields (SWCUR, SCIDDQ, SCDVDD) never change. How can I get these fields to report live values?
  2. Which of the current fields is the correct one to use for the NVSwitch power calculation (V * I)?

Thanks!

glowkey commented 10 months ago

Unfortunately, SWVOLT is not supported on A100.

Mutinifni commented 10 months ago

Could you please let me know which GPUs it is supported on? Also, how would I obtain the power reading? (Q2)
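
For anyone wanting to script the V * I calculation from question 2 on hardware where both readings are populated, here is a minimal sketch that parses the text output of `dcgmi dmon` shown above and multiplies a voltage column by a current column. Several things here are assumptions that this thread does not confirm: that SWCUR is the right current field, that voltage is reported in millivolts and current in milliamps, and the helper names `parse_dmon` and `power_watts`, which are made up for illustration. On the DGX-A100 above it would simply return `None` for every switch, since SWVOLT reads N/A.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def _to_number(token: str) -> Optional[float]:
    """Convert a dmon cell to float; N/A or anything non-numeric becomes None."""
    try:
        return float(token)
    except ValueError:
        return None


def parse_dmon(text: str) -> Dict[str, Row]:
    """Parse `dcgmi dmon` text output into {entity: {column: value or None}}."""
    columns: List[str] = []
    readings: Dict[str, Row] = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "#Entity":
            columns = tokens[1:]  # short field names, e.g. SWVOLT, SWCUR
            continue
        # Skip the second header line ("ID") and anything that is not a data row.
        if len(tokens) != 2 + len(columns):
            continue
        entity = f"{tokens[0]} {tokens[1]}"  # e.g. "Switch 13"
        values = [_to_number(t) for t in tokens[2:]]
        readings[entity] = dict(zip(columns, values))
    return readings


def power_watts(row: Row, volt_col: str = "SWVOLT", cur_col: str = "SWCUR") -> Optional[float]:
    """Return V * I in watts, or None if either reading is unavailable.

    ASSUMPTION: voltage is in millivolts and current in milliamps.
    ASSUMPTION: SWCUR is the relevant current field; the thread leaves this open.
    """
    volts, amps = row.get(volt_col), row.get(cur_col)
    if volts is None or amps is None:
        return None
    return (volts / 1000.0) * (amps / 1000.0)


SAMPLE = """\
#Entity   SWVOLT   SWCUR   SCIDDQ   SCDVDD
ID
Switch 13 N/A      0       3        0
Switch 8  N/A      0       3        0
"""

if __name__ == "__main__":
    for entity, row in parse_dmon(SAMPLE).items():
        print(entity, power_watts(row))
```

Returning `None` rather than 0 when a reading is missing keeps "unsupported" distinguishable from "zero power", which matters here because the current columns in the output above do read 0.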