NVIDIA / go-dcgm

Golang bindings for Nvidia Datacenter GPU Manager (DCGM)

GPU ID always returns 0 when getting process info via `dcgm.GetProcessInfo(XXX)` #64

Open berkaroad opened 7 months ago

berkaroad commented 7 months ago

I ran benchmarks on 2 GPUs and compared the output of `./processInfo -pid 203639` with `nvidia-smi`.

The 'GPU ID' reported by `./processInfo -pid 203639` is GPU-0 for both entries, but `nvidia-smi` shows the process running on GPU-0 and GPU-1.

python3 ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --forward_only \
        --batch_size=16 \
        --model=resnet50  \
        --num_gpus=2 \
        --num_batches=500000 \
        --num_warmup_batches=10 \
        --data_name=imagenet \
        --allow_growth=True
root@k8s-node1:~/go-dcgm/samples/processInfo# ./processInfo -pid 203639
2024/04/07 11:51:51 Enabling DCGM watches to start collecting process stats. This may take a few seconds....
----------------------------------------------------------------------
GPU ID               : 0
----------Execution Stats---------------------------------------------
PID                          : 203639
Name                         : tf_cnn_benchmar
Start Time                   : 2024-04-03 20:29:37 +0800 CST
End Time                     : Running
----------Performance Stats-------------------------------------------
Energy Consumed (Joules)     : 0
Max GPU Memory Used (bytes)  : 5453643776
Avg SM Clock (MHz)           : 1590
Avg Memory Clock (MHz)       : 5000
Avg SM Utilization (%)       : 21
Avg Memory Utilization (%)   : 16
Avg PCIe Rx Bandwidth (MB)   : 9223372036854775792
Avg PCIe Tx Bandwidth (MB)   : 9223372036854775792
----------Event Stats-------------------------------------------------
Single Bit ECC Errors        : N/A
Double Bit ECC Errors        : N/A
Critical XID Errors          : 0
----------Slowdown Stats----------------------------------------------
Due to - Power (%)           : 0
       - Thermal (%)         : 0
       - Reliability (%)     : 9223372036854775792
       - Board Limit (%)     : 9223372036854775792
       - Low Utilization (%) : 9223372036854775792
       - Sync Boost (%)      : 0
----------Process Utilization-----------------------------------------
Avg SM Utilization (%)       : 48
Avg Memory Utilization (%)   : 38
----------------------------------------------------------------------
----------------------------------------------------------------------
GPU ID               : 0
----------Execution Stats---------------------------------------------
PID                          : 203639
Name                         : tf_cnn_benchmar
Start Time                   : 2024-04-03 20:29:37 +0800 CST
End Time                     : Running
----------Performance Stats-------------------------------------------
Energy Consumed (Joules)     : 0
Max GPU Memory Used (bytes)  : 227540992
Avg SM Clock (MHz)           : 585
Avg Memory Clock (MHz)       : 5000
Avg SM Utilization (%)       : N/A
Avg Memory Utilization (%)   : N/A
Avg PCIe Rx Bandwidth (MB)   : 9223372036854775792
Avg PCIe Tx Bandwidth (MB)   : 9223372036854775792
----------Event Stats-------------------------------------------------
Single Bit ECC Errors        : N/A
Double Bit ECC Errors        : N/A
Critical XID Errors          : 0
----------Slowdown Stats----------------------------------------------
Due to - Power (%)           : 0
       - Thermal (%)         : 0
       - Reliability (%)     : 9223372036854775792
       - Board Limit (%)     : 9223372036854775792
       - Low Utilization (%) : 9223372036854775792
       - Sync Boost (%)      : 0
----------Process Utilization-----------------------------------------
Avg SM Utilization (%)       : 0
Avg Memory Utilization (%)   : 0
----------------------------------------------------------------------
root@k8s-node1:~/go-dcgm/samples/processInfo# nvidia-smi 
Sun Apr  7 11:52:05 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:07.0 Off |                    0 |
| N/A   64C    P0    71W /  70W |   5204MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:08.0 Off |                    0 |
| N/A   43C    P0    27W /  70W |    220MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    203639      C   python3                          5201MiB |
|    1   N/A  N/A    203639      C   python3                           217MiB |
+-----------------------------------------------------------------------------+
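
For context, this is roughly how the samples/processInfo tool drives the binding. This is a sketch rather than the exact sample code; the flag handling and the 3-second wait before querying are assumptions:

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"time"

	"github.com/NVIDIA/go-dcgm/pkg/dcgm"
)

func main() {
	pid := flag.Uint("pid", 0, "PID of the GPU process to inspect")
	flag.Parse()

	cleanup, err := dcgm.Init(dcgm.Embedded)
	if err != nil {
		log.Fatal(err)
	}
	defer cleanup()

	// Ask DCGM to start watching per-process fields, then give it a
	// moment to collect samples before querying.
	group, err := dcgm.WatchPidFields()
	if err != nil {
		log.Fatal(err)
	}
	time.Sleep(3 * time.Second)

	// One ProcessInfo entry is returned per GPU the process ran on,
	// but (as shown above) the GPU field is the same in every entry.
	infos, err := dcgm.GetProcessInfo(group, *pid)
	if err != nil {
		log.Fatal(err)
	}
	for _, info := range infos {
		fmt.Printf("GPU ID: %d  PID: %d  Name: %s\n", info.GPU, info.PID, info.Name)
	}
}
```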
berkaroad commented 7 months ago

In `getProcessInfo` in pkg/dcgm/process_info.go:

        pInfo := ProcessInfo{
            GPU:                uint(pidInfo.summary.gpuId), // will always use same gpu
            PID:                uint(pidInfo.pid),
            Name:               name,
            ProcessUtilization: processUtil,
            PCI:                pci,
            Memory:             memory,
            GpuUtilization:     gpuUtil,
            Clocks:             clocks,
            Violations:         violations,
            XIDErrors:          xidErrs,
        }

Changing `uint(pidInfo.summary.gpuId)` to `uint(pidInfo.gpus[i].gpuId)` will fix it.
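
A minimal sketch of what the fixed loop could look like; the loop over `pidInfo.numGpus` and the cgo field names (the `dcgmPidInfo_t` layout) are assumed from the existing function body and the DCGM C API:

```go
// Inside getProcessInfo, after dcgmGetPidInfo has filled pidInfo
// (C.dcgmPidInfo_t). Helper values such as name, processUtil, pci, etc.
// are built exactly as in the current code and are omitted here.
for i := 0; i < int(pidInfo.numGpus); i++ {
	pInfo := ProcessInfo{
		GPU:  uint(pidInfo.gpus[i].gpuId), // per-GPU ID instead of pidInfo.summary.gpuId
		PID:  uint(pidInfo.pid),
		Name: name,
		// ...remaining fields unchanged (ProcessUtilization, PCI, Memory,
		// GpuUtilization, Clocks, Violations, XIDErrors)
	}
	processInfo = append(processInfo, pInfo)
}
```

With that change, each returned ProcessInfo entry carries the ID of the GPU it was measured on, matching the per-GPU rows reported by `nvidia-smi`.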