ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

Discrepancy between "rocm-smi" and "rocm-smi -a" #149

Open misos1 opened 6 months ago

misos1 commented 6 months ago

I noticed that "rocm-smi -a" does not report power consumption for some cards, while plain "rocm-smi" reports it for all of them. It is also a little strange that, with JSON output, one has to search for keys like "Average Graphics Package Power (W)" and "Current Socket Graphics Package Power (W)", and these key names change often. This is on Ubuntu 22.04.3 LTS with ROCm 6.0; everything was fine with an older version (something like 5.6 or 5.7, I think).
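As a possible workaround for the changing key names, here is a minimal sketch that looks up the power value by pattern instead of by exact key. It assumes the JSON output is an object keyed by card name (for example "card0") whose values map field names to strings. That shape, the `extract_power` helper, and the sample data are assumptions for illustration, not something confirmed in this thread.

```python
import json
import re

# Matches any key ending in "Package Power (W)", which covers both key
# names quoted above without hard-coding either one.
POWER_KEY = re.compile(r"Package Power \(W\)$")


def extract_power(smi_json: str) -> dict:
    """Return {card: watts as float, or None if missing or not numeric}.

    Hypothetical helper; the input shape is an assumption (see lead-in).
    """
    data = json.loads(smi_json)
    result = {}
    for card, fields in data.items():
        if not isinstance(fields, dict):
            continue  # skip any non-card entries
        watts = None
        for key, value in fields.items():
            if POWER_KEY.search(key):
                try:
                    watts = float(value)
                except (TypeError, ValueError):
                    watts = None  # e.g. "N/A (Secondary die)"
                break
        result[card] = watts
    return result


# Sample data modelled on the values in this report (assumed JSON layout).
sample = json.dumps({
    "card0": {"Average Graphics Package Power (W)": "N/A (Secondary die)"},
    "card1": {"Current Socket Graphics Package Power (W)": "4.0"},
    "card2": {"Current Socket Graphics Package Power (W)": "3.0"},
})
print(extract_power(sample))
```

Matching on the common suffix keeps a script working if the prefix of the key changes again, at the cost of silently returning None when no power key is present.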

Power section of "rocm-smi -a":
=================================== Power Consumption ====================================
GPU[0]      : Average Graphics Package Power (W): N/A (Secondary die)
GPU[1]      : Current Socket Graphics Package Power (W): 4.0
GPU[2]      : Current Socket Graphics Package Power (W): 3.0
Output of "rocm-smi":
======================================= ROCm System Management Interface =======================================
================================================= Concise Info =================================================
Device  [Model : Revision]    Temp    Power  Partitions      SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%  
        Name (20 chars)       (Edge)  (Avg)  (Mem, Compute)                                                     
================================================================================================================
0       [0x471e : 0xc8]       42.0°C  6.0W   N/A, N/A        0Mhz    96Mhz   0%      auto  303.0W    1%   0%    
        0x744c                                                                                                  
1       [RX Vega64 : 0xc0]    34.0°C  4.0W   N/A, N/A        852Mhz  167Mhz  14.51%  auto  264.0W    0%   0%    
        Vega 10 XL/XT [Radeo                                                                                    
2       [RX Vega64 : 0x00]    33.0°C  3.0W   N/A, N/A        852Mhz  167Mhz  14.51%  auto  220.0W    0%   0%    
        Vega 10 XTX [Radeon                                                                                     
================================================================================================================
============================================= End of ROCm SMI Log ==============================================
rakicaleksandar1999 commented 4 months ago

Look at the rocm_smi_lib/python_smi_tools/rocm_smi.py file. The showPower function prints N/A (Secondary die) when the checkIfSecondaryDie(device) condition is met, while the showAllConcise function doesn't consider that condition.
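To illustrate the difference, here is a toy model of the two code paths. It is not the actual rocm_smi.py code: the function bodies, the `POWER_READINGS` and `SECONDARY_DIES` tables, and the snake_case names are hypothetical stand-ins for showPower, showAllConcise, and checkIfSecondaryDie, with readings taken from the output in the report and GPU 0 assumed to be the device flagged as a secondary die.

```python
# Toy model only: stand-in data instead of real driver queries.
POWER_READINGS = {0: 6.0, 1: 4.0, 2: 3.0}  # watts, from the report above
SECONDARY_DIES = {0}  # assumption: GPU 0 is the one flagged as secondary


def check_if_secondary_die(device: int) -> bool:
    """Stand-in for checkIfSecondaryDie(device)."""
    return device in SECONDARY_DIES


def show_power(device: int) -> str:
    """Stand-in for the detailed path: guards on the secondary-die check."""
    if check_if_secondary_die(device):
        return "N/A (Secondary die)"
    return f"{POWER_READINGS[device]:.1f}"


def show_concise_power(device: int) -> str:
    """Stand-in for the concise path: reads power with no such guard."""
    return f"{POWER_READINGS[device]:.1f}W"


for dev in sorted(POWER_READINGS):
    print(dev, show_power(dev), show_concise_power(dev))
```

In this model the two paths agree for GPUs 1 and 2 and disagree only for GPU 0, which matches the pattern in the reported output.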

charis-poag-amd commented 4 months ago

Thank you for pointing this out. We have a fix incoming for the specific issue @rakicaleksandar1999 outlined.

I'll keep this open until the fix comes over to the develop branch.