ROCm / ROC-smi

ROC System Management Interface
https://github.com/RadeonOpenCompute/ROC-smi/blob/master/README.md

Rocm-smi not showing clocks on Radeon VII #55

Closed sebpuetz closed 5 years ago

sebpuetz commented 5 years ago

I'm on Linux Mint (Ubuntu 18.04 in disguise) with kernel 4.20. I installed rocm-dkms through apt, and rocm-smi doesn't fetch the clocks at all. Is this a known limitation, or did I make a mistake during installation? Edit: I forgot to mention that I'm on ROCm 2.1.

rocm-smi Output:

========================        ROCm System Management Interface        ========================
================================================================================================
GPU   Temp   AvgPwr   SCLK    MCLK    PCLK           Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
GPU[0]      : WARNING: Empty SysFS value: pclk
0     34.0c  20.0W    N/A     N/A     N/A            21.96%  auto    250.0W   0%        0%       0%       
================================================================================================
========================               End of ROCm SMI Log              ========================

rocm-smi -a Output:

========================        ROCm System Management Interface        ========================
================================================================================================
GPU[0]      : GPU ID: 0x66af
================================================================================================
================================================================================================
GPU[0]      : Temperature: 34.0c
================================================================================================
================================================================================================
GPU[0]      : WARNING: Empty SysFS value: pclk
GPU[0]      : WARNING: Empty SysFS value: pclk
GPU[0]      : Unable to determine current clocks. Check dmesg or GPU temperature
================================================================================================
GPU[0]      : Fan Level: 56 (21.96)%
================================================================================================
================================================================================================
GPU[0]      : Current Performance Level: auto
================================================================================================
================================================================================================
GPU[0]      : Current GPU OverDrive value: 0%
================================================================================================
================================================================================================
GPU[0]      : Current GPU Memory OverDrive value: 0%
================================================================================================
================================================================================================
GPU[0]      : Max Graphics Package Power: 250.0W
================================================================================================
================================================================================================
GPU[0]      : 
GPU[0]      : PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS UseRlcBusy MinActiveFreqType MinActiveFreq BoosterFreqType BoosterFreq PD_Data_limit_c PD_Data_error_coeff PD_Data_error_rate_coeff
GPU[0]      :  0 3D_FULL_SCREEN :
GPU[0]      :                     0(       GFXCLK)       0       1       2       0       4     800 4587520  -65536       0
GPU[0]      :                     1(       SOCCLK)       0       1       4     850       4     800  327680  -65536       0
GPU[0]      :                     2(         UCLK)       0       1       4     850       4     800  327680  -65536       0
GPU[0]      :                     3(         FCLK)       0       1       4     850       4     800  327680  -65536       0
GPU[0]      :  1   POWER_SAVING :
GPU[0]      :                     0(       GFXCLK)       0       0       1       0       3       0 5898240  -65536       0
GPU[0]      :                     1(       SOCCLK)       0       0       1       0       3       0 1310720   -6553       0
GPU[0]      :                     2(         UCLK)       0       0       1       0       3       0 1966080  -65536       0
GPU[0]      :                     3(         FCLK)       0       0       0       0       3     800 1966080   -6553       0
GPU[0]      :  2          VIDEO*:
GPU[0]      :                     0(       GFXCLK)       0       1       1       0       4     500 4587520   -6553       0
GPU[0]      :                     1(       SOCCLK)       0       0       1       0       4     500 1310720   -6553       0
GPU[0]      :                     2(         UCLK)       0       0       1       0       4     500 1966080  -65536       0
GPU[0]      :                     3(         FCLK)       0       0       3       0       4     500 1966080   -6553       0
GPU[0]      :  3             VR :
GPU[0]      :                     0(       GFXCLK)       0       1       0    1540       4     800 5898240   -6553   65536
GPU[0]      :                     1(       SOCCLK)       0       1       2       0       4     800  327680  -32768  -65536
GPU[0]      :                     2(         UCLK)       0       1       2       0       4     800  327680  -32768  -65536
GPU[0]      :                     3(         FCLK)       0       1       2       0       4     800  327680  -32768  -65536
GPU[0]      :  4        COMPUTE :
GPU[0]      :                     0(       GFXCLK)       0       1       0    1600       3       0 3932160  -65536  -65536
GPU[0]      :                     1(       SOCCLK)       0       0       4     850       3       0  327680  -65536  -32768
GPU[0]      :                     2(         UCLK)       0       0       4     850       3       0  327680  -65536  -32768
GPU[0]      :                     3(         FCLK)       0       0       4     850       3       0  327680  -65536  -32768
GPU[0]      :  5         CUSTOM :
GPU[0]      :                     0(       GFXCLK)       0       0       1       0       4     800 4587520  -65536       0
GPU[0]      :                     1(       SOCCLK)       0       0       1       0       4     800  327680   -6553       0
GPU[0]      :                     2(         UCLK)       0       0       1       0       4     800  327680  -65536       0
GPU[0]      :                     3(         FCLK)       0       0       0       0       4     800  327680   -6553       0
================================================================================================
================================================================================================
GPU[0]      : Average Graphics Package Power: 41.0W
================================================================================================
================================================================================================
GPU[0]      : Supported GPU clock frequencies on GPU0
GPU[0]      : 0: 701Mhz 
GPU[0]      : 1: 809Mhz 
GPU[0]      : 2: 1135Mhz 
GPU[0]      : 3: 1373Mhz 
GPU[0]      : 4: 1547Mhz 
GPU[0]      : 5: 1684Mhz 
GPU[0]      : 6: 1750Mhz 
GPU[0]      : 7: 1774Mhz 
GPU[0]      : 8: 1802Mhz 
GPU[0]      : 
GPU[0]      : Supported GPU Memory clock frequencies on GPU0
GPU[0]      : 0: 351Mhz 
GPU[0]      : 1: 801Mhz 
GPU[0]      : 2: 1001Mhz 
GPU[0]      : 
GPU[0]      : Supported PCIE clock frequencies on GPU0
GPU[0]      : 
================================================================================================
================================================================================================
GPU[0]      : Current GPU use: 16%
================================================================================================
================================================================================================
GPU[0]      : Cannot get PCIe bandwidth
================================================================================================
WARNING: One or more commands failed
========================               End of ROCm SMI Log              ========================

rocminfo

=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 2700X Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0                                  
  Queue Min Size:          0                                  
  Queue Max Size:          0                                  
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768KB                            
  Chip ID:                 0                                  
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):3700                               
  BDFID:                   0                                  
  Compute Unit:            16                                 
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    49448208KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    49448208KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx906                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128                                
  Queue Min Size:          4096                               
  Queue Max Size:          131072                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16KB                               
  Chip ID:                 26287                              
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):1802                               
  BDFID:                   10240                              
  Compute Unit:            60                                 
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64                                 
  Workgroup Max Size:      1024                               
  Workgroup Max Size Per Dimension:
    Dim[0]:                  67109888                           
    Dim[1]:                  671089664                          
    Dim[2]:                  0                                  
  Grid Max Size:           4294967295                         
  Waves Per CU:            40                                 
  Max Work-item Per CU:    2560                               
  Grid Max Size per Dimension:
    Dim[0]:                  4294967295                         
    Dim[1]:                  4294967295                         
    Dim[2]:                  4294967295                         
  Max number Of fbarriers Per Workgroup:32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64KB                               
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Dimension: 
        Dim[0]:                  67109888                           
        Dim[1]:                  1024                               
        Dim[2]:                  16777217                           
      Workgroup Max Size:      1024                               
      Grid Max Dimension:      
        x                        4294967295                         
        y                        4294967295                         
        z                        4294967295                         
      Grid Max Size:           4294967295                         
      FBarrier Max Size:       32                                 
*** Done ***       

Thanks in advance!

kentrussell commented 5 years ago

It looks like the DPM features aren't being enabled on your card, which is why sclk/mclk/pclk aren't being displayed (and why the supported clocks are printed, but without the current clock being marked). There has been a kernel fix that makes this situation a bit clearer. Can you attach your dmesg output so we can see if it gives any useful insight into what's going on? Thanks!
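For readers who want to look at the raw values behind those columns, here is a minimal Python sketch. It assumes the amdgpu driver exposes the clock levels in sysfs files named pp_dpm_sclk, pp_dpm_mclk and pp_dpm_pcie under /sys/class/drm/card0/device (the card index is an assumption and may differ), with the active level marked by a trailing asterisk. The helper name current_clock is made up for this illustration and is not part of rocm-smi.

```python
import re
from pathlib import Path

# Assumed sysfs location; the card index (card0) may differ on other systems.
DPM_DIR = Path("/sys/class/drm/card0/device")


def current_clock(pp_dpm_text):
    """Return the text of the level marked active with '*', or None if none is."""
    for line in pp_dpm_text.splitlines():
        # A level looks like "1: 809Mhz"; the active one carries a trailing "*".
        match = re.match(r"\s*\d+:\s*(.*?)\s*\*\s*$", line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    for name in ("pp_dpm_sclk", "pp_dpm_mclk", "pp_dpm_pcie"):
        try:
            text = (DPM_DIR / name).read_text()
        except OSError as err:
            print(f"{name}: unreadable ({err})")
            continue
        print(f"{name}: {current_clock(text) or 'no active level reported'}")
```

If a file lists levels but none is marked active, or the file is empty, that would be consistent with the "Empty SysFS value" and "Unable to determine current clocks" messages above, although the sketch does not tell you why.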

sebpuetz commented 5 years ago

Sure, I just attached the entire output as a txt, thanks! dmesg.txt

kentrussell commented 5 years ago

Can you try adding amdgpu.ppfeaturemask=0xffffffff to your kernel parameters? Either edit grub.cfg and add it to the vmlinuz line, or add it to the GRUB_CMDLINE_LINUX_DEFAULT string in your /etc/default/grub file. Then give it a reboot and see if the clocks are there. Vega20 doesn't have all of the PowerPlay features enabled by default, so this might be enough to give it a kick (since dmesg didn't show any failures or anything useful).
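After rebooting, one way to confirm that the parameter actually reached the kernel is to inspect the booted command line. The Python sketch below is only an illustration: cmdline_value is a hypothetical helper, and the /sys/module/amdgpu/parameters/ppfeaturemask path is assumed to exist only when the amdgpu module is loaded.

```python
from pathlib import Path

PARAM = "amdgpu.ppfeaturemask"


def cmdline_value(cmdline, param=PARAM):
    """Return the value given for param on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == param and sep:
            return value
    return None


if __name__ == "__main__":
    try:
        booted = Path("/proc/cmdline").read_text()
    except OSError as err:
        print(f"cannot read /proc/cmdline: {err}")
    else:
        print(f"{PARAM} on the booted command line: {cmdline_value(booted)}")

    # Assumed path: the loaded amdgpu module lists its parameters here.
    module_param = Path("/sys/module/amdgpu/parameters/ppfeaturemask")
    try:
        print(f"value seen by the driver: {module_param.read_text().strip()}")
    except OSError as err:
        print(f"cannot read {module_param}: {err}")
```

If the first line prints None, the edit to the boot configuration did not take effect, which usually means grub.cfg was not regenerated.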

sebpuetz commented 5 years ago

Hi, I tried adding the string in both places, but it doesn't seem to work after rebooting. I've attached the new dmesg output as a txt again.

rocm-smi -c

========================        ROCm System Management Interface        ========================
================================================================================================
GPU[0]      : WARNING: Empty SysFS value: pclk
GPU[0]      : WARNING: Empty SysFS value: pclk
GPU[0]      : Unable to determine current clocks. Check dmesg or GPU temperature
WARNING: One or more commands failed
========================               End of ROCm SMI Log              ========================

/etc/default/grub

# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.ppfeaturemask=0xffffffff"
GRUB_CMDLINE_LINUX=""

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

new_dmesg.txt

kentrussell commented 5 years ago

Sorry, I sent the message before getting some caffeine in my system. If you update /etc/default/grub, you'll need to run sudo update-grub to apply the settings. That way the change will end up in your grub.cfg file.

Normally GRUB_CMDLINE_LINUX_DEFAULT is set to "quiet splash", so if you removed the quiet splash part to add amdgpu.ppfeaturemask, you should put it back and append amdgpu.ppfeaturemask=0xffffffff to it, so the line looks like: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.ppfeaturemask=0xffffffff"

If you had already taken out quiet/splash before, that's alright. If not, leaving it out means you'll lose your splash screen and see a lot more info on your console during bootup.
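The edit described above can be sketched in Python as a pure text transformation. This is a hedged illustration only: add_kernel_param is a hypothetical helper, it assumes the variable is written on a single line with double quotes as in the file shown earlier, and it does not write to /etc/default/grub or run update-grub for you.

```python
import re


def add_kernel_param(grub_text, param):
    """Append param to GRUB_CMDLINE_LINUX_DEFAULT unless it is already present."""
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def rewrite(match):
        # Keep existing options such as "quiet splash" and avoid duplicates.
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return pattern.sub(rewrite, grub_text, count=1)


if __name__ == "__main__":
    before = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\nGRUB_CMDLINE_LINUX=""\n'
    print(add_kernel_param(before, "amdgpu.ppfeaturemask=0xffffffff"))
```

Because the function leaves existing options in place and skips a parameter that is already listed, running it twice on the same text gives the same result. The new setting still only takes effect after sudo update-grub and a reboot.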

sebpuetz commented 5 years ago

No worries, unfortunately

sudo update-grub
reboot

didn't make the clocks show up either. I disabled the splash intentionally, but thanks for the heads up!

kentrussell commented 5 years ago

I think we'll have something in the upcoming 2.2 release to help address this. I am trying to pull it in for the next batch of testing. The big issue is that while it will print the clocks, it doesn't explain why DPM appears to be disabled. @fxkamd, do you happen to have any insight?

sebpuetz commented 5 years ago

Hi, thanks for looking into this! I just switched to Ubuntu 18.04 from Linux Mint and everything is displayed correctly.

/opt/rocm/bin/rocm-smi

========================        ROCm System Management Interface        ========================
================================================================================================
GPU   Temp   AvgPwr   SCLK    MCLK    PCLK           Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
0     36.0c  20.0W    809Mhz  351Mhz  2.5GT/s, x16 80Mhz21.96%  auto    250.0W   0%        0%       0%       
================================================================================================
========================               End of ROCm SMI Log              ========================

kentrussell commented 5 years ago

Glad to hear it! I know that we've had some issues with Mint before, so at least things are working properly now! It's not "officially" supported, so I guess there's at least one thing in the kernel that changed that caused DPM to not load. But we've got a workaround (using an "officially supported" OS), so that's good. And I guess that means that ROCm doesn't magically work on Mint right now. Also good to know.

jlgreathouse commented 5 years ago

I suspect that the problem is that, when using Linux Mint, @sebpuetz was running kernel 4.20. 4.20 may not have had the Vega 20 DPM code merged in yet.

kentrussell commented 5 years ago

Closing this since it's resolved by using a "supported OS". Hopefully it works on Mint soon.