ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

[Issue]: rocm-smi --setpoweroverdrive does not allow lowering the power usage anymore #190

Open acaleechurn opened 2 months ago

acaleechurn commented 2 months ago

Problem Description

rocm-smi --setpoweroverdrive 200 no longer allows lowering the power usage. This worked on 6.1.2, before upgrading. Lowering the power cap reduced temperatures significantly with minimal impact on training times.

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD EPYC 7402P

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.2.0

ROCm Component

amdsmi, rocm_smi_lib

Steps to Reproduce

acaleechurn@svr-ph-ml01:~$ rocm-smi --setpoweroverdrive 200
============================ ROCm System Management Interface ============================
================================ Set GPU Power OverDrive =================================
ERROR: GPU[0] : Unable to set Power OverDrive
ERROR: GPU[0] : Value cannot be less than: 290W
ERROR: GPU[1] : Unable to set Power OverDrive
ERROR: GPU[1] : Value cannot be less than: 290W
ERROR: GPU[2] : Unable to set Power OverDrive
ERROR: GPU[2] : Value cannot be less than: 290W
================================== End of ROCm SMI Log ===================================

Cleaning up the install and running a multi-version install with the kernel-mode-driver from 6.1.2 works as expected. Upgrading to 6.2.0 breaks the functionality.

acaleechurn@svr-ph-ml01:~$ rocm-smi --setpoweroverdrive 200
============================ ROCm System Management Interface ============================
================================ Set GPU Power OverDrive =================================
GPU[0] : Successfully set power to: 200W
GPU[1] : Successfully set power to: 200W
GPU[2] : Successfully set power to: 200W
================================== End of ROCm SMI Log ===================================

NAME="Ubuntu"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
CPU:
model name : AMD EPYC 7402P 24-Core Processor
GPU:
Name: AMD EPYC 7402P 24-Core Processor
Marketing Name: AMD EPYC 7402P 24-Core Processor
Name: gfx908
Marketing Name: AMD Instinct MI100
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Name: gfx908
Marketing Name: AMD Instinct MI100
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Name: gfx908
Marketing Name: AMD Instinct MI100
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-

acaleechurn@svr-ph-ml01:~$ rocminfo --support
ROCk module version 6.7.0 is loaded
HSA System Attributes

Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.8.5 is loaded
HSA System Attributes

Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

HSA Agents

Agent 1

Name: AMD EPYC 7402P 24-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7402P 24-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
  L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2800
BDFID: 0
Internal Node ID: 0
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
  Pool 1
    Segment: GLOBAL; FLAGS: FINE GRAINED
    Size: 263781388(0xfb8fc0c) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:4KB
    Alloc Alignment: 4KB
    Accessible by all: TRUE
  Pool 2
    Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
    Size: 263781388(0xfb8fc0c) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:4KB
    Alloc Alignment: 4KB
    Accessible by all: TRUE
  Pool 3
    Segment: GLOBAL; FLAGS: COARSE GRAINED
    Size: 263781388(0xfb8fc0c) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:4KB
    Alloc Alignment: 4KB
    Accessible by all: TRUE
ISA Info:

Agent 2

Name: gfx908
Uuid: GPU-e336f877361f1399
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
  L1: 16(0x10) KB
  L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 2(0x2)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 35328
Internal Node ID: 1
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
  x 1024(0x400)
  y 1024(0x400)
  z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
  x 4294967295(0xffffffff)
  y 4294967295(0xffffffff)
  z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 67
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
  Pool 1
    Segment: GLOBAL; FLAGS: COARSE GRAINED
    Size: 33538048(0x1ffc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 2
    Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
    Size: 33538048(0x1ffc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 3
    Segment: GROUP
    Size: 64(0x40) KB
    Allocatable: FALSE
    Alloc Granule: 0KB
    Alloc Recommended Granule:0KB
    Alloc Alignment: 0KB
    Accessible by all: FALSE
ISA Info:
  ISA 1
    Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
    Machine Models: HSA_MACHINE_MODEL_LARGE
    Profiles: HSA_PROFILE_BASE
    Default Rounding Mode: NEAR
    Default Rounding Mode: NEAR
    Fast f16: TRUE
    Workgroup Max Size: 1024(0x400)
    Workgroup Max Size per Dimension:
      x 1024(0x400)
      y 1024(0x400)
      z 1024(0x400)
    Grid Max Size: 4294967295(0xffffffff)
    Grid Max Size per Dimension:
      x 4294967295(0xffffffff)
      y 4294967295(0xffffffff)
      z 4294967295(0xffffffff)
    FBarrier Max Size: 32

Agent 3

Name: gfx908
Uuid: GPU-c13bf411f2279689
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
  L1: 16(0x10) KB
  L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 2(0x2)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 17920
Internal Node ID: 2
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
  x 1024(0x400)
  y 1024(0x400)
  z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
  x 4294967295(0xffffffff)
  y 4294967295(0xffffffff)
  z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 67
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
  Pool 1
    Segment: GLOBAL; FLAGS: COARSE GRAINED
    Size: 33538048(0x1ffc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 2
    Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
    Size: 33538048(0x1ffc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 3
    Segment: GROUP
    Size: 64(0x40) KB
    Allocatable: FALSE
    Alloc Granule: 0KB
    Alloc Recommended Granule:0KB
    Alloc Alignment: 0KB
    Accessible by all: FALSE
ISA Info:
  ISA 1
    Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
    Machine Models: HSA_MACHINE_MODEL_LARGE
    Profiles: HSA_PROFILE_BASE
    Default Rounding Mode: NEAR
    Default Rounding Mode: NEAR
    Fast f16: TRUE
    Workgroup Max Size: 1024(0x400)
    Workgroup Max Size per Dimension:
      x 1024(0x400)
      y 1024(0x400)
      z 1024(0x400)
    Grid Max Size: 4294967295(0xffffffff)
    Grid Max Size per Dimension:
      x 4294967295(0xffffffff)
      y 4294967295(0xffffffff)
      z 4294967295(0xffffffff)
    FBarrier Max Size: 32

Agent 4

Name: gfx908
Uuid: GPU-34985558949eb94a
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 3
Device Type: GPU
Cache Info:
  L1: 16(0x10) KB
  L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 2(0x2)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 1280
Internal Node ID: 3
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
  x 1024(0x400)
  y 1024(0x400)
  z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
  x 4294967295(0xffffffff)
  y 4294967295(0xffffffff)
  z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 67
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
  Pool 1
    Segment: GLOBAL; FLAGS: COARSE GRAINED
    Size: 33538048(0x1ffc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 2
    Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
    Size: 33538048(0x1ffc000) KB
    Allocatable: TRUE
    Alloc Granule: 4KB
    Alloc Recommended Granule:2048KB
    Alloc Alignment: 4KB
    Accessible by all: FALSE
  Pool 3
    Segment: GROUP
    Size: 64(0x40) KB
    Allocatable: FALSE
    Alloc Granule: 0KB
    Alloc Recommended Granule:0KB
    Alloc Alignment: 0KB
    Accessible by all: FALSE
ISA Info:
  ISA 1
    Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
    Machine Models: HSA_MACHINE_MODEL_LARGE
    Profiles: HSA_PROFILE_BASE
    Default Rounding Mode: NEAR
    Default Rounding Mode: NEAR
    Fast f16: TRUE
    Workgroup Max Size: 1024(0x400)
    Workgroup Max Size per Dimension:
      x 1024(0x400)
      y 1024(0x400)
      z 1024(0x400)
    Grid Max Size: 4294967295(0xffffffff)
    Grid Max Size per Dimension:
      x 4294967295(0xffffffff)
      y 4294967295(0xffffffff)
      z 4294967295(0xffffffff)
    FBarrier Max Size: 32

Done

Additional Information

ROCm 6.2 with the kernel-mode driver from the same repo does not work, but the same install with the kernel-mode driver from 6.1.2 works as expected.
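
The two rocminfo outputs in this report differ in the loaded ROCk (kernel module) version: 6.7.0 in the first and 6.8.5 in the second. As a minimal sketch, assuming rocminfo output has been saved as text, the version line could be extracted in Python so the working and failing setups can be compared in a script. The function name `rock_module_version` is hypothetical and not part of any ROCm tool; the sketch also strips ANSI colour codes, which some captures of the output contain.

```python
import re

# Matches ANSI colour sequences such as ESC[37m and ESC[0m.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
# Matches the banner line printed at the top of rocminfo output.
VERSION_RE = re.compile(r"ROCk module version (\d+(?:\.\d+)*) is loaded")


def rock_module_version(rocminfo_text):
    """Return the ROCk module version as a tuple of ints, or None if absent.

    Hypothetical helper for illustration only.
    """
    clean = ANSI_RE.sub("", rocminfo_text)
    match = VERSION_RE.search(clean)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


if __name__ == "__main__":
    first = rock_module_version("ROCk module version 6.7.0 is loaded")
    second = rock_module_version("\x1b[37mROCk module version 6.8.5 is loaded\x1b[0m")
    print("first:", first, "second:", second, "differ:", first != second)
```

Returning a tuple of integers keeps comparisons numeric, so a version such as 6.10.0 would sort after 6.8.5.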

OpenMOSE commented 2 months ago

I have the same problem.

AMD MI100 x 2
AMD Ryzen 5700X3D

Ubuntu 22.04
ROCm 6.2.0

harkgill-amd commented 1 month ago

Hi @acaleechurn, could you please confirm if you are able to use the --setpoweroverdrive option for values greater than 290W?

acaleechurn commented 1 month ago

Hi @harkgill-amd

I have removed the kernel-mode driver from 6.1.2 and installed the one from 6.2.0, and I cannot set any value above or below the displayed one.

acaleechurn@svr-ph-ml01:~$ !308
/opt/rocm-6.2.0/bin/rocm-smi --setpoweroverdrive 150

============================ ROCm System Management Interface ============================
================================ Set GPU Power OverDrive =================================
ERROR: GPU[0] : Unable to set Power OverDrive
ERROR: GPU[0] : Value cannot be less than: 290W
ERROR: GPU[1] : Unable to set Power OverDrive
ERROR: GPU[1] : Value cannot be less than: 290W
ERROR: GPU[2] : Unable to set Power OverDrive
ERROR: GPU[2] : Value cannot be less than: 290W
================================== End of ROCm SMI Log ===================================

acaleechurn@svr-ph-ml01:~$ !309
/opt/rocm-6.2.0/bin/rocm-smi --setpoweroverdrive 295

============================ ROCm System Management Interface ============================
================================ Set GPU Power OverDrive =================================
ERROR: GPU[0] : Unable to set Power OverDrive
ERROR: GPU[0] : Value cannot be greater than: 290W
ERROR: GPU[1] : Unable to set Power OverDrive
ERROR: GPU[1] : Value cannot be greater than: 290W
ERROR: GPU[2] : Unable to set Power OverDrive
ERROR: GPU[2] : Value cannot be greater than: 290W
================================== End of ROCm SMI Log ===================================
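
Both error messages report the same bound, 290W, as the minimum and the maximum. One way to cross-check this outside rocm-smi is to read the hwmon files that the amdgpu driver normally exposes (`power1_cap`, `power1_cap_min`, `power1_cap_max`, all in microwatts). The Python sketch below assumes those files exist under `/sys/class/drm/card*/device/hwmon/hwmon*`, and that rocm-smi takes its limits from them, which this thread does not confirm. The function names are hypothetical and nothing here has been run on an MI100.

```python
from pathlib import Path


def read_power_cap_range(hwmon_dir):
    """Return (min_w, cap_w, max_w) in watts from one hwmon directory."""
    hwmon = Path(hwmon_dir)

    def watts(name):
        # hwmon power attributes hold integer microwatts.
        return int((hwmon / name).read_text().strip()) / 1_000_000

    return watts("power1_cap_min"), watts("power1_cap"), watts("power1_cap_max")


def describe(min_w, cap_w, max_w):
    """Summarise whether the driver leaves any room to change the cap."""
    if min_w == max_w:
        return f"range collapsed to {min_w:.0f}W: no other cap can be set"
    return f"cap {cap_w:.0f}W, adjustable between {min_w:.0f}W and {max_w:.0f}W"


if __name__ == "__main__":
    # Assumed sysfs layout; cards without these files are skipped.
    for hwmon in sorted(Path("/sys/class/drm").glob("card*/device/hwmon/hwmon*")):
        try:
            print(hwmon, "->", describe(*read_power_cap_range(hwmon)))
        except (OSError, ValueError):
            continue
```

If the minimum and maximum read back as equal on the 6.2.0 driver but not on the 6.1.2 driver, that would point at the kernel driver rather than at rocm-smi itself.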

harkgill-amd commented 1 month ago

Hi @acaleechurn, quick update: I was able to reproduce this issue internally on an MI100 system. I will continue to investigate and update this thread with relevant details.

OpenMOSE commented 2 weeks ago

Good day, do you have any update?

On ROCm 6.2.2 I still cannot change the power cap with --setpoweroverdrive.