ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

RSMI_STATUS_PERMISSION on rocm-smi --setmclk #117

Open sandrain opened 1 year ago

sandrain commented 1 year ago

I am trying to set the memory clock frequency using rocm-smi, and it fails with the RSMI_STATUS_PERMISSION error. The performance level was set to manual:

$ rocm-smi --showhw

======================= ROCm System Management Interface =======================
============================ Concise Hardware Info =============================
GPU  DID   GFX RAS  SDMA RAS  UMC RAS  VBIOS           BUS
0    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:31:00.0
1    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:34:00.0
2    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:11:00.0
3    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:14:00.0
4    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:AE:00.0
5    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:B3:00.0
6    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:8E:00.0
7    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:93:00.0
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --showclkfrq --showperflevel

======================= ROCm System Management Interface =======================
============================ Show Performance Level ============================
GPU[0]          : Performance Level: manual
================================================================================
========================= Supported clock frequencies ==========================
GPU[0]          :
GPU[0]          : Supported fclk frequencies on GPU0
GPU[0]          : 0: 0Mhz *
GPU[0]          :
GPU[0]          : Supported mclk frequencies on GPU0
GPU[0]          : 0: 400Mhz
GPU[0]          : 1: 700Mhz
GPU[0]          : 2: 1200Mhz
GPU[0]          : 3: 1600Mhz *
GPU[0]          :
GPU[0]          : Supported sclk frequencies on GPU0
GPU[0]          : 0: 500Mhz
GPU[0]          : 1: 1700Mhz *
GPU[0]          :
GPU[0]          : Supported socclk frequencies on GPU0
GPU[0]          : 0: 666Mhz
GPU[0]          : 1: 857Mhz
GPU[0]          : 2: 1000Mhz
GPU[0]          : 3: 1090Mhz *
GPU[0]          : 4: 1333Mhz
GPU[0]          :
--------------------------------------------------------------------------------
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --setmclk 2

======================= ROCm System Management Interface =======================
============================== Set mclk Frequency ==============================
ERROR: 4 GPU[0]:RSMI_STATUS_PERMISSION: The user ID of the calling process does not have sufficient permission to execute a command.  Often this is fixed by running as root (sudo).
ERROR: GPU[0]           : Unable to set mclk bitmask to: 0x4
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --setmclk 0

======================= ROCm System Management Interface =======================
============================== Set mclk Frequency ==============================
ERROR: 4 GPU[0]:RSMI_STATUS_PERMISSION: The user ID of the calling process does not have sufficient permission to execute a command.  Often this is fixed by running as root (sudo).
ERROR: GPU[0]           : Unable to set mclk bitmask to: 0x1
================================================================================
============================= End of ROCm SMI Log ==============================

I found that only sclk is configurable. Is this expected, or am I missing something? Thanks!
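For reference, the bitmask values in the error messages above follow the requested level index: `--setmclk 2` was reported as `0x4` and `--setmclk 0` as `0x1`, which is consistent with setting bit N for level N. Below is a minimal Python sketch of that mapping, assuming one bit per clock level. The function names are hypothetical and are not part of rocm-smi.

```python
def levels_to_bitmask(levels):
    """Build a frequency bitmask with bit N set for each level index N.

    Assumption: one bit per clock level, as the log output above suggests.
    """
    mask = 0
    for level in levels:
        if level < 0:
            raise ValueError("clock level index must be non-negative")
        mask |= 1 << level
    return mask


def bitmask_to_levels(mask):
    """Return the sorted level indices whose bits are set in the mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


print(hex(levels_to_bitmask([2])))  # 0x4, as in the --setmclk 2 error
print(hex(levels_to_bitmask([0])))  # 0x1, as in the --setmclk 0 error
print(bitmask_to_levels(0x4))       # [2]
```

This only decodes the value shown in the log; it does not explain why the set request was rejected.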

rakataprime commented 1 year ago

Did you set the feature mask and set the performance level to manual? For example:

rocm-smi --setperflevel manual
sudo rocm-smi --setvc 2 1701 915 --autorespond y
sudo rocm-smi --setsrange 808 1740 --autorespond y

sandrain commented 1 year ago

@rakataprime Thanks for your input. I've tried the feature mask, which I didn't set properly before. However, I still cannot change the memory clock frequency as I wish.

BTW, I found the following error when the amdgpu module is loaded (regardless of the kernel parameter ppfeature):

[   14.070181] ------------[ cut here ]------------
[   14.070182] RAS ERROR: unexpected block id 15
[   14.070285] WARNING: CPU: 0 PID: 5 at /var/lib/dkms/amdgpu/5.16.9.22.20-1447096~20.04/build/amd/amdgpu/amdgpu_ras.h:579 amdgpu_ras_feature_enable+0x1b4/0x210 [amdgpu]
[   14.070285] Modules linked in: crc32_pclmul hid_generic ib_uverbs ib_core amdgpu(OE+) amd_iommu_v2 amdttm(OE) amd_sched(OE) amdkcl(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci drm usbhid nvme libahci i2c_algo_bit hid i40e nvme_core i2c_piix4 wmi
[   14.070299] CPU: 0 PID: 5 Comm: kworker/0:0 Tainted: G           OE     5.4.0-109-generic #123-Ubuntu
[   14.070300] Hardware name: Supermicro AS -4124GQ-TNMI/H12DGQ-NT6, BIOS 2.4 08/23/2022
[   14.070309] Workqueue: events work_for_cpu_fn
[   14.070366] RIP: 0010:amdgpu_ras_feature_enable+0x1b4/0x210 [amdgpu]
[   14.070368] Code: d9 63 59 00 01 e8 6c 88 50 ea 0f 0b 45 31 ff e9 79 ff ff ff 44 89 fe 48 c7 c7 80 c1 c2 c0 c6 05 b9 63 59 00 01 e8 4c 88 50 ea <0f> 0b 45 31 ff e9 ba fe ff ff 48 c7 c7 f8 c1 c2 c0 c6 05 9b 63 59
[   14.070369] RSP: 0018:ffffab01c0287bb8 EFLAGS: 00010286
[   14.070370] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000001f2a
[   14.070371] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000247
[   14.070371] RBP: ffffab01c0287be8 R08: 0000000000001f2a R09: 0000000000000004
[   14.070372] R10: 0000000000000000 R11: 0000000000000001 R12: ffff94709362c400
[   14.070372] R13: ffff9470801e0000 R14: ffffffffc0ccda20 R15: 000000000000000f
[   14.070373] FS:  0000000000000000(0000) GS:ffff94710cc00000(0000) knlGS:0000000000000000
[   14.070373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.070374] CR2: 0000000000000000 CR3: 0000007d9c5a0005 CR4: 0000000000760ef0
[   14.070374] PKRU: 55555554
[   14.070374] Call Trace:
[   14.070431]  amdgpu_ras_feature_enable_on_boot+0x48/0xd0 [amdgpu]
[   14.070489]  ? sdma_v4_0_set_ecc_irq_state+0x61/0x70 [amdgpu]
[   14.070537]  amdgpu_ras_block_late_init+0x5c/0x1f0 [amdgpu]
[   14.070592]  ? amdgpu_irq_update+0x85/0xa0 [amdgpu]
[   14.070640]  ? amdgpu_irq_get+0x44/0x60 [amdgpu]
[   14.070691]  ? amdgpu_sdma_ras_late_init+0x7b/0xa0 [amdgpu]
[   14.070739]  amdgpu_ras_late_init+0x34/0x90 [amdgpu]
[   14.070787]  amdgpu_device_ip_late_init+0x7d/0x270 [amdgpu]
[   14.070867]  amdgpu_device_init.cold+0x16a3/0x1ea9 [amdgpu]
[   14.070873]  ? pci_read_config_word+0x27/0x40
[   14.070922]  amdgpu_driver_load_kms+0x1a/0x150 [amdgpu]
[   14.070970]  amdgpu_pci_probe+0x1ed/0x3f0 [amdgpu]
[   14.070975]  local_pci_probe+0x48/0x80
[   14.070976]  work_for_cpu_fn+0x1a/0x30
[   14.070978]  process_one_work+0x1eb/0x3b0
[   14.070979]  worker_thread+0x21e/0x400
[   14.070981]  kthread+0x104/0x140
[   14.070982]  ? process_one_work+0x3b0/0x3b0
[   14.070983]  ? kthread_park+0x90/0x90
[   14.070989]  ret_from_fork+0x22/0x40
[   14.070990] ---[ end trace 7be76cc2cca5f417 ]---

ppanchad-amd commented 3 months ago

@sandrain Apologies for the lack of response. Please check if your issue still exists with the latest ROCm 6.2. If not, please close the ticket. Thanks!

harkgill-amd commented 2 months ago

Hi @sandrain, I was not able to reproduce this issue and a few different fixes have been released for similar errors since this issue was first reported. As a result, I will close this issue for now. If you are still encountering this issue on ROCm 6.2, please leave a comment and I will re-open this ticket.

sandrain commented 2 months ago

Hi @harkgill-amd, thanks for your response. You may close the ticket. We cannot reproduce the problem anymore.

kulnaman commented 2 months ago

Hello, I am facing the same problem:

============================ ROCm System Management Interface ============================
============================== Version of System Component ===============================
Driver version: 6.8.5

=========================================== ID ===========================================
GPU[0] : Device Name: Instinct MI210
GPU[0] : Device ID: 0x740f
GPU[0] : Device Rev: 0x02
GPU[0] : Subsystem ID: 0x0c34
GPU[0] : GUID: 13566

======================================= Unique ID ========================================
GPU[0] : Unique ID: 0xd5a1afd4ec7820c1

========================================= VBIOS ==========================================
GPU[0] : VBIOS version: 113-D67301-059

====================================== Temperature =======================================
GPU[0] : Temperature (Sensor edge) (C): 35.0
GPU[0] : Temperature (Sensor junction) (C): 36.0
GPU[0] : Temperature (Sensor memory) (C): 48.0
GPU[0] : Temperature (Sensor HBM 0) (C): 46.0
GPU[0] : Temperature (Sensor HBM 1) (C): 44.0
GPU[0] : Temperature (Sensor HBM 2) (C): 48.0
GPU[0] : Temperature (Sensor HBM 3) (C): 45.0

=============================== Current clock frequencies ================================
GPU[0] : fclk clock level: 0: (400Mhz)
GPU[0] : mclk clock level: 3: (1600Mhz)
GPU[0] : sclk clock level: 0: (1700Mhz)
GPU[0] : socclk clock level: 3: (1090Mhz)

=================================== Current Fan Metric ===================================
GPU[0] : Not supported

================================= Show Performance Level =================================
GPU[0] : Performance Level: manual

==================================== OverDrive Level =====================================
GPU[0] : get_overdrive_level_sclk, Not supported on the given system

==================================== OverDrive Level =====================================
GPU[0] : get_mem_overdrive_level_mclk, Not supported on the given system

======================================= Power Cap ========================================
GPU[0] : Max Graphics Package Power (W): 300.0

================================== Show Power Profiles ===================================
GPU[0] : get_power_profiles, Not supported on the given system

=================================== Power Consumption ====================================
GPU[0] : Average Graphics Package Power (W): 60.0

============================== Supported clock frequencies ===============================
GPU[0] :
GPU[0] : Supported fclk frequencies on GPU0
GPU[0] : 0: 400Mhz
GPU[0] :
GPU[0] : Supported mclk frequencies on GPU0
GPU[0] : 0: 400Mhz
GPU[0] : 1: 700Mhz
GPU[0] : 2: 1200Mhz
GPU[0] : 3: 1600Mhz
GPU[0] :
GPU[0] : Supported sclk frequencies on GPU0
GPU[0] : 0: 1700Mhz
GPU[0] : 1: 1700Mhz
GPU[0] :
GPU[0] : Supported socclk frequencies on GPU0
GPU[0] : 0: 666Mhz
GPU[0] : 1: 857Mhz
GPU[0] : 2: 1000Mhz
GPU[0] : 3: 1090Mhz
GPU[0] : 4: 1333Mhz
GPU[0] :
GPU[0] :

==========================================================================================
=================================== % time GPU is busy ===================================
GPU[0] : GPU use (%): 0
GPU[0] : GFX Activity: 18250668

=================================== Current Memory Use ===================================
GPU[0] : GPU Memory Allocated (VRAM%): 0
GPU[0] : GPU Memory Read/Write Activity (%): 0
GPU[0] : Memory Activity: 5365780
GPU[0] : Avg. Memory Bandwidth: 0

===================================== Memory Vendor ======================================
GPU[0] : GPU memory vendor: hynix

================================== PCIe Replay Counter ===================================
GPU[0] : PCIe Replay Count: 0

===================================== Serial Number ======================================
GPU[0] : Serial Number: 692221000867

===================================== KFD Processes ======================================
No KFD PIDs currently running

================================== GPUs Indexed by PID ===================================
No KFD PIDs currently running

======================= GPU Memory clock frequencies and voltages ========================
GPU[0] : OD_SCLK:
GPU[0] : 0: 1700Mhz
GPU[0] : 1: 1700Mhz
GPU[0] : OD_MCLK:
GPU[0] : 0: 400Mhz
GPU[0] : 1: 1600Mhz

==================================== Current voltage =====================================
GPU[0] : Voltage (mV): 931

======================================= PCI Bus ID =======================================
GPU[0] : PCI Bus: 0000:27:00.0

================================== Firmware Information ==================================
GPU[0] : get_firmware_version_ASD, Not supported on the given system
GPU[0] : get_firmware_version_CE, Not supported on the given system
GPU[0] : get_firmware_version_DMCU, Not supported on the given system
GPU[0] : get_firmware_version_MC, Not supported on the given system
GPU[0] : get_firmware_version_ME, Not supported on the given system
GPU[0] : MEC firmware version: 83
GPU[0] : MEC2 firmware version: 83
GPU[0] : get_firmware_version_MES, Not supported on the given system
GPU[0] : get_firmware_version_MES KIQ, Not supported on the given system
GPU[0] : get_firmware_version_PFP, Not supported on the given system
GPU[0] : RLC firmware version: 17
GPU[0] : get_firmware_version_RLC SRLC, Not supported on the given system
GPU[0] : get_firmware_version_RLC SRLG, Not supported on the given system
GPU[0] : get_firmware_version_RLC SRLS, Not supported on the given system
GPU[0] : SDMA firmware version: 8
GPU[0] : SDMA2 firmware version: 8
GPU[0] : SMC firmware version: 00.68.60.00
GPU[0] : SOS firmware version: 0x00270082
GPU[0] : TA RAS firmware version: 27.00.01.60
GPU[0] : TA XGMI firmware version: 32.00.00.19
GPU[0] : get_firmware_version_UVD, Not supported on the given system
GPU[0] : get_firmware_version_VCE, Not supported on the given system
GPU[0] : VCN firmware version: 0x0110101c

====================================== Product Info ======================================
GPU[0] : Card Series: Instinct MI210
GPU[0] : Card Model: 0x740f
GPU[0] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: D67301
GPU[0] : Subsystem ID: 0x0c34
GPU[0] : Device Rev: 0x02
GPU[0] : Node ID: 2
GPU[0] : GUID: 13566
GPU[0] : GFX Version: gfx9010

======================================= Pages Info =======================================

================================= Show Valid sclk Range ==================================
GPU[0] : Valid sclk range: 1700Mhz - 1700Mhz

================================= Show Valid mclk Range ==================================
GPU[0] : Valid mclk range: 400Mhz - 1600Mhz

================================ Show Valid voltage Range ================================
ERROR: GPU[0] : Voltage curve regions unsupported.

================================== Voltage Curve Points ==================================
ERROR: GPU[0] : Voltage curve Points unsupported.

==================================== Consumed Energy =====================================
GPU[0] : Energy counter: 1497096717012
GPU[0] : Accumulated Energy (uJ): 22905580055832.14

=============================== Current Compute Partition ================================
GPU[0] : Not supported on the given system

================================ Current Memory Partition ================================
GPU[0] : Not supported on the given system

================================== End of ROCm SMI Log ===================================

and running:

sudo rocm-smi --setmclk 2

============================ ROCm System Management Interface ============================
=================================== Set mclk Frequency ===================================
GPU[0] : set_gpu_clk_freq_mclk, Permission denied
ERROR: GPU[0] : Unable to set mclk bitmask to: 0x4

================================== End of ROCm SMI Log ===================================
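Both reports in this thread end with the same "Unable to set mclk bitmask to" line, even though the preceding permission message is worded differently. Here is a minimal Python sketch for pulling that line out of captured rocm-smi output, for example when scripting across several GPUs. The helper name and the regular expression are assumptions based only on the log lines quoted in this thread, not on a documented output format.

```python
import re

# Pattern derived from the two log variants quoted in this thread, e.g.
#   "ERROR: GPU[0]           : Unable to set mclk bitmask to: 0x4"
#   "ERROR: GPU[0] : Unable to set mclk bitmask to: 0x4"
FAILED_SET = re.compile(
    r"GPU\[(?P<gpu>\d+)\]\s*:\s*Unable to set (?P<clock>\w+) bitmask to: "
    r"(?P<mask>0x[0-9a-fA-F]+)"
)


def find_failed_clock_sets(output):
    """Return (gpu_index, clock_name, bitmask) for each failed set request."""
    return [
        (int(m.group("gpu")), m.group("clock"), int(m.group("mask"), 16))
        for m in FAILED_SET.finditer(output)
    ]


sample = (
    "GPU[0] : set_gpu_clk_freq_mclk, Permission denied\n"
    "ERROR: GPU[0] : Unable to set mclk bitmask to: 0x4\n"
)
print(find_failed_clock_sets(sample))  # [(0, 'mclk', 4)]
```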

harkgill-amd commented 2 months ago

Hi @kulnaman, thanks for bringing this back to our attention. MI200/MI210 GPUs do not support changing MCLK; they operate on a single clock.

That said, the "set_gpu_clk_freq_mclk, Permission denied" error is misleading: it suggests a misconfiguration in user space, when in fact setmclk is simply unsupported. We are working on a fix that will make the error message clearer for users.
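Until such a fix is available, a wrapper script could translate the message itself. The following Python sketch is purely illustrative: the function, the lookup table, and the replacement wording are hypothetical, and the table contains only the MI210 device ID (0x740f) reported earlier in this thread. It is not how rocm-smi or the library handles the error.

```python
# Hypothetical lookup table. Only the MI210 device ID shown in this thread
# is listed; other devices without MCLK change support would need adding.
NO_MCLK_CHANGE_DEVICE_IDS = {0x740F}


def explain_setmclk_failure(device_id, raw_error):
    """Return a clearer message when a permission error hides missing support.

    Falls back to the raw error text for devices not in the table or for
    errors that do not mention permissions.
    """
    if device_id in NO_MCLK_CHANGE_DEVICE_IDS and "permission" in raw_error.lower():
        return (
            "setmclk is not supported on this device (it operates on a "
            "single clock); running as root will not change this."
        )
    return raw_error


print(explain_setmclk_failure(0x740F, "set_gpu_clk_freq_mclk, Permission denied"))
```

Devices missing from the table still get the original text, so a genuine permission problem is not masked.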