Open sandrain opened 1 year ago
Did you set the feature mask and the performance level to manual, like this?
rocm-smi --setperflevel manual
sudo rocm-smi --setvc 2 1701 915 --autorespond y
sudo rocm-smi --setsrange 808 1740 --autorespond y
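Before running the commands above, it may help to confirm that the feature mask actually took effect. The following is only a minimal sketch under two assumptions that are not stated in this thread: that the loaded amdgpu module exposes its mask at /sys/module/amdgpu/parameters/ppfeaturemask, and that the overdrive bit is 0x4000 (PP_OVERDRIVE_MASK in the upstream kernel headers). Please verify both against your driver version.

```python
from pathlib import Path

# Assumption: overdrive bit value as defined upstream (PP_OVERDRIVE_MASK).
PP_OVERDRIVE_MASK = 0x4000
# Assumption: where the loaded amdgpu module exposes its feature mask.
PARAM_PATH = Path("/sys/module/amdgpu/parameters/ppfeaturemask")


def parse_mask(text):
    """Parse the mask; base 0 accepts both '0x...' and plain decimal."""
    return int(text.strip(), 0)


def overdrive_enabled(mask):
    """Return True if the (assumed) overdrive bit is set in the mask."""
    return bool(mask & PP_OVERDRIVE_MASK)


if __name__ == "__main__":
    if PARAM_PATH.exists():
        mask = parse_mask(PARAM_PATH.read_text())
        print(f"ppfeaturemask = {mask:#x}, "
              f"overdrive bit set: {overdrive_enabled(mask)}")
    else:
        print(f"{PARAM_PATH} not found; is the amdgpu module loaded?")
```

If the overdrive bit is not set, the mask passed on the kernel command line probably did not reach the driver. Note that a correct mask does not by itself guarantee that a given clock is adjustable on a given GPU.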
@rakataprime Thanks for your input. I've tried the feature mask, which I didn't set properly before. However, I still cannot change the memory clock frequency as I wish.
BTW, I found the following error when the amdgpu module is loaded (regardless of the kernel parameter ppfeature):
[ 14.070181] ------------[ cut here ]------------
[ 14.070182] RAS ERROR: unexpected block id 15
[ 14.070285] WARNING: CPU: 0 PID: 5 at /var/lib/dkms/amdgpu/5.16.9.22.20-1447096~20.04/build/amd/amdgpu/amdgpu_ras.h:579 amdgpu_ras_feature_enable+0x1b4/0x210 [amdgpu]
[ 14.070285] Modules linked in: crc32_pclmul hid_generic ib_uverbs ib_core amdgpu(OE+) amd_iommu_v2 amdttm(OE) amd_sched(OE) amdkcl(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci drm usbhid nvme libahci i2c_algo_bit hid i40e nvme_core i2c_piix4 wmi
[ 14.070299] CPU: 0 PID: 5 Comm: kworker/0:0 Tainted: G OE 5.4.0-109-generic #123-Ubuntu
[ 14.070300] Hardware name: Supermicro AS -4124GQ-TNMI/H12DGQ-NT6, BIOS 2.4 08/23/2022
[ 14.070309] Workqueue: events work_for_cpu_fn
[ 14.070366] RIP: 0010:amdgpu_ras_feature_enable+0x1b4/0x210 [amdgpu]
[ 14.070368] Code: d9 63 59 00 01 e8 6c 88 50 ea 0f 0b 45 31 ff e9 79 ff ff ff 44 89 fe 48 c7 c7 80 c1 c2 c0 c6 05 b9 63 59 00 01 e8 4c 88 50 ea <0f> 0b 45 31 ff e9 ba fe ff ff 48 c7 c7 f8 c1 c2 c0 c6 05 9b 63 59
[ 14.070369] RSP: 0018:ffffab01c0287bb8 EFLAGS: 00010286
[ 14.070370] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000001f2a
[ 14.070371] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000247
[ 14.070371] RBP: ffffab01c0287be8 R08: 0000000000001f2a R09: 0000000000000004
[ 14.070372] R10: 0000000000000000 R11: 0000000000000001 R12: ffff94709362c400
[ 14.070372] R13: ffff9470801e0000 R14: ffffffffc0ccda20 R15: 000000000000000f
[ 14.070373] FS: 0000000000000000(0000) GS:ffff94710cc00000(0000) knlGS:0000000000000000
[ 14.070373] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.070374] CR2: 0000000000000000 CR3: 0000007d9c5a0005 CR4: 0000000000760ef0
[ 14.070374] PKRU: 55555554
[ 14.070374] Call Trace:
[ 14.070431] amdgpu_ras_feature_enable_on_boot+0x48/0xd0 [amdgpu]
[ 14.070489] ? sdma_v4_0_set_ecc_irq_state+0x61/0x70 [amdgpu]
[ 14.070537] amdgpu_ras_block_late_init+0x5c/0x1f0 [amdgpu]
[ 14.070592] ? amdgpu_irq_update+0x85/0xa0 [amdgpu]
[ 14.070640] ? amdgpu_irq_get+0x44/0x60 [amdgpu]
[ 14.070691] ? amdgpu_sdma_ras_late_init+0x7b/0xa0 [amdgpu]
[ 14.070739] amdgpu_ras_late_init+0x34/0x90 [amdgpu]
[ 14.070787] amdgpu_device_ip_late_init+0x7d/0x270 [amdgpu]
[ 14.070867] amdgpu_device_init.cold+0x16a3/0x1ea9 [amdgpu]
[ 14.070873] ? pci_read_config_word+0x27/0x40
[ 14.070922] amdgpu_driver_load_kms+0x1a/0x150 [amdgpu]
[ 14.070970] amdgpu_pci_probe+0x1ed/0x3f0 [amdgpu]
[ 14.070975] local_pci_probe+0x48/0x80
[ 14.070976] work_for_cpu_fn+0x1a/0x30
[ 14.070978] process_one_work+0x1eb/0x3b0
[ 14.070979] worker_thread+0x21e/0x400
[ 14.070981] kthread+0x104/0x140
[ 14.070982] ? process_one_work+0x3b0/0x3b0
[ 14.070983] ? kthread_park+0x90/0x90
[ 14.070989] ret_from_fork+0x22/0x40
[ 14.070990] ---[ end trace 7be76cc2cca5f417 ]---
@sandrain Apologies for the lack of response. Please check if your issue still exists with the latest ROCm 6.2. If not, please close the ticket. Thanks!
Hi @sandrain, I was not able to reproduce this issue and a few different fixes have been released for similar errors since this issue was first reported. As a result, I will close this issue for now. If you are still encountering this issue on ROCm 6.2, please leave a comment and I will re-open this ticket.
Hi @harkgill-amd, thanks for your response. You may close the ticket. We cannot reproduce the problem anymore.
Hello, I am facing the same problem,
rocm-smi --setperflevel manual
GPU details:
rocm-smi -a
================================== End of ROCm SMI Log ===================================
and running:
sudo rocm-smi --setmclk 2
================================== End of ROCm SMI Log ===================================
Hi @kulnaman, thanks for bringing this back to our attention. On MI200/MI210, changing MCLK is not supported; it only operates on a single clock.
Despite this, the set_gpu_clk_freq_mclk, Permission denied error is misleading, as it suggests a misconfiguration in user space rather than setmclk being unsupported. We are working on a fix that will make the propagated error message clearer for users.
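One way to see from user space whether a card exposes more than one memory clock is to read the pp_dpm_mclk file that amdgpu provides in sysfs. The sketch below is illustrative only: the card0 path is an assumption (the index differs between systems), the helper names are made up for this example, and the parser expects the usual "index: NNNMhz" lines with a trailing * on the active level.

```python
import re
from pathlib import Path

# Assumption: the GPU of interest is card0; adjust the index as needed.
MCLK_PATH = Path("/sys/class/drm/card0/device/pp_dpm_mclk")

# Matches lines such as "0: 1600Mhz *" (the '*' marks the active level).
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*mhz\s*(\*)?\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Return a list of (index, mhz, is_active) tuples from pp_dpm_* text."""
    levels = []
    for line in text.splitlines():
        m = LEVEL_RE.match(line)
        if m:
            levels.append((int(m.group(1)), int(m.group(2)),
                           m.group(3) is not None))
    return levels


def has_multiple_clocks(levels):
    """True if more than one distinct frequency is listed."""
    return len({mhz for _, mhz, _ in levels}) > 1


if __name__ == "__main__":
    if MCLK_PATH.exists():
        levels = parse_dpm_levels(MCLK_PATH.read_text())
        print(f"mclk levels: {levels}")
        print(f"more than one mclk level: {has_multiple_clocks(levels)}")
    else:
        print(f"{MCLK_PATH} not found")
```

A single listed level would be consistent with the explanation above, in which case rocm-smi --setmclk has nothing to switch between, whatever the error text says.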
I am trying to set the memory clock frequency using rocm-smi, and it fails with the RSMI_STATUS_PERMISSION error. The performance level was set to manual. I found that only sclk is configurable. Is this expected, or did I miss anything? Thanks!