ROCm / ROCK-Kernel-Driver

AMDGPU Driver with KFD used by the ROCm project. Also contains the current Linux Kernel that matches this base driver

[Issue]: Installing amdgpu-dkms 1:6.8.5.60200-2009582.24.04 on a Radeon Pro W7900 leads to a serious operating system crash related to the AMD GPU driver #170

Closed Alic-Li closed 3 days ago

Alic-Li commented 1 month ago

Problem Description

About five days ago I received an update from the Radeon repositories, so I updated amdgpu-dkms on my Ubuntu 22.04 system. Unfortunately, this update broke my operating system. The specific symptom is that I can almost never get into my GNOME desktop, and when I do manage to get in, the desktop flickers and I cannot open any software. I then switched to openSUSE and updated its amdgpu-dkms, and the KDE desktop showed the same problem. I backed up my data, reinstalled Ubuntu and openSUSE, and rebuilt the production environment, but I hit the problem again when installing the driver. I tried reinstalling the GNOME desktop, but that didn't work. When I reboot the operating system, I see what is shown in the photos below. I have reinstalled Ubuntu 22.04, Ubuntu 24.04, and openSUSE, but that did not solve the problem.

Operating System

Ubuntu 22.04.3 (Jammy Jellyfish) | openSUSE Tumbleweed | Ubuntu 24.04 LTS

CPU

Intel I3-12100 with UHD 730

GPU

AMD Radeon Pro W7900

ROCm Version

ROCm 6.2.0

ROCm Component

No response

Steps to Reproduce

sudo apt-get install amdgpu-dkms
sudo reboot

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

alic-li@alic-li-B660M-D2H-DDR4:~$ rocminfo
ROCk module version 6.8.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents
==========


Agent 1


Name: 12th Gen Intel(R) Core(TM) i3-12100
Uuid: CPU-XX
Marketing Name: 12th Gen Intel(R) Core(TM) i3-12100
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 49152(0xc000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4300
BDFID: 0
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65595952(0x3e8ea30) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65595952(0x3e8ea30) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65595952(0x3e8ea30) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx1100
Uuid: GPU-ed466fc6e51f9536
Marketing Name: AMD Radeon PRO W7900
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29768(0x7448)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1760
BDFID: 768
Internal Node ID: 1
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 232
SDMA engine uCode:: 21
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 47169536(0x2cfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 47169536(0x2cfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Done

Additional Information

[Attached images: Image_1723125222860, IMG20240812122528, IMG20240812123240, IMG20240815184139, ubuntu-neoftch]

Alic-Li commented 1 month ago

In addition, the fan speed of the W7900 still cannot be adjusted. I updated to the latest amdgpu-dkms driver and tried to adjust the speed from the tty5 console, but it still has no effect.

ppanchad-amd commented 3 weeks ago

@Alic-Li Internal ticket has been created to investigate this issue. Thanks!

kentrussell commented 3 weeks ago

Can you attach a full dmesg, ideally after trying to set the fan as well? That way we can see any issues during init and the display coming up, as well as the fan messages (if they appear)

Alic-Li commented 3 weeks ago

Hi kentrussell! Thanks for your reply! Here is the full dmesg output captured while setting the fan.

Alic-Li commented 3 weeks ago

alic-li@alic-li-B660M-D2H-DDR4:~$ sudo rocm-smi --setfan 255

============================ ROCm System Management Interface ============================

=================================== Set GPU Fan Speed ====================================

GPU[0] : Successfully set fan speed to level 255

==========================================================================================

================================== End of ROCm SMI Log ===================================

alic-li@alic-li-B660M-D2H-DDR4:~$ sudo dmesg | grep "amd"
[ 0.000000] Linux version 6.8.0-40-generic (buildd@lcy02-amd64-075) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #40-Ubuntu SMP PREEMPT_DYNAMIC Fri Jul 5 10:34:03 UTC 2024 (Ubuntu 6.8.0-40.40-generic 6.8.12)
[ 5.023750] amdkcl: loading out-of-tree module taints kernel.
[ 5.023753] amdkcl: module verification failed: signature and/or required key missing - tainting kernel
[ 6.851379] [drm] amdgpu kernel modesetting enabled.
[ 6.851382] [drm] amdgpu version: 6.8.5
[ 6.851484] amdgpu: Virtual CRAT table created for CPU
[ 6.851491] amdgpu: Topology: Add CPU node
[ 6.853136] amdgpu 0000:03:00.0: enabling device (0006 -> 0007)
[ 6.857377] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from VFCT
[ 6.857379] amdgpu: ATOM BIOS: 113-D7070100-138
[ 6.861132] amdgpu 0000:03:00.0: amdgpu: CP RS64 enable
[ 6.866216] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
[ 6.879932] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
[ 6.879935] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 6.879961] amdgpu 0000:03:00.0: amdgpu: MEM ECC is active.
[ 6.879962] amdgpu 0000:03:00.0: amdgpu: SRAM ECC is not presented.
[ 6.879968] amdgpu 0000:03:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[101] ras_mask[101]
[ 6.880056] amdgpu 0000:03:00.0: amdgpu: VRAM: 46064M 0x0000008000000000 - 0x0000008B3EFFFFFF (46064M used)
[ 6.880058] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
[ 6.880122] [drm] amdgpu: 46064M of VRAM memory ready
[ 6.880124] [drm] amdgpu: 32029M of GTT memory ready.
[ 6.957150] amdgpu 0000:03:00.0: amdgpu: reserve 0x1300000 from 0x8b3c000000 for PSP TMR
[ 7.097529] amdgpu 0000:03:00.0: amdgpu: GECC is enabled
[ 7.114392] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 7.114396] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 7.114433] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x004e7e00 (78.126.0)
[ 7.114443] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[ 7.280358] amdgpu 0000:03:00.0: amdgpu: SMU is initialized successfully!
[ 7.546739] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 7.810808] amdgpu: HMM registered 46064MB device memory
[ 7.811871] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 7.811882] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[ 7.811909] amdgpu: Virtual CRAT table created for GPU
[ 7.812031] amdgpu: Topology: Add dGPU node [0x7448:0x1002]
[ 7.812032] kfd kfd: amdgpu: added device 1002:7448
[ 7.812043] amdgpu 0000:03:00.0: amdgpu: SE 6, SH per SE 2, CU per SH 8, active_cu_number 96
[ 7.812046] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 7.812047] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 7.812048] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 7.812048] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 7.812049] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 7.812050] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 7.812050] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 7.812051] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 7.812051] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 7.812052] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 7.812052] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 7.812053] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 7.812054] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[ 7.812054] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8
[ 7.812055] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
[ 7.815559] amdgpu 0000:03:00.0: amdgpu: Using BAMACO for runtime pm
[ 7.815855] [drm] Initialized amdgpu 3.58.0 20150101 for 0000:03:00.0 on minor 1
[ 7.822019] fbcon: amdgpudrmfb (fb0) is primary device
[ 7.822022] amdgpu 0000:03:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[ 10.563346] amdgpu 0000:03:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[ 10.646171] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[ 465.781001] amdgpu: manual fan speed control should be enabled first
[ 545.327324] amdgpu: manual fan speed control should be enabled first
[ 592.080974] amdgpu: manual fan speed control should be enabled first

I set it three times
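
For anyone reproducing this, the two signals in the log above (the rejected fan writes and the SMU interface-version lines) can be pulled out of a long dmesg with a small filter. This is only a rough sketch: the function names are made up for illustration, and the match strings are copied from the messages in this log, so other driver versions may word them differently.

```shell
# Count how often the kernel rejected a fan-speed write because manual
# control was not enabled. Reads dmesg text on stdin and prints a number.
count_fan_rejections() {
    grep -c 'manual fan speed control should be enabled first' || true
}

# Print the SMU interface-version lines (driver vs. firmware) from stdin.
smu_version_lines() {
    grep -i 'smu driver if version' || true
}

# Usage on a live system (needs permission to read the kernel log):
#   sudo dmesg | count_fan_rejections
#   sudo dmesg | smu_version_lines
```

Run against the log in this comment, the first filter would report the three rejected attempts.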

kentrussell commented 3 weeks ago

So I don't see anything obvious for the flickering screen there, but I am less of a graphics guy (the internal ticket should be able to make progress there). For the fan, if you try to just set the fan speed to manual without setting a value, does it stay as "auto"? You can do it manually via:
cd /sys/class/drm/card0/device/hwmon
Then cd into the hwmon subdirectory there (on my test machine it's hwmon2, but it depends on your system config)
cat ./pwm1_enable (Manual=1, Auto=2, off=0)
Then try to set it to manual with:
echo 1|sudo tee ./pwm1_enable
Then verify it with:
cat ./pwm1_enable

If it stays at 2, it is likely that the firmware isn't actually changing the setting (and isn't giving us an error to say why). The internal ticket should be able to verify that pretty quickly. If it does change to 1, then maybe there's a bug in the SMI where it's not setting fan control to manual before trying to change the speed
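
The manual check described above can be sketched as a small script. This is a hedged sketch, not an official tool: the function names are hypothetical, the device path is passed in because the card index differs between machines, and the mode names follow the values given in this comment (Manual=1, Auto=2, off=0).

```shell
# Find the first hwmon directory under a DRM device that exposes pwm1_enable.
# $1: device directory, e.g. /sys/class/drm/card0/device
find_fan_hwmon() {
    for d in "$1"/hwmon/hwmon*; do
        if [ -e "$d/pwm1_enable" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    return 1
}

# Translate a pwm1_enable value into the names used in this thread.
fan_mode_name() {
    case "$1" in
        0) echo off ;;
        1) echo manual ;;
        2) echo auto ;;
        *) echo unknown ;;
    esac
}

# Request manual mode, then read the value back to see whether it stuck.
# Needs root on a real system. Returns 0 only if the mode reads back as 1.
try_manual_fan_mode() {
    hw=$(find_fan_hwmon "$1") || { echo "no pwm1_enable found under $1" >&2; return 2; }
    echo 1 > "$hw/pwm1_enable" || echo "write to pwm1_enable failed" >&2
    mode=$(cat "$hw/pwm1_enable")
    echo "pwm1_enable reads back as $mode ($(fan_mode_name "$mode"))"
    [ "$mode" = "1" ]
}

# Usage (as root): try_manual_fan_mode /sys/class/drm/card0/device
```

A non-zero return with a read-back of 2 corresponds to the "stays at auto" case described above.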

Alic-Li commented 3 weeks ago

By the way, I finally figured out why installing amdgpu-dkms 1:6.8.5.60200-2009582.24.04 on the Radeon Pro W7900 led to the serious AMD GPU driver crash. After I repaired my operating system, I tried reinstalling amdgpu-dkms, but that didn't work. However, when I installed the AMD open-source GPU driver over it, the miracle happened: the GNOME desktop environment relies on the AMD open-source GPU driver. So I solved this problem myself. I think this might give you a clue toward a fix. Maybe installing amdgpu-dkms affects the system's original driver.

Alic-Li commented 3 weeks ago
sudo apt install amdgpu amdgpu-core amdgpu-lib

--After executing the command, the system desktop environment returns to normal
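
To check whether a system is in the same state (amdgpu-dkms present but the packages from the command above absent), something like the following sketch may help. The function name is hypothetical, and it only compares package names, so it cannot tell whether the installed versions match each other.

```shell
# Given a list of package names on stdin (one per line), print whichever of
# the three packages from the command above are not in that list.
missing_amdgpu_userspace() {
    installed=$(cat)
    for pkg in amdgpu amdgpu-core amdgpu-lib; do
        printf '%s\n' "$installed" | grep -qxF "$pkg" || printf '%s\n' "$pkg"
    done
}

# Usage on Debian/Ubuntu. Note that dpkg-query -W lists every package known
# to dpkg, which can include removed-but-not-purged packages:
#   dpkg-query -W -f='${Package}\n' | missing_amdgpu_userspace
```

Empty output means all three names were found.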

Alic-Li commented 3 weeks ago

May I ask which video card model you tested with? Setting pwm1_enable manually also didn't work for me:

alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ echo 1|sudo tee ./pwm1_enable
1
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ cat ./pwm1_enable
2

Alic-Li commented 3 weeks ago

alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ cat ./pwm1_enable
2
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ echo 255|sudo tee ./pwm1
255
tee: ./pwm1: Invalid parameters
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ cat ./pwm1
51
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ echo 1|sudo tee ./pwm1_enable
1
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ cat ./pwm1_enable
2
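
To put the pwm1 reading above in context, here is a small conversion sketch. It assumes the raw pwm1 scale runs from 0 to 255 (the value 255 is what was written above as the maximum); on a real system the actual bounds should be checked in pwm1_min and pwm1_max if those files are present. The function names are hypothetical.

```shell
# Convert a raw pwm1 reading (assumed 0-255 scale) to a whole-number percentage.
pwm_to_percent() {
    echo $(( $1 * 100 / 255 ))
}

# Convert a percentage back to the nearest raw pwm1 value on the same scale.
percent_to_pwm() {
    echo $(( ($1 * 255 + 50) / 100 ))
}

# Usage: pwm_to_percent "$(cat ./pwm1)"
```

Under that assumption, the reading of 51 corresponds to a 20% duty cycle, so the fan stayed under automatic control at a low setting.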

Alic-Li commented 3 weeks ago

So I don't think there is a bug in the SMI. Thanks for your help. I'll wait for the result of the internal ticket. In addition, I have an RX 6750 XT video card, and its fan speed can be adjusted under the same conditions. That's really a bit weird. I hope we can make the Radeon software ecosystem better together.

kentrussell commented 3 weeks ago

So amdgpu-dkms will replace the amdgpu kernel module with the newer one. It also points to a regression in the newer amdgpu-dkms code. As for my model that I tested, it's an old Fiji Nano R9 Fury. It does what I need it to do for testing simple things like power management. The internal ticket should be good @ppanchad-amd . Can you add this info to it as well? Thanks!

ppanchad-amd commented 3 weeks ago

@kentrussell Will do. Thanks!

schung-amd commented 3 days ago

Hi @Alic-Li, an update on this: we've found a driver incompatibility in ROCm 6.2 that can cause the flickering screen + slow app loading issue in some configurations. This is being addressed in future ROCm releases, but for now additional workarounds are installing ROCm using the installer with --usecase=graphics,rocm or updating mesa drivers. If the solution you found for your system is still working for you, great! If not, you can try one of those additional workarounds. Thanks for bringing this to our attention.
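
The first workaround above could look roughly like the sketch below. It assumes "the installer" refers to the amdgpu-install script, which the comment does not name explicitly; the --usecase value is taken verbatim from the comment. The wrapper function and the RUN variable are hypothetical, and by default the command is only printed so it can be reviewed first.

```shell
# Print (default) or run the installer with the suggested use cases.
# Set RUN=1 in the environment to actually execute it.
install_with_graphics() {
    cmd="amdgpu-install --usecase=graphics,rocm"   # installer name is an assumption
    if [ "${RUN:-0}" = "1" ]; then
        sudo $cmd
    else
        echo "sudo $cmd"
    fi
}

# Usage:
#   install_with_graphics          # dry run, prints the command
#   RUN=1 install_with_graphics    # executes it
```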

Alic-Li commented 3 days ago
sudo apt install amdgpu amdgpu-core amdgpu-lib

--After executing the command, the system desktop environment returns to normal

I'm glad to hear you found the problem. Thank you for your reply! My solution still works. Can I close this issue? I hope my solution helps others who run into this problem.

schung-amd commented 3 days ago

Sure, if your problem has been solved I think we can close this issue. Feel free to reopen it if your solution stops working. Thanks again for your report and investigation!