Closed alvinshan closed 1 year ago
In ROCm 5.7, the 7900XTX has official support in Ubuntu 22.04 (see https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html).
When you say that it's not supported, what are you seeing to make you think that? There shouldn't be any issues getting the GPU to be recognized and initialized successfully in the kernel, provided you're in Ubuntu 22.04. Even outside of that specific distro, it shouldn't be unsupported.
I am not using the official Ubuntu kernel. I am using a kernel built from the latest ROCK-Kernel-Driver, which is the latest version of this repository. When I use the RX7900XTX under this kernel, the graphics card does not work.
What I want to ask is: when will a Linux kernel compiled from ROCK-Kernel-Driver support the RX7900XTX?
Best regards,
It will be hard to say what the timeframe is, since I don't know what the issue is. Can you attach a full dmesg, as well as what you tried unsuccessfully? The ROCm 6.0 release should flesh out the kernel support a bit better, but there could be other issues that might be causing the problem. Saying that it doesn't work could mean anything from a lack of power on, a bad IFWI, missing GFX version checks in the kernel, a missing KCL definition, ROCr not recognizing the GFX family, HIP not identifying it correctly, etc.
Below is the dmesg output. In addition, I checked the relevant information for the RX7900XTX on Ubuntu 22.04, and the official release is a deb installation package. When will open-source Linux support for the RX7900XTX be released? Best regards
[ 6.450739] [drm] amdgpu kernel modesetting enabled.
[ 6.450743] [drm] amdgpu version: 6.1.0
[ 6.450744] [drm] OS DRM version: 6.2.8
[ 6.450936] amdgpu: CRAT table not found
[ 6.450940] amdgpu: Virtual CRAT table created for CPU
[ 6.450951] amdgpu: Topology: Add CPU node
[ 6.455190] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1DA2:0x471E 0xC8).
[ 6.455209] [drm] register mmio base: 0xFE800000
[ 6.455210] [drm] register mmio size: 1048576
[ 6.458861] [drm] add ip block number 0 <soc21_common>
[ 6.458863] [drm] add ip block number 1 <gmc_v11_0>
[ 6.458864] [drm] add ip block number 2 <ih_v6_0>
[ 6.458865] [drm] add ip block number 3 <psp>
[ 6.458866] [drm] add ip block number 4 <smu>
[ 6.458867] [drm] add ip block number 5 <dm>
[ 6.458868] [drm] add ip block number 6 <gfx_v11_0>
[ 6.458870] [drm] add ip block number 7 <sdma_v6_0>
[ 6.458871] [drm] add ip block number 8 <vcn_v4_0>
[ 6.458872] [drm] add ip block number 9 <jpeg_v4_0>
[ 6.458873] [drm] add ip block number 10 <mes_v11_0>
[ 6.470154] [drm] BIOS signature incorrect ff ff
[ 6.517953] amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 6.517958] amdgpu: ATOM BIOS: 113-3E4710U-O4W
[ 6.518515] amdgpu 0000:01:00.0: amdgpu: CP RS64 enable
[ 6.518798] [drm] VCN(0) encode/decode are enabled in VM mode
[ 6.518799] [drm] VCN(1) encode/decode are enabled in VM mode
[ 6.518928] amdgpu 0000:01:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
[ 6.519374] amdgpu 0000:01:00.0: Direct firmware load for amdgpu/gc_11_0_0_mes_2.bin failed with error -2
[ 6.519377] [drm] try to fall back to amdgpu/gc_11_0_0_mes.bin
[ 6.519728] Console: switching to colour dummy device 80x25
[ 6.519785] amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 6.519805] amdgpu 0000:01:00.0: amdgpu: PCIE atomic ops is not supported
[ 6.519955] amdgpu 0000:01:00.0: amdgpu: MEM ECC is not presented.
[ 6.519956] amdgpu 0000:01:00.0: amdgpu: SRAM ECC is not presented.
[ 6.519966] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 6.527765] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0xe0000000-0xe01fffff 64bit pref]
[ 6.527791] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0xd0000000-0xdfffffff 64bit pref]
[ 6.528053] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0xd0000000-0xdfffffff 64bit pref]
[ 6.528129] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0xe0000000-0xe01fffff 64bit pref]
[ 6.540897] amdgpu 0000:01:00.0: amdgpu: VRAM: 24560M 0x0000008000000000 - 0x00000085FEFFFFFF (24560M used)
[ 6.540902] amdgpu 0000:01:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 6.540904] amdgpu 0000:01:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[ 6.540929] [drm] Detected VRAM RAM=24560M, BAR=256M
[ 6.540931] [drm] RAM width 384bits GDDR6
[ 6.542734] [drm] amdgpu: 24560M of VRAM memory ready
[ 6.542737] [drm] amdgpu: 12011M of GTT memory ready.
[ 6.542786] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 6.542894] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[ 6.559288] [drm] Loading DMUB firmware via PSP: version=0x07000A01
[ 6.559753] [drm] Found VCN firmware Version ENC: 1.9 DEC: 5 VEP: 0 Revision: 1
[ 6.559770] amdgpu 0000:01:00.0: amdgpu: Will use PSP to load VCN firmware
[ 6.560541] [drm] max_doorbell_slices=255
[ 6.713768] [drm] reserve 0x1300000 from 0x85fc000000 for PSP TMR
[ 6.843508] amdgpu 0000:01:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 6.843515] amdgpu 0000:01:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 6.843578] amdgpu 0000:01:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x00000034, smu fw program = 0, smu fw version = 0x004e4b00 (78.75.0)
[ 6.843587] amdgpu 0000:01:00.0: amdgpu: SMU driver if version not matched
[ 7.003830] amdgpu 0000:01:00.0: amdgpu: SMU is initialized successfully!
[ 7.004438] [drm] Display Core v3.2.241 initialized on DCN 3.2
[ 7.004440] [drm] DP-HDMI FRL PCON supported
[ 7.006544] [drm] DMUB hardware initialized: version=0x07000A01
[ 7.047959] [drm] kiq ring mec 3 pipe 1 q 0
[ 7.061182] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 7.061392] amdgpu 0000:01:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 7.218189] memmap_init_zone_device initialised 6291456 pages in 64ms
[ 7.218201] amdgpu: HMM registered 24560MB device memory
[ 7.218215] kfd kfd: amdgpu: skipped device 1002:744c, PCI rejects atomics 494<509
[ 7.218245] amdgpu 0000:01:00.0: amdgpu: SE 6, SH per SE 2, CU per SH 8, active_cu_number 96
[ 7.218259] amdgpu 0000:01:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 7.218261] amdgpu 0000:01:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 7.218262] amdgpu 0000:01:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 7.218263] amdgpu 0000:01:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 7.218265] amdgpu 0000:01:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 7.218266] amdgpu 0000:01:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 7.218268] amdgpu 0000:01:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 7.218269] amdgpu 0000:01:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 7.218270] amdgpu 0000:01:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 7.218272] amdgpu 0000:01:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 7.218273] amdgpu 0000:01:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 7.218274] amdgpu 0000:01:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 7.218276] amdgpu 0000:01:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[ 7.218277] amdgpu 0000:01:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8
[ 7.218278] amdgpu 0000:01:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
[ 7.219965] [drm] ring gfx_32768.1.1 was added
[ 7.220397] [drm] ring compute_32768.2.2 was added
[ 7.220746] [drm] ring sdma_32768.3.3 was added
[ 7.220829] [drm] ring gfx_32768.1.1 ib test pass
[ 7.220901] [drm] ring compute_32768.2.2 ib test pass
[ 7.220935] [drm] ring sdma_32768.3.3 ib test pass
[ 7.242359] amdgpu 0000:01:00.0: amdgpu: Using BACO for runtime pm
[ 7.243101] [drm] Initialized amdgpu 3.54.0 20150101 for 0000:01:00.0 on minor 0
[ 7.245472] amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
So the code here in this repo is the same as the code in the amdgpu-dkms deb, which means that this code would support your GPU, at least in Ubuntu 22.04 (which is what we tested). There's nothing in the code here preventing you from using your GPU. Also, from the log above, the GPU appears to be initialized correctly:
[ 7.243101] [drm] Initialized amdgpu 3.54.0 20150101 for 0000:01:00.0 on minor 0
It just doesn't have any monitors detected. Even all of the ring tests passed. What makes you think that it's not working?
You can even check that DPM is working properly by running rocm-smi or rocminfo. If you haven't installed the full ROCm stack, then you can check some generic files like "cat /sys/class/drm/card0/device/vbios_version" to see that it was populated correctly (should return 113-3E4710U-O4W in your case)
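For anyone who wants to script the sysfs check mentioned above, here is a minimal Python sketch. It only reads the vbios_version file named in this thread. The function name and the sysfs_root parameter are made up for illustration, and the card numbering on your system may differ.

```python
from pathlib import Path


def read_vbios_versions(sysfs_root="/sys/class/drm"):
    """Return {card name: VBIOS version string} for every card that exposes one.

    Hypothetical helper: it simply reads card*/device/vbios_version,
    the file mentioned in the reply above.
    """
    versions = {}
    for attr in sorted(Path(sysfs_root).glob("card*/device/vbios_version")):
        card = attr.parent.parent.name  # e.g. "card0"
        versions[card] = attr.read_text().strip()
    return versions


if __name__ == "__main__":
    for card, version in read_vbios_versions().items():
        print(f"{card}: {version}")
```

An empty result would suggest that no card populated the file, which is worth comparing against the dmesg output.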
I think the problem may be in the kgd2kfd_device_init function. dmesg has a log for this.
[ 7.218215] kfd kfd: amdgpu: skipped device 1002:744c, PCI rejects atomics 494<509
Devices that boot normally with the driver installed from the deb package do not have this log line.
It seems that the MEC FW version number did not pass the check (494 is less than 509, resulting in a direct return). On a device that works normally (driver installed from the Ubuntu deb), the MEC FW version number is 528. So why does the MEC FW report 494 and not 528? As far as I know, the MEC FW version is extracted from the header information of the FW .bin file. The file gc_11_0_0_mec.bin is the one used when the problem occurs. On the normally booting device installed from the deb, I went through all the header information in the MEC .bin file but found no version 528. What could the problem be?
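To spot this condition quickly in a saved dmesg, a small parser like the following could help. It is only a sketch: the regular expression is written against the exact message quoted in this thread, and the field names "found" and "required" reflect the reading given above (MEC FW version found versus the minimum accepted), not names taken from the kernel source.

```python
import re

# Matches the message quoted in this thread, e.g.
# "kfd kfd: amdgpu: skipped device 1002:744c, PCI rejects atomics 494<509"
SKIP_RE = re.compile(
    r"skipped device (?P<device>[0-9a-fA-F]{4}:[0-9a-fA-F]{4}), "
    r"PCI rejects atomics (?P<found>\d+)<(?P<required>\d+)"
)


def find_atomics_rejections(dmesg_text):
    """Return one dict per 'PCI rejects atomics' line found in dmesg_text."""
    results = []
    for match in SKIP_RE.finditer(dmesg_text):
        results.append(
            {
                "device": match["device"],            # PCI vendor:device ID
                "found": int(match["found"]),         # version that was loaded
                "required": int(match["required"]),   # minimum that would pass
            }
        )
    return results


if __name__ == "__main__":
    line = ("[ 7.218215] kfd kfd: amdgpu: skipped device 1002:744c, "
            "PCI rejects atomics 494<509")
    print(find_atomics_rejections(line))
```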
An easy way to check the MEC FW is by "cat /sys/class/drm/card*/device/fw_version/mec_fw_version". The version is indeed part of that .bin file, so there's a chance that it's grabbing the wrong one. For our ROCm releases, we have the amdgpu-dkms-firmware package, which contains all of the required FW. It gets installed to /lib/firmware/updates/amdgpu. If you're using a monolithic or custom-built kernel, there's a chance that the system is grabbing the FW from /lib/firmware/amdgpu first, which could be the outdated firmware. I checked and the 5.7 release has v550, which would definitely be beyond that 509 MEC requirement. To be 100% sure, if you xxd the .bin file (in case they differ and the kernel is grabbing the distro-provided one), the line at offset 00000010 should have the version as the first block:
00000010: 2602 0000 2035 0600 0001 0000 8b6f a881 &... 5.......o..
The bytes are little-endian, so 2602 = 02 26 = 0x226 = 550. That's what I grabbed on the 5.7 release, as an example.
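The byte-swapping arithmetic above can be reproduced with a few lines of Python. This is a sketch that assumes, as described in this thread, that the version is a little-endian 32-bit value at offset 0x10 of the .bin file. The function name and the zero-filled placeholder for the first 16 bytes are illustrative only.

```python
import struct


def ucode_version(header, offset=0x10):
    """Decode a little-endian 32-bit version field from firmware header bytes."""
    if len(header) < offset + 4:
        raise ValueError("header too short to contain a version field")
    (version,) = struct.unpack_from("<I", header, offset)
    return version


# The 16 bytes from the xxd line at offset 00000010, preceded by 16 zero
# bytes standing in for the first line, which is not shown in this thread.
SAMPLE = bytes(0x10) + bytes.fromhex("2602 0000 2035 0600 0001 0000 8b6f a881")

if __name__ == "__main__":
    version = ucode_version(SAMPLE)
    print(version, hex(version))  # 550 0x226
```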
Thank you very much for your answer. The problem is indeed caused by an outdated firmware version. Our driver gets the FW from the /lib/firmware/amdgpu directory.
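For readers hitting the same problem, one way to see whether the two firmware directories mentioned in this thread hold different MEC versions is sketched below. It assumes uncompressed .bin files with the version at offset 0x10, as discussed above. Some distros ship compressed firmware, which this sketch does not handle. The helper names are hypothetical.

```python
import struct
from pathlib import Path

# The two directories named in this thread.
DEFAULT_DIRS = ("/lib/firmware/updates/amdgpu", "/lib/firmware/amdgpu")


def header_version(path, offset=0x10):
    """Return the little-endian 32-bit value at offset, or None if unreadable."""
    try:
        with open(path, "rb") as f:
            f.seek(offset)
            raw = f.read(4)
    except OSError:
        return None  # missing file, permission error, etc.
    if len(raw) < 4:
        return None  # file too short to hold the field
    return struct.unpack("<I", raw)[0]


def firmware_versions(filename="gc_11_0_0_mec.bin", directories=DEFAULT_DIRS):
    """Map each directory to the version found in its copy of filename."""
    return {str(d): header_version(Path(d) / filename) for d in directories}


if __name__ == "__main__":
    for directory, version in firmware_versions().items():
        print(f"{directory}: {version}")
```

If the two directories report different numbers, the lower one is a candidate for the outdated copy that the driver picked up.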
Excellent! So after getting it pointed to the newer FW .bin files, is it initializing correctly now?
Yes, the driver can be properly initialized with the new FW bin file. The device works normally.
Awesome. I'll close this off then. Have a great week!
Hello. Thanks for your work. When I ran the RX7900XTX GPU, I found that the latest ROCK-Kernel-Driver did not support the graphics card. When can the driver for the RX7900XTX be released?
Best regards,