ROCm / ROCK-Kernel-Driver

AMDGPU Driver with KFD used by the ROCm project. Also contains the current Linux Kernel that matches this base driver

Does latest Driver branch support and recognize GFX90A? #115

Closed ghostplant closed 2 weeks ago

ghostplant commented 3 years ago

rock-dkms from the ROCm 4.3.1 release fails to recognize GFX90A GPUs. Does the latest branch support them?

kentrussell commented 3 years ago

Can you provide a full dmesg? GFX90A should be recognized in 4.3.1 from the ROCK side.

ghostplant commented 3 years ago
$ lspci | grep Displ
0e:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 740c (rev 01)
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 740c (rev 01)
16:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 740c (rev 01)
19:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 740c (rev 01)
...
...
$ dmesg | grep AMD
[    0.000000]   AMD AuthenticAMD
[    0.021179] RAMDISK: [mem 0x387df000-0x3d25efff]
[    0.021189] ACPI: RSDP 0x00000000A6EB4014 000024 (v02 AMD   )
[    0.021193] ACPI: XSDT 0x00000000A6EB3728 0000FC (v01 AMD    ETHANOLX 03242016 AMI  01000013)
[    0.021199] ACPI: FACP 0x00000000A6A7A000 000114 (v06 AMD    ETHANOLX 03242016 AMI  00010013)
[    0.021204] ACPI: DSDT 0x00000000A6A69000 010EC6 (v02 AMD    ETHANOLX 03242016 INTL 20120913)
[    0.021210] ACPI: SSDT 0x00000000A6A7C000 00094E (v02 AMD    AmdTable 00000002 MSFT 02000002)
[    0.021213] ACPI: SPMI 0x00000000A6A7B000 000041 (v05 AMD    ETHANOLX 00000000 AMI. 00000000)
[    0.021216] ACPI: FPDT 0x00000000A6A68000 000044 (v01 AMD    ETHANOLX 03242016 AMI  00010013)
[    0.021219] ACPI: FIDT 0x00000000A6A67000 00009C (v01 AMD    ETHANOLX 03242016 AMI  00010013)
[    0.021222] ACPI: MCFG 0x00000000A6A66000 00003C (v01 AMD    ETHANOLX 03242016 MSFT 00010013)
[    0.021225] ACPI: SSDT 0x00000000A6A65000 000EAC (v02 AMD    CPUSSDT  03242016 AMI  03242016)
[    0.021228] ACPI: SSDT 0x00000000A6A64000 000110 (v01 AMD    CPMRAS   00000001 INTL 20120913)
[    0.021231] ACPI: BERT 0x00000000A6A63000 000030 (v01 AMD    AMD BERT 00000001 AMD  00000001)
[    0.021234] ACPI: EINJ 0x00000000A6A61000 000150 (v01 AMD    AMD EINJ 00000001 AMD  00000001)
[    0.021237] ACPI: HPET 0x00000000A6A60000 000038 (v01 AMD    ETHANOLX 03242016 AMI  00000005)
[    0.021240] ACPI: UEFI 0x00000000A6EA5000 000042 (v01 AMD    ETHANOLX 01072009 AMI  01000013)
[    0.021246] ACPI: TPM2 0x00000000A6A5E000 000034 (v04 AMD    ETHANOLX 00000001 AMI  00000000)
[    0.021249] ACPI: IVRS 0x00000000A6A5D000 000370 (v02 AMD    AmdTable 00000001 AMD  00000000)
[    0.021252] ACPI: PCCT 0x00000000A6A5C000 00006E (v02 AMD    AmdTable 00000001 AMD  00000000)
[    0.021254] ACPI: SSDT 0x00000000A6A42000 019DA4 (v01 AMD    AmdTable 00000001 AMD  00000001)
[    0.021257] ACPI: SRAT 0x00000000A6A41000 0008F8 (v03 AMD    AmdTable 00000001 AMD  00000001)
[    0.021260] ACPI: MSCT 0x00000000A6A40000 00004E (v01 AMD    AmdTable 00000000 AMD  00000001)
[    0.021263] ACPI: SLIT 0x00000000A6A3F000 00003C (v01 AMD    AmdTable 00000001 AMD  00000001)
[    0.021266] ACPI: CRAT 0x00000000A6A30000 00E948 (v01 AMD    AmdTable 00000001 AMD  00000001)
[    0.021269] ACPI: CDIT 0x00000000A6A2F000 000038 (v01 AMD    AmdTable 00000001 AMD  00000001)
[    0.021272] ACPI: SSDT 0x00000000A6A2D000 0017DC (v01 AMD    CPMCMN   00000001 INTL 20120913)
[    0.021275] ACPI: WSMT 0x00000000A6A2C000 000028 (v01 AMD    ETHANOLX 03242016 AMI  00010013)
[    0.021278] ACPI: APIC 0x00000000A6A2B000 0008B2 (v04 AMD    ETHANOLX 03242016 AMI  00010013)
[    0.021280] ACPI: HEST 0x00000000A69BA000 070A74 (v01 AMD    AMD HEST 00000001 AMD  00000001)
[    7.086273] Spectre V2 : Mitigation: Full AMD retpoline
[    7.197102] smpboot: CPU0: AMD EPYC 7V12 64-Core Processor (family: 0x17, model: 0x31, stepping: 0x0)
[    7.197373] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[    8.367422] pci 0000:6f:00.2: AMD-Vi: IOMMU performance counters supported
[    8.367469] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    8.367508] pci 0000:2f:00.2: AMD-Vi: IOMMU performance counters supported
[    8.367534] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    8.367584] pci 0000:ef:00.2: AMD-Vi: IOMMU performance counters supported
[    8.367619] pci 0000:c1:00.2: AMD-Vi: IOMMU performance counters supported
[    8.367670] pci 0000:b0:00.2: AMD-Vi: IOMMU performance counters supported
[    8.367713] pci 0000:81:00.2: AMD-Vi: IOMMU performance counters supported
[    8.537476] pci 0000:6f:00.2: AMD-Vi: Found IOMMU cap 0x40
[    8.537484] pci 0000:6f:00.2: AMD-Vi: Extended features (0x58f77ef22294ade):
[    8.537490] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[    8.537491] pci 0000:40:00.2: AMD-Vi: Extended features (0x58f77ef22294ade):
[    8.537495] pci 0000:2f:00.2: AMD-Vi: Found IOMMU cap 0x40
[    8.537496] pci 0000:2f:00.2: AMD-Vi: Extended features (0x58f77ef22294ade):
[    8.537500] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    8.537501] pci 0000:00:00.2: AMD-Vi: Extended features (0x58f77ef22294ade):
[    8.537505] pci 0000:ef:00.2: AMD-Vi: Found IOMMU cap 0x40
[    8.537506] pci 0000:ef:00.2: AMD-Vi: Extended features (0x58f77ef22294ade):
[    8.537510] pci 0000:c1:00.2: AMD-Vi: Found IOMMU cap 0x40
[    8.537510] pci 0000:c1:00.2: AMD-Vi: Extended features (0x58f77ef22294ade):
[    8.537514] pci 0000:b0:00.2: AMD-Vi: Found IOMMU cap 0x40
[    8.537515] pci 0000:b0:00.2: AMD-Vi: Extended features (0x58f77ef22294ade):
[    8.537519] pci 0000:81:00.2: AMD-Vi: Found IOMMU cap 0x40
[    8.537519] pci 0000:81:00.2: AMD-Vi: Extended features (0x58f77ef22294ade):
[    8.537523] AMD-Vi: Interrupt remapping enabled
[    8.537523] AMD-Vi: Virtual APIC enabled
[    8.537524] AMD-Vi: X2APIC enabled
[    8.538325] AMD-Vi: Lazy IO/TLB flushing enabled
[    8.544101] perf: AMD IBS detected (0x000003ff)
[    8.544120] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    8.544138] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    8.544157] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[    8.544175] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
[    8.544198] perf/amd_iommu: Detected AMD IOMMU #4 (2 banks, 4 counters/bank).
[    8.544217] perf/amd_iommu: Detected AMD IOMMU #5 (2 banks, 4 counters/bank).
[    8.544238] perf/amd_iommu: Detected AMD IOMMU #6 (2 banks, 4 counters/bank).
[    8.544258] perf/amd_iommu: Detected AMD IOMMU #7 (2 banks, 4 counters/bank).
[   10.124966] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
$ ls /dev/kfd
/dev/kfd
$ ls /dev/dri/
ls: cannot access '/dev/dri/': No such file or directory
$ dpkg -l | grep rock
ii  rock-dkms                                  1:4.3-59                                all          amdgpu driver in DKMS format.
ii  rock-dkms-firmware                         1:4.3-59                                all          firmware blobs used by amdgpu driver in DKMS format
$ uname -a
Linux mi200ev2-linux 5.11.0-27-generic #29~20.04.1-Ubuntu SMP Wed Aug 11 15:58:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
kentrussell commented 3 years ago

Can we get a full dmesg, without the grep? If there's something sensitive, feel free to cut it out, but I want to see everything from drm, pci, amd, amdgpu, amdkfd and kfd, so the AMD-only grep doesn't help much there. Note that DID 0x740C is supported in 4.3.1: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/rocm-4.3.1/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L1223

So let's try to track this down and get you up and running.
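
A minimal sketch of a filter that keeps only the subsystems named in this request, assuming the kernel log is piped in on stdin. The function name `gpu_dmesg` is hypothetical, and the match is deliberately broad (case-insensitive substrings), so unrelated lines such as `RAMDISK` will still slip through.

```shell
# Keep kernel log lines that mention the GPU-related subsystems.
# "amd" also covers amdgpu and amdkfd; "kfd" covers the standalone kfd prefix.
# Usage (assumed): sudo dmesg | gpu_dmesg > dmesg-gpu.txt
gpu_dmesg() {
    grep -iE 'drm|pci|amd|kfd'
}
```

A broad filter is the safer choice here: it is easier to skim past a few extra lines than to miss the one line that explains why the driver never probed.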

ghostplant commented 3 years ago

dmesg.txt

Attached.

kentrussell commented 3 years ago

Alright, so it looks like amdgpu never even tries to load. Let's try a couple of things. First, what does "dkms status" return? If it returns "installed" as the status, try "sudo modprobe amdgpu" and see if it comes up. If it only returns "added", then there was an installation failure, and you can usually find the log at /var/lib/dkms/amdgpu-$VER/build/make.log (where $VER is the version of rock-dkms that you installed).

Note that this path would also be printed out during the installation of rock-dkms saying "Errors occurred, consult /var/lib/..... for more information" or something to that effect. Hopefully it's just a little compilation error and we can address it
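
The status check described above can be sketched as a small helper, assuming the output of `dkms status` is piped in on stdin and that the state is the text after the last `: ` on the module's line. The function name `dkms_state` and the `missing` result for an empty listing are illustrative choices, not part of dkms.

```shell
# Print the DKMS state ("installed", "built", "added", ...) for a module.
# $1: module name, e.g. amdgpu; stdin: output of `dkms status`.
# Prints "missing" when no line for the module is present.
dkms_state() {
    awk -v mod="$1" '
        index($0, mod) == 1 {
            n = split($0, parts, ": ")
            state = parts[n]
            sub(/ .*/, "", state)   # drop any trailing note after the state word
            print state
            found = 1
            exit
        }
        END { if (!found) print "missing" }
    '
}

# Usage (assumed): dkms status | dkms_state amdgpu
```

An empty `dkms status` listing is reported as `missing`, which is a different situation from `added`: the module was never registered with DKMS at all, so there is no build log to look for yet.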

ghostplant commented 3 years ago
$ dpkg -l | grep rock
ii  rock-dkms                                  1:4.3-59                                all          amdgpu driver in DKMS format.
ii  rock-dkms-firmware                         1:4.3-59                                all          firmware blobs used by amdgpu driver in DKMS format

$ dkms status

$ modprobe amdgpu

$ lsmod | grep amdgpu
amdgpu               6053888  0
iommu_v2               24576  1 amdgpu
gpu_sched              40960  1 amdgpu
drm_ttm_helper         16384  2 drm_vram_helper,amdgpu
ttm                    73728  3 drm_vram_helper,amdgpu,drm_ttm_helper
drm_kms_helper        237568  5 drm_vram_helper,ast,amdgpu
i2c_algo_bit           16384  2 ast,amdgpu
drm                   548864  9 gpu_sched,drm_kms_helper,drm_vram_helper,ast,amdgpu,drm_ttm_helper,ttm

$ ls /var/lib/dkms/amdgpu-*
ls: cannot access '/var/lib/dkms/amdgpu-*': No such file or directory
kentrussell commented 3 years ago

Definitely baffled here, since dkms doesn't look like it's even done anything. Normally dkms gets pulled in, so it should provide something. Maybe we can get things working. Is the code in /usr/src/amdgpu-4.3-59? dpkg showing that it installed implies that it should be. If so, you can try to get it building via:

sudo dkms add amdgpu/4.3-59
sudo dkms build amdgpu/4.3-59 -k $(uname -r)/x86_64
sudo dkms install amdgpu/4.3-59 -k $(uname -r)/x86_64

Let me know how it goes!
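
For repeat attempts, the add/build/install sequence can be wrapped in one function. This is only a sketch: the name `dkms_rebuild` and the optional "runner" argument (pass `echo` for a dry run) are hypothetical conveniences, and it assumes that a failing `dkms add` usually means the module is already registered.

```shell
# Run dkms add, build and install for one module/version and kernel.
# $1: module/version (e.g. amdgpu/4.3-59)   $2: kernel release (uname -r)
# $3: architecture (e.g. x86_64)            $4: runner prefix, default "sudo"
dkms_rebuild() {
    modver=$1; kernel=$2; arch=$3; run=${4:-sudo}
    # Assumption: "add" may fail when the module is already registered,
    # so a failure here is reported but does not stop the sequence.
    $run dkms add "$modver" ||
        echo "note: 'dkms add' failed (possibly already added); continuing" >&2
    # Install only if the build succeeded.
    $run dkms build "$modver" -k "$kernel/$arch" &&
        $run dkms install "$modver" -k "$kernel/$arch"
}

# Dry run (prints the commands instead of executing them):
#   dkms_rebuild amdgpu/4.3-59 "$(uname -r)" x86_64 echo
```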

ghostplant commented 3 years ago
$ dkms build amdgpu/4.3-59 -k $(uname -r)/x86_64

Kernel preparation unnecessary for this kernel.  Skipping...

Running the pre_build script:
checking for a BSD-compatible install... /bin/install -c
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking how to run the C preprocessor... gcc -E
checking kernel source directory... /usr/src/linux-headers-5.11.0-27-generic
checking kernel build directory... /usr/src/linux-headers-5.11.0-27-generic
checking kernel source version... 5.11.0-27-generic
checking kernel file name for module symbols... Module.symvers
checking for linux/overflow.h... yes
checking for linux/sched/mm.h... yes
checking for linux/sched/task.h... yes
checking for linux/sched/signal.h... yes
checking for linux/nospec.h... yes
checking for linux/bits.h... yes
checking for linux/io-64-nonatomic-lo-hi.h... yes
checking for asm/set_memory.h... yes
checking for asm/fpu/api.h... yes
checking for uapi/linux/sched/types.h... yes
checking for linux/compiler_attributes.h... yes
checking for linux/dma-fence.h... yes
checking for linux/dma-resv.h... yes
checking for linux/mmap_lock.h... yes
checking for linux/pci-p2pdma.h... yes
checking for linux/dma-attrs.h... no
checking for linux/mem_encrypt.h... yes
checking for linux/dma-buf-map.h... yes
checking for drm/drm_backport.h... no
checking for drm/amdgpu_pciid.h... no
checking for drm/drm_auth.h... yes
checking for drm/drm_irq.h... yes
checking for drm/drm_connector.h... yes
checking for drm/drm_encoder.h... yes
checking for drm/drm_plane.h... yes
checking for drm/drm_print.h... yes
checking for drm/drm_drv.h... yes
checking for drm/drm_file.h... yes
checking for drm/drm_debugfs.h... yes
checking for drm/drm_ioctl.h... yes
checking for drm/drm_vblank.h... yes
checking for drm/drm_device.h... yes
checking for drm/drm_gem_framebuffer_helper.h... yes
checking for drm/drm_hdcp.h... yes
checking for drm/drm_audio_component.h... yes
checking for drm/drm_util.h... yes
checking for drm/drm_atomic_uapi.h... yes
checking for drm/drm_probe_helper.h... yes
checking for drm/drmP.h... no
checking for drm/task_barrier.h... yes
checking for drm/drm_managed.h... yes
checking for drm/drm_gem_ttm_helper.h... yes
checking for module configuration... done
configure: creating ./config.status
config.status: creating config/config.h

Building module:
cleaning build area...(bad exit status: 2)
make -j128 KERNELRELEASE=5.11.0-27-generic -j128 TTM_NAME=amdttm SCHED_NAME=amd-sched -C /lib/modules/5.11.0-27-generic/build M=/var/lib/dkms/amdgpu/4.3-59/build.....
Signing module:
 - /var/lib/dkms/amdgpu/4.3-59/5.11.0-27-generic/x86_64/module/amdgpu.ko
 - /var/lib/dkms/amdgpu/4.3-59/5.11.0-27-generic/x86_64/module/amd-sched.ko
 - /var/lib/dkms/amdgpu/4.3-59/5.11.0-27-generic/x86_64/module/amdttm.ko
 - /var/lib/dkms/amdgpu/4.3-59/5.11.0-27-generic/x86_64/module/amdkcl.ko
Secure Boot not enabled on this system.
cleaning build area...(bad exit status: 2)

DKMS: build completed.

$ dkms install amdgpu/4.3-59 -k $(uname -r)/x86_64
Forcing installation of amdgpu

amdgpu.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.11.0-27-generic/updates/dkms/

amdttm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.11.0-27-generic/updates/dkms/

amdkcl.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.11.0-27-generic/updates/dkms/

amd-sched.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.11.0-27-generic/updates/dkms/

Running the post_install script:

depmod...

Backing up initrd.img-5.11.0-27-generic to /boot/initrd.img-5.11.0-27-generic.old-dkms
Making new initrd.img-5.11.0-27-generic
(If next boot fails, revert to initrd.img-5.11.0-27-generic.old-dkms image)
update-initramfs........

DKMS: install completed.

$ dkms status
amdgpu, 4.3-59, 5.11.0-27-generic, x86_64: installed

$ modprobe amdgpu

$ dmesg
..
[25646.268125] pcieport 0000:98:00.0:   bridge window [mem 0x700d0000000-0x700efffffff 64bit pref]
[25646.268130] pcieport 0000:99:00.0: PCI bridge to [bus 9a]
[25646.268135] pcieport 0000:99:00.0:   bridge window [mem 0xb1000000-0xb10fffff]
[25646.268138] pcieport 0000:99:00.0:   bridge window [mem 0x700d0000000-0x700efffffff 64bit pref]
[25646.268151] [drm] Not enough PCI address space for a large BAR.
[25646.268152] amdgpu 0000:9a:00.0: BAR 0: assigned [mem 0x700d0000000-0x700dfffffff 64bit pref]
[25646.268162] amdgpu 0000:9a:00.0: BAR 2: assigned [mem 0x700e0000000-0x700e01fffff 64bit pref]
[25646.268184] amdgpu 0000:9a:00.0: amdgpu: VRAM: 65520M 0x0000024000000000 - 0x0000024FFEFFFFFF (65520M used)
[25646.268186] amdgpu 0000:9a:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[25646.268187] amdgpu 0000:9a:00.0: amdgpu: AGP: 265289728M 0x0000030000000000 - 0x0000FFFFFFFFFFFF
[25646.268195] [drm] Detected VRAM RAM=65520M, BAR=256M
[25646.268196] [drm] RAM width 4096bits HBM
[25646.268216] [drm] amdgpu: 65520M of VRAM memory ready
[25646.268218] [drm] amdgpu: 2064175M of GTT memory ready.
[25646.268220] [drm] GART: num cpu pages 131072, num gpu pages 131072
[25646.268345] [drm] PCIE GART of 512M enabled.
[25646.268346] [drm] PTB located at 0x0000024000000000
[25646.269793] [drm] Found VCN firmware Version ENC: 1.1 DEC: 1 VEP: 0 Revision: 21
[25646.269799] [drm] PSP loading VCN firmware
[25646.594906] [drm:psp_hw_start [amdgpu]] *ERROR* PSP load sos failed!
[25646.596219] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[25646.597416] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[25646.598591] amdgpu 0000:9a:00.0: amdgpu: amdgpu_device_ip_init failed
[25646.599539] amdgpu 0000:9a:00.0: amdgpu: Fatal error during GPU init
[25646.600522] amdgpu: probe of 0000:9a:00.0 failed with error -22
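
To pull the failing block out of a long log like the one above, a sketch along these lines may help. It assumes the message format shown in this dmesg excerpt (`hw_init of IP block <name> failed <code>`); the function name `failed_ip_block` is hypothetical. On Linux, -22 corresponds to EINVAL.

```shell
# Print "<ip-block> <error-code>" for every failed hw_init line.
# stdin: dmesg text.
# Usage (assumed): sudo dmesg | failed_ip_block
failed_ip_block() {
    sed -n 's/.*hw_init of IP block <\([^>]*\)> failed \(-\{0,1\}[0-9][0-9]*\).*/\1 \2/p'
}
```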
ghostplant commented 3 years ago
$ ls /var/lib/dkms/amdgpu*
4.3-59  kernel-5.11.0-27-generic-x86_64

$ ls /var/lib/dkms/amdgpu/4.3-59/5.11.0-27-generic/x86_64/log/make.log
/var/lib/dkms/amdgpu/4.3-59/5.11.0-27-generic/x86_64/log/make.log

make.log

ghostplant commented 3 years ago

@kentrussell Is it related to improper BIOS settings?

kentrussell commented 3 years ago

The BIOS is definitely a possible cause. This time through, the PSP didn't load correctly, but at least it tried. For the most part, this error is addressed with newer firmware (though 4.3.1 has the latest firmware for GFX90A, so that wouldn't be the fix here) or with SBIOS/VBIOS updates.

I'd start with updating the SBIOS, ensuring that "Above 4G decoding" is enabled, and ensuring that you've got the latest base kernel (5.8 HWE) installed, which I believe is 5.8.0-65.73, instead of the 5.8.0-43 that you have installed there. Good luck! I'll do some more digging on this side as well to see if we have more things to try.

Out of curiosity, do you have any older-generation GPUs around that you can swap in, just to see if the HW is set up correctly? Swapping the card for something like a Vega20 or Fiji, or anything newer than Hawaii, should be plug-and-play: if you drop it in the system, it should just work. If it works with an older GPU, then it could be a HW issue with that card, or it could be that some support is still missing for that card.
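
On the "Above 4G decoding" point, the dmesg output earlier in the thread already reports the VRAM and BAR sizes, so a rough check is to compare the two. The sketch below assumes the `Detected VRAM RAM=<n>M, BAR=<n>M` line format seen in that log; the function name `bar_vs_vram` and the `full`/`partial` labels are hypothetical. A partial BAR only suggests that the large BAR was not mapped; whether that is related to the PSP failure here is not established.

```shell
# Report whether the PCI BAR covers all of VRAM.
# stdin: dmesg text.
# Prints "vram=<n>M bar=<n>M full" or "... partial" per matching line.
bar_vs_vram() {
    sed -n 's/.*Detected VRAM RAM=\([0-9][0-9]*\)M, BAR=\([0-9][0-9]*\)M.*/\1 \2/p' |
    while read -r vram bar; do
        if [ "$bar" -ge "$vram" ]; then kind=full; else kind=partial; fi
        echo "vram=${vram}M bar=${bar}M $kind"
    done
}

# Usage (assumed): sudo dmesg | bar_vs_vram
```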

Theoretically the kernel should support it, even though it's not on the officially-supported-hardware list. That's why I want to keep working through this, even though the Official ROCm documentation doesn't list support for it yet (should be in ROCm 4.5 officially, IIRC)

EDIT: Just to cover all of our bases, let's make sure that the FW is actually installed correctly (since PSP is the first FW block to load). Is there a folder called /lib/firmware/updates/amdgpu on your system? And if you do a lsinitramfs on the booted ramfs image, is the firmware located in lib/firmware/updates/amdgpu ?
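
The initramfs half of that check can be sketched as follows, assuming the output of `lsinitramfs` is piped in on stdin. The function name `initramfs_has_fw` is hypothetical, and treating `aldebaran_sos.bin` as the PSP image for GFX90A is an assumption based on the firmware file names that appear in this thread.

```shell
# Succeed (exit 0) if the listing contains <anything>/amdgpu/<file>.
# $1: firmware file name; stdin: output of `lsinitramfs <image>`.
initramfs_has_fw() {
    awk -v want="/amdgpu/$1" '
        length($0) >= length(want) &&
        substr($0, length($0) - length(want) + 1) == want { found = 1 }
        END { exit !found }
    '
}

# Usage (assumed file name, see above):
#   lsinitramfs "/boot/initrd.img-$(uname -r)" | initramfs_has_fw aldebaran_sos.bin \
#       && echo "PSP firmware present in initramfs"
```

The comparison is a plain suffix match rather than a regular expression, so the dots in the file name are matched literally.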

ghostplant commented 3 years ago
$ ls /lib/firmware/updates/amdgpu
aldebaran_mec2.bin     carrizo_mec2.bin            hainan_mc.bin      navi10_asd.bin       navi14_sos.bin           polaris10_k2_smc.bin  polaris12_mc.bin      renoir_mec.bin            tonga_sdma.bin       vega12_mec2.bin
aldebaran_mec.bin      carrizo_mec.bin             hainan_me.bin      navi10_ce.bin        navi14_ta.bin            polaris10_k_mc.bin    polaris12_me_2.bin    renoir_pfp.bin            tonga_smc.bin        vega12_mec.bin
aldebaran_rlc.bin      carrizo_pfp.bin             hainan_pfp.bin     navi10_gpu_info.bin  navi14_vcn.bin           polaris10_k_smc.bin   polaris12_me.bin      renoir_rlc.bin            tonga_uvd.bin        vega12_pfp.bin
aldebaran_sdma.bin     carrizo_rlc.bin             hainan_rlc.bin     navi10_me.bin        navy_flounder_ce.bin     polaris10_mc.bin      polaris12_mec2_2.bin  renoir_sdma.bin           tonga_vce.bin        vega12_rlc.bin
aldebaran_smc.bin      carrizo_sdma1.bin           hainan_smc.bin     navi10_mec2.bin      navy_flounder_dmcub.bin  polaris10_me_2.bin    polaris12_mec_2.bin   renoir_ta.bin             topaz_ce.bin         vega12_sdma1.bin
aldebaran_sos.bin      carrizo_sdma.bin            hawaii_ce.bin      navi10_mec.bin       navy_flounder_me.bin     polaris10_me.bin      polaris12_mec2.bin    renoir_vcn.bin            topaz_k_smc.bin      vega12_sdma.bin
aldebaran_ta.bin       carrizo_uvd.bin             hawaii_k_smc.bin   navi10_pfp.bin       navy_flounder_mec2.bin   polaris10_mec2_2.bin  polaris12_mec.bin     si58_mc.bin               topaz_mc.bin         vega12_smc.bin
aldebaran_vcn.bin      carrizo_vce.bin             hawaii_mc.bin      navi10_rlc.bin       navy_flounder_mec.bin    polaris10_mec_2.bin   polaris12_pfp_2.bin   sienna_cichlid_ce.bin     topaz_me.bin         vega12_sos.bin
arcturus_asd.bin       dimgrey_cavefish_ce.bin     hawaii_me.bin      navi10_sdma1.bin     navy_flounder_pfp.bin    polaris10_mec2.bin    polaris12_pfp.bin     sienna_cichlid_dmcub.bin  topaz_mec2.bin       vega12_uvd.bin
arcturus_gpu_info.bin  dimgrey_cavefish_dmcub.bin  hawaii_mec.bin     navi10_sdma.bin      navy_flounder_rlc.bin    polaris10_mec.bin     polaris12_rlc.bin     sienna_cichlid_me.bin     topaz_mec.bin        vega12_vce.bin
arcturus_mec2.bin      dimgrey_cavefish_me.bin     hawaii_pfp.bin     navi10_smc.bin       navy_flounder_sdma.bin   polaris10_pfp_2.bin   polaris12_sdma1.bin   sienna_cichlid_mec2.bin   topaz_pfp.bin        vega20_asd.bin
arcturus_mec.bin       dimgrey_cavefish_mec2.bin   hawaii_rlc.bin     navi10_sos.bin       navy_flounder_smc.bin    polaris10_pfp.bin     polaris12_sdma.bin    sienna_cichlid_mec.bin    topaz_rlc.bin        vega20_ce.bin
arcturus_rlc.bin       dimgrey_cavefish_mec.bin    hawaii_sdma1.bin   navi10_ta.bin        navy_flounder_sos.bin    polaris10_rlc.bin     polaris12_smc.bin     sienna_cichlid_mes.bin    topaz_sdma1.bin      vega20_me.bin
arcturus_sdma.bin      dimgrey_cavefish_pfp.bin    hawaii_sdma.bin    navi10_vcn.bin       navy_flounder_ta.bin     polaris10_sdma1.bin   polaris12_uvd.bin     sienna_cichlid_pfp.bin    topaz_sdma.bin       vega20_mec2.bin
arcturus_smc.bin       dimgrey_cavefish_rlc.bin    hawaii_smc.bin     navi12_asd.bin       navy_flounder_vcn.bin    polaris10_sdma.bin    polaris12_vce.bin     sienna_cichlid_rlc.bin    topaz_smc.bin        vega20_mec.bin
arcturus_sos.bin       dimgrey_cavefish_sdma.bin   hawaii_uvd.bin     navi12_ce.bin        oland_ce.bin             polaris10_smc.bin     raven2_asd.bin        sienna_cichlid_sdma.bin   vangogh_asd.bin      vega20_pfp.bin
arcturus_ta.bin        dimgrey_cavefish_smc.bin    hawaii_vce.bin     navi12_dmcu.bin      oland_k_smc.bin          polaris10_smc_sk.bin  raven2_ce.bin         sienna_cichlid_smc.bin    vangogh_ce.bin       vega20_rlc.bin
arcturus_vcn.bin       dimgrey_cavefish_sos.bin    kabini_ce.bin      navi12_gpu_info.bin  oland_mc.bin             polaris10_uvd.bin     raven2_gpu_info.bin   sienna_cichlid_sos.bin    vangogh_dmcub.bin    vega20_sdma1.bin
banks_k_2_smc.bin      dimgrey_cavefish_ta.bin     kabini_me.bin      navi12_me.bin        oland_me.bin             polaris10_vce.bin     raven2_me.bin         sienna_cichlid_ta.bin     vangogh_me.bin       vega20_sdma.bin
beige_goby_ce.bin      dimgrey_cavefish_vcn.bin    kabini_mec.bin     navi12_mec2.bin      oland_pfp.bin            polaris11_ce_2.bin    raven2_mec2.bin       sienna_cichlid_vcn.bin    vangogh_mec2.bin     vega20_smc.bin
beige_goby_dmcub.bin   fiji_ce.bin                 kabini_pfp.bin     navi12_mec.bin       oland_rlc.bin            polaris11_ce.bin      raven2_mec.bin        stoney_ce.bin             vangogh_mec.bin      vega20_sos.bin
beige_goby_me.bin      fiji_mc.bin                 kabini_rlc.bin     navi12_pfp.bin       oland_smc.bin            polaris11_k2_smc.bin  raven2_pfp.bin        stoney_me.bin             vangogh_pfp.bin      vega20_ta.bin
beige_goby_mec2.bin    fiji_me.bin                 kabini_sdma1.bin   navi12_rlc.bin       oland_uvd.bin            polaris11_k_mc.bin    raven2_rlc.bin        stoney_mec.bin            vangogh_rlc.bin      vega20_uvd.bin
beige_goby_mec.bin     fiji_mec2.bin               kabini_sdma.bin    navi12_sdma1.bin     picasso_asd.bin          polaris11_k_smc.bin   raven2_sdma.bin       stoney_pfp.bin            vangogh_sdma.bin     vega20_vce.bin
beige_goby_pfp.bin     fiji_mec.bin                kabini_uvd.bin     navi12_sdma.bin      picasso_ce.bin           polaris11_mc.bin      raven2_ta.bin         stoney_rlc.bin            vangogh_toc.bin      vegam_ce.bin
beige_goby_rlc.bin     fiji_pfp.bin                kabini_vce.bin     navi12_smc.bin       picasso_gpu_info.bin     polaris11_me_2.bin    raven2_vcn.bin        stoney_sdma.bin           vangogh_vcn.bin      vegam_me.bin
beige_goby_sdma.bin    fiji_rlc.bin                kaveri_ce.bin      navi12_sos.bin       picasso_me.bin           polaris11_me.bin      raven_asd.bin         stoney_uvd.bin            vega10_acg_smc.bin   vegam_mec2.bin
beige_goby_smc.bin     fiji_sdma1.bin              kaveri_me.bin      navi12_ta.bin        picasso_mec2.bin         polaris11_mec2_2.bin  raven_ce.bin          stoney_vce.bin            vega10_asd.bin       vegam_mec.bin
beige_goby_sos.bin     fiji_sdma.bin               kaveri_mec2.bin    navi12_vcn.bin       picasso_mec.bin          polaris11_mec_2.bin   raven_dmcu.bin        tahiti_ce.bin             vega10_ce.bin        vegam_pfp.bin
beige_goby_ta.bin      fiji_smc.bin                kaveri_mec.bin     navi14_asd.bin       picasso_pfp.bin          polaris11_mec2.bin    raven_gpu_info.bin    tahiti_k_smc.bin          vega10_gpu_info.bin  vegam_rlc.bin
beige_goby_vcn.bin     fiji_uvd.bin                kaveri_pfp.bin     navi14_ce.bin        picasso_rlc_am4.bin      polaris11_mec.bin     raven_kicker_rlc.bin  tahiti_mc.bin             vega10_me.bin        vegam_sdma1.bin
bonaire_ce.bin         fiji_vce.bin                kaveri_rlc.bin     navi14_ce_wks.bin    picasso_rlc.bin          polaris11_pfp_2.bin   raven_me.bin          tahiti_me.bin             vega10_mec2.bin      vegam_sdma.bin
bonaire_k_smc.bin      green_sardine_asd.bin       kaveri_sdma1.bin   navi14_gpu_info.bin  picasso_sdma.bin         polaris11_pfp.bin     raven_mec2.bin        tahiti_pfp.bin            vega10_mec.bin       vegam_smc.bin
bonaire_mc.bin         green_sardine_ce.bin        kaveri_sdma.bin    navi14_me.bin        picasso_ta.bin           polaris11_rlc.bin     raven_mec.bin         tahiti_rlc.bin            vega10_pfp.bin       vegam_uvd.bin
bonaire_me.bin         green_sardine_dmcub.bin     kaveri_uvd.bin     navi14_mec2.bin      picasso_vcn.bin          polaris11_sdma1.bin   raven_pfp.bin         tahiti_smc.bin            vega10_rlc.bin       vegam_vce.bin
bonaire_mec.bin        green_sardine_me.bin        kaveri_vce.bin     navi14_mec2_wks.bin  pitcairn_ce.bin          polaris11_sdma.bin    raven_rlc.bin         tahiti_uvd.bin            vega10_sdma1.bin     verde_ce.bin
bonaire_pfp.bin        green_sardine_mec2.bin      mullins_ce.bin     navi14_mec.bin       pitcairn_k_smc.bin       polaris11_smc.bin     raven_sdma.bin        tonga_ce.bin              vega10_sdma.bin      verde_k_smc.bin
bonaire_rlc.bin        green_sardine_mec.bin       mullins_me.bin     navi14_mec_wks.bin   pitcairn_mc.bin          polaris11_smc_sk.bin  raven_ta.bin          tonga_k_smc.bin           vega10_smc.bin       verde_mc.bin
bonaire_sdma1.bin      green_sardine_pfp.bin       mullins_mec.bin    navi14_me_wks.bin    pitcairn_me.bin          polaris11_uvd.bin     raven_vcn.bin         tonga_mc.bin              vega10_sos.bin       verde_me.bin
bonaire_sdma.bin       green_sardine_rlc.bin       mullins_pfp.bin    navi14_pfp.bin       pitcairn_pfp.bin         polaris11_vce.bin     renoir_asd.bin        tonga_me.bin              vega10_uvd.bin       verde_pfp.bin
bonaire_smc.bin        green_sardine_sdma.bin      mullins_rlc.bin    navi14_pfp_wks.bin   pitcairn_rlc.bin         polaris12_32_mc.bin   renoir_ce.bin         tonga_mec2.bin            vega10_vce.bin       verde_rlc.bin
bonaire_uvd.bin        green_sardine_ta.bin        mullins_sdma1.bin  navi14_rlc.bin       pitcairn_smc.bin         polaris12_ce_2.bin    renoir_dmcub.bin      tonga_mec.bin             vega12_asd.bin       verde_smc.bin
bonaire_vce.bin        green_sardine_vcn.bin       mullins_sdma.bin   navi14_sdma1.bin     pitcairn_uvd.bin         polaris12_ce.bin      renoir_gpu_info.bin   tonga_pfp.bin             vega12_ce.bin        verde_uvd.bin
carrizo_ce.bin         hainan_ce.bin               mullins_uvd.bin    navi14_sdma.bin      polaris10_ce_2.bin       polaris12_k_mc.bin    renoir_me.bin         tonga_rlc.bin             vega12_gpu_info.bin
carrizo_me.bin         hainan_k_smc.bin            mullins_vce.bin    navi14_smc.bin       polaris10_ce.bin         polaris12_k_smc.bin   renoir_mec2.bin       tonga_sdma1.bin           vega12_me.bin

$ lsinitramfs /boot/initrd.img-5.11.0-27-generic | grep amdgpu
usr/lib/firmware/updates/amdgpu
usr/lib/firmware/updates/amdgpu/aldebaran_mec.bin
usr/lib/firmware/updates/amdgpu/aldebaran_mec2.bin
usr/lib/firmware/updates/amdgpu/aldebaran_rlc.bin
usr/lib/firmware/updates/amdgpu/aldebaran_sdma.bin
usr/lib/firmware/updates/amdgpu/aldebaran_smc.bin
usr/lib/firmware/updates/amdgpu/aldebaran_sos.bin
usr/lib/firmware/updates/amdgpu/aldebaran_ta.bin
usr/lib/firmware/updates/amdgpu/aldebaran_vcn.bin
usr/lib/firmware/updates/amdgpu/arcturus_asd.bin
usr/lib/firmware/updates/amdgpu/arcturus_gpu_info.bin
usr/lib/firmware/updates/amdgpu/arcturus_mec.bin
usr/lib/firmware/updates/amdgpu/arcturus_rlc.bin
usr/lib/firmware/updates/amdgpu/arcturus_sdma.bin
usr/lib/firmware/updates/amdgpu/arcturus_smc.bin
usr/lib/firmware/updates/amdgpu/arcturus_sos.bin
usr/lib/firmware/updates/amdgpu/arcturus_ta.bin
usr/lib/firmware/updates/amdgpu/arcturus_vcn.bin
usr/lib/firmware/updates/amdgpu/banks_k_2_smc.bin
usr/lib/firmware/updates/amdgpu/bonaire_ce.bin
usr/lib/firmware/updates/amdgpu/bonaire_k_smc.bin
usr/lib/firmware/updates/amdgpu/bonaire_mc.bin
usr/lib/firmware/updates/amdgpu/bonaire_me.bin
usr/lib/firmware/updates/amdgpu/bonaire_mec.bin
usr/lib/firmware/updates/amdgpu/bonaire_pfp.bin
usr/lib/firmware/updates/amdgpu/bonaire_rlc.bin
usr/lib/firmware/updates/amdgpu/bonaire_sdma.bin
usr/lib/firmware/updates/amdgpu/bonaire_sdma1.bin
usr/lib/firmware/updates/amdgpu/bonaire_smc.bin
usr/lib/firmware/updates/amdgpu/bonaire_uvd.bin
usr/lib/firmware/updates/amdgpu/bonaire_vce.bin
usr/lib/firmware/updates/amdgpu/carrizo_ce.bin
usr/lib/firmware/updates/amdgpu/carrizo_me.bin
usr/lib/firmware/updates/amdgpu/carrizo_mec.bin
usr/lib/firmware/updates/amdgpu/carrizo_mec2.bin
usr/lib/firmware/updates/amdgpu/carrizo_pfp.bin
usr/lib/firmware/updates/amdgpu/carrizo_rlc.bin
usr/lib/firmware/updates/amdgpu/carrizo_sdma.bin
usr/lib/firmware/updates/amdgpu/carrizo_sdma1.bin
usr/lib/firmware/updates/amdgpu/carrizo_uvd.bin
usr/lib/firmware/updates/amdgpu/carrizo_vce.bin
usr/lib/firmware/updates/amdgpu/dimgrey_cavefish_ce.bin
usr/lib/firmware/updates/amdgpu/dimgrey_cavefish_dmcub.bin
usr/lib/firmware/updates/amdgpu/dimgrey_cavefish_me.bin
usr/lib/firmware/updates/amdgpu/dimgrey_cavefish_mec.bin
usr/lib/firmware/updates/amdgpu/dimgrey_cavefish_mec2.bin
usr/lib/firmware/updates/amdgpu/dimgrey_cavefish_pfp.bin
...
...
usr/lib/firmware/updates/amdgpu/verde_me.bin
usr/lib/firmware/updates/amdgpu/verde_pfp.bin
usr/lib/firmware/updates/amdgpu/verde_rlc.bin
usr/lib/firmware/updates/amdgpu/verde_smc.bin
usr/lib/firmware/updates/amdgpu/verde_uvd.bin
usr/lib/modules/5.11.0-27-generic/updates/dkms/amdgpu.ko
usr/lib/udev/rules.d/70-amdgpu.rules
kentrussell commented 3 years ago

Thanks for confirming that. So something is going wrong with the PSP there, and it could be the VBIOS. You should be able to get a newer one from your point of contact where you got the GPU. If you want to try the other steps first (swapping another GPU into the same slot to make sure that the SBIOS/system is configured correctly, updating the kernel to the latest HWE kernel, updating your SBIOS and enabling "Above 4G decoding"), then you can always try the VBIOS last, depending on how long it takes to get a new one. At least that way we can try to eliminate the remaining causes, since we have the required PSP FW installed in the ramfs image, and it's known to work in the 4.3.1 release. Good luck!

ghostplant commented 3 years ago

Do you know how to check whether the current SBIOS version is OK, and which version to update to?

kentrussell commented 3 years ago

The SBIOS will be the system BIOS, so that'll come from the motherboard manufacturer. You should be able to find that on their support page, or from the point of contact when you obtained the motherboard. For VBIOS (Video BIOS), we don't distribute those through the regular AMD website, so your point-of-contact for the GPU should be able to help there.

ppanchad-amd commented 2 weeks ago

@ghostplant GFX90A is supported in the latest ROCm 6.2. Please create another ticket if you still encounter any issues. Thanks!