QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Missing amdgpu firmware for AMD Navi GPUs #5416

Open JarrahG opened 4 years ago

JarrahG commented 4 years ago

Qubes OS version:

R4.0 with all updates

Affected component(s) or functionality:


Steps to reproduce the behavior:

Attempt to boot with an AMD RX 5700 XT (likely the standard 5700 as well)

Navi files missing from /lib/firmware/amdgpu

Expected or desired behavior:

Firmware files found. System boots.

Actual behavior:

No firmware files found. System boot hangs at:

Attempting to switch root

General notes:

System was previously installed with an older GTX 780. Still works fine using this GPU.


I have consulted the following relevant documentation:

I couldn't find any documentation on this issue.

I am aware of the following related, non-duplicate issues:

https://github.com/QubesOS/qubes-issues/issues/3796 This is likely the same issue, but for a different GPU.

Thanks.

JarrahG commented 4 years ago

I have tested the Navi firmware files from Fedora 30. These result in exactly the same issue, so another part of the system must be unable to support the new card.

Likely candidates:

JarrahG commented 4 years ago

On further analysis, setting up dom0 to output to Xen's serial console allowed me to work out exactly what is happening. The following lines of output occur:

amdgpu [...]: (-14) failed to allocate kernel bo 
amdgpu [...]: failed to create kernel buffer for firmware.
amdgpu [...]: amdgpu_device_ip_init failed
amdgpu [...]: Fatal error during GPU init.

The first line was truncated on screen.
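
For reference, the `(-14)` in the first log line is a negated Linux errno value. A minimal Python sketch for decoding such codes follows; the helper name `decode_kernel_errno` is made up for illustration and is not part of any kernel or Qubes tooling.

```python
import errno
import os

def decode_kernel_errno(code):
    """Map a negative kernel-style return code to (symbolic name, description)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)

# The code reported by amdgpu in the log above.
name, text = decode_kernel_errno(-14)
print(name, "-", text)
```

Errno 14 is EFAULT ("Bad address"), not ENOMEM (12), so the driver is not reporting a plain out-of-memory condition here.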

The only location this particular error message can come from is https://github.com/torvalds/linux/blob/574cc4539762561d96b456dbc0544d8898bd4c6e/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c.

As mentioned in my previous comment, I still have the Fedora 30 Navi firmware installed. I have checked against the linux-firmware repository and they are the same files. This makes me believe #3796 is only part of the problem for this card.

Dom0 has 1-6G allocated, so it seems unlikely this is an OOM problem.

Any help would be much appreciated.

JarrahG commented 4 years ago

I'm now quite confident this is a kernel/amdgpu issue, though maybe one exacerbated by Xen.

I've attempted booting with pinned dom0 memory, just to make sure dom0 has enough memory to allocate from early on. This has been pinned at 6GiB (6128M). No change.

I also tried IOMMU=soft and increasing swiotlb. No change.

Through repeated boots, I have captured further output which looks like Xen is blocking PCI device writes:

pciback [pci device node] Driver tried to write to a read-only configuration space field at offset 0x72, size 2. This may be harmless, but if you have problems with your device.
1) See permissive attribute in sysfs
2) report problems to the xen-devel mailing list.

This exact error is repeated twice for two different devices.

The result on serial console is different to normal output. Serial:

Normal:

Not sure where to go from here. Searching for the above error messages turns up little besides the initial commit of the code that produces them.

marmarek commented 4 years ago

The pciback message is about devices assigned to a VM, most likely your network card. It is unrelated to the GPU issue.

Can you try to boot exactly the same kernel but without Xen? If you use grub, edit /boot/grub2/grub.cfg, comment out the multiboot2 xen.gz line, and change the following two module2 lines to linux and initrd respectively.
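
The edit described above can be sketched as a small Python transformation over one menu entry. This is only an illustration: the function name `boot_without_xen` and the sample entry (file names, arguments) are hypothetical, and a real grub.cfg entry may carry extra tokens on these lines that need manual review.

```python
def boot_without_xen(entry_lines):
    """Rewrite one GRUB menu entry so the dom0 kernel boots without Xen.

    Comments out the multiboot2 line, then turns the first following
    module2 line into `linux` and the second into `initrd`.
    """
    replacements = iter(["linux", "initrd"])
    out = []
    for line in entry_lines:
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        if stripped.startswith("multiboot2"):
            out.append(f"{indent}# {stripped}")
        elif stripped.startswith("module2"):
            keyword = next(replacements, None)
            if keyword is None:
                out.append(line)  # unexpected extra module2: leave untouched
            else:
                out.append(indent + keyword + stripped[len("module2"):])
        else:
            out.append(line)
    return out

# Hypothetical entry; file names and arguments are placeholders.
sample = [
    "\tmultiboot2 /xen.gz placeholder",
    "\tmodule2 /vmlinuz-example root=/dev/mapper/example ro",
    "\tmodule2 /initramfs-example.img",
]
for line in boot_without_xen(sample):
    print(line)
```

Working on a copy of the entry, or editing it temporarily from the GRUB menu, keeps the normal Xen entry available for the next boot.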

JarrahG commented 4 years ago

Good idea. I hadn't thought of removing Xen from the test.

The system boots perfectly into a usable desktop without Xen. My boot line was all but stock for the test, using kernel-latest (5.3.7).

JarrahG commented 4 years ago

Testing further on the firmware load issue, I've tried all values of amdgpu.fw_load_type. With it set to 0 (direct), dom0 alone (no Xen) does not boot. The screen turns off, but the system is otherwise responsive. For example, the Ctrl-Alt-Del keyboard combination reboots the system. Both other modes boot dom0 fine.

When adding Xen and using the direct mode that caused issues above, I get the following error:

[drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block failed

The other two modes result in the kernel panic seen above.

githubhun1 commented 4 years ago

It might be related to an issue I had where the hainan firmware files were missing when I used the amdgpu kernel module. Everything worked with the radeon driver, but when I switched to amdgpu the module failed with missing firmware files. The solution was to copy the contents of /usr/lib/firmware/radeon to /usr/lib/firmware/amdgpu. It seemed that the older firmware files, which are only experimentally supported by the amdgpu module, were not officially included in the /usr/lib/firmware/amdgpu folder. If this helps, you will need to run dracut -f and then install the new initrd in grub.

JarrahG commented 4 years ago

Thanks for the idea. Sadly, this GPU is too new; it was never supported by the radeon driver, so there's no firmware there for it. I've just upgraded to dom0-current-testing (kernel 5.3.8-1 and xen 4.8.5-11) with the new linux-firmware package above. It does include the new Navi firmware images, but the same issue loading the firmware occurs on boot.

Firmware load error: [image: AMDGPU Error]
Kernel panic call trace: [image: Calltrace]

JarrahG commented 4 years ago

I've built and booted (not yet installed) an R4.1 image (Xen 4.12, kernel-latest 5.3.11). This image boots perfectly fine with the RX 5700 XT. As this seems to be a Xen issue that is already solved in the latest version, I might recommend changing the milestone to R4.1.

JarrahG commented 4 years ago

This issue now seems to be resolved on Qubes R4.0 using Kernel latest 5.6.4-1.

JarrahG commented 4 years ago

As discussed here (https://groups.google.com/forum/#!topic/qubes-users/gq7e-4PTxns), this issue is back. The exact same kernel error is occurring again, seemingly with both kernel-latest 5.6.4 (previously working) and 5.6.13.

[   18.511751] [drm] Found VCN firmware Version ENC: 1.7 DEC: 4 VEP: 0 Revision: 13
[   18.512574] [drm] PSP loading VCN firmware
[   18.513857] amdgpu 0000:0a:00.0: (-14) failed to allocate kernel bo
[   18.514389] amdgpu 0000:0a:00.0: failed to create kernel buffer for firmware.fw_buf
[   18.514865] amdgpu 0000:0a:00.0: amdgpu_device_ip_init failed
[   18.515371] amdgpu 0000:0a:00.0: Fatal error during GPU init
[   18.515889] [drm] amdgpu: finishing device.

I've worked out that the system will boot with both this and my old GPU installed, which helps debugging.

JarrahG commented 4 years ago

Issue seems to again be resolved in both 5.7 and 5.8 kernels.

iblue commented 3 years ago

The issue is back.

I am (not) running an AMD RX 5700 on kernel 5.11.4-1.fc25. The display freezes as soon as the amdgpu module is loaded (otherwise the system is responsive and can be rebooted, for example).

I booted with rd.blacklist=amdgpu modprobe.blacklist=amdgpu into a tty and ran modprobe amdgpu; dmesg > dmesg.log, which freezes the display, then rebooted to get the following.

[  119.018714] AMD-Vi: AMD IOMMUv2 functionality not available on this system
[  119.121020] [drm] amdgpu kernel modesetting enabled.
[  119.121155] amdgpu: Ignoring ACPI CRAT on non-APU system
[  119.121159] Virtual CRAT table created for CPU
[  119.121165] amdgpu: Topology: Add CPU node
[  119.121380] amdgpu 0000:0a:00.0: vgaarb: deactivate vga console
[  119.122082] Console: switching to colour dummy device 80x25
[  119.122172] xen: registering gsi 54 triggering 0 polarity 1
[  119.122178] Already setup the GSI :54
[  119.122183] [drm] initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1043:0x04E4 0xC4).
[  119.122185] amdgpu 0000:0a:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[  119.122231] [drm] register mmio base: 0xFC900000
[  119.122232] [drm] register mmio size: 524288
[  119.123627] [drm] add ip block number 0 <nv_common>
[  119.123629] [drm] add ip block number 1 <gmc_v10_0>
[  119.123630] [drm] add ip block number 2 <navi10_ih>
[  119.123631] [drm] add ip block number 3 <psp>
[  119.123631] [drm] add ip block number 4 <smu>
[  119.123633] [drm] add ip block number 5 <dm>
[  119.123634] [drm] add ip block number 6 <gfx_v10_0>
[  119.123635] [drm] add ip block number 7 <sdma_v5_0>
[  119.123635] [drm] add ip block number 8 <vcn_v2_0>
[  119.123636] [drm] add ip block number 9 <jpeg_v2_0>
[  119.123659] amdgpu 0000:0a:00.0: No more image in the PCI ROM
[  119.125114] amdgpu 0000:0a:00.0: amdgpu: Fetched VBIOS from ROM BAR
[  119.125116] amdgpu: ATOM BIOS: 115-D182PI0-100
[  119.125121] [drm] VCN decode is enabled in VM mode
[  119.125121] [drm] VCN encode is enabled in VM mode
[  119.125122] [drm] JPEG decode is enabled in VM mode
[  119.125143] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[  119.125148] amdgpu 0000:0a:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[  119.125150] amdgpu 0000:0a:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[  119.125151] amdgpu 0000:0a:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[  119.125156] [drm] Detected VRAM RAM=8176M, BAR=256M
[  119.125156] [drm] RAM width 256bits GDDR6
[  119.125199] [TTM] Zone  kernel: Available graphics memory: 1990664 KiB
[  119.141821] [drm] amdgpu: 8176M of VRAM memory ready
[  119.141825] [drm] amdgpu: 2916M of GTT memory ready.
[  119.141826] [drm] GART: num cpu pages 131072, num gpu pages 131072
[  119.141959] [drm] PCIE GART of 512M enabled (table at 0x0000008000900000).
[  119.170184] swiotlb_tbl_map_single: 8 callbacks suppressed
[  119.170186] amdgpu 0000:0a:00.0: swiotlb buffer is full (sz: 1048576 bytes), total 32768 (slots), used 0 (slots)
[  119.170241] amdgpu 0000:0a:00.0: amdgpu: (-14) failed to allocate kernel bo
[  119.170243] amdgpu 0000:0a:00.0: amdgpu: (-14) ring create failed
[  119.170244] amdgpu 0000:0a:00.0: amdgpu: (-14) failed to init kiq ring
[  119.170245] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <gfx_v10_0> failed -14
[  119.170335] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
[  119.170395] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init
[  119.170435] amdgpu: probe of 0000:0a:00.0 failed with error -14

I also tried with kernels 5.11.8, 5.4.107 and 5.11.12 and got the same behavior (no log written). I am still in the process of testing some other old versions (5.5.7, 5.6.16, 5.8.16).

iblue commented 3 years ago

Same behavior with the other versions. I also tried to boot with swiotlb=65536, I always get the same issue. I suspect it may depend on the xen version (currently running 4.8.5-30.fc35).
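
A rough way to read the `swiotlb buffer is full` line from the log above is sketched below in Python. It assumes the mainline kernel constants `IO_TLB_SHIFT = 11` (2 KiB per slot) and `IO_TLB_SEGSIZE = 128` (maximum contiguous slots per mapping); neither value has been checked against the Qubes kernel builds discussed here, and the function name `explain_swiotlb_full` is hypothetical.

```python
import re

IO_TLB_SHIFT = 11      # assumed: each swiotlb slot is 2 KiB (1 << 11)
IO_TLB_SEGSIZE = 128   # assumed: at most 128 contiguous slots per mapping

PATTERN = re.compile(
    r"swiotlb buffer is full \(sz: (\d+) bytes\), "
    r"total (\d+) \(slots\), used (\d+) \(slots\)"
)

def explain_swiotlb_full(line):
    """Turn one 'swiotlb buffer is full' message into byte counts."""
    match = PATTERN.search(line)
    if match is None:
        return None
    size, total, used = map(int, match.groups())
    slot_bytes = 1 << IO_TLB_SHIFT
    max_mapping = IO_TLB_SEGSIZE * slot_bytes
    return {
        "request_bytes": size,
        "pool_bytes": total * slot_bytes,
        "used_bytes": used * slot_bytes,
        "max_single_mapping_bytes": max_mapping,
        "exceeds_segment_limit": size > max_mapping,
    }

line = ("amdgpu 0000:0a:00.0: swiotlb buffer is full (sz: 1048576 bytes), "
        "total 32768 (slots), used 0 (slots)")
print(explain_swiotlb_full(line))
```

Under those assumptions the pool is 32768 x 2 KiB = 64 MiB and entirely unused, while the 1 MiB request is larger than the 256 KiB that a single mapping can cover. If that reading is right, a bigger pool would not help, which would be consistent with swiotlb=65536 making no difference.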

marmarek commented 3 years ago

Do you see earlier message about allocating SWIOTLB? There should be something like:

software IO TLB: mapped [mem 0x000000013ca00000-0x0000000140a00000] (64MB)

iblue commented 3 years ago

> Do you see earlier message about allocating SWIOTLB? There should be something like:
>
> software IO TLB: mapped [mem 0x000000013ca00000-0x0000000140a00000] (64MB)

Yes, I saw such a line and it had 64MB in it; I'm not sure about the addresses. In the meantime I installed the Qubes 4.1 Alpha with the 5.4 kernel and Xen 4.13, and the module loads fine. However, I get some glitches (see picture) and messages (swiotlb buffer is full). I think this is a known issue. After a qubes-dom0-update to Xen 4.14, there were no more glitches or messages (in the 30 minutes or so I had it running). So this is definitely related to the Xen version, not the kernel.

However, on my 4.1 install, no Qube has any network access except for sys-net. I will continue testing tomorrow.

IMG-20210506-WA0013

iblue commented 3 years ago

So I tried various combinations of Xen and kernels. On Xen 4.14 and 4.14.1 with kernels 5.4, 5.10 and 5.11 I get similar behavior. dmesg is full of messages like

[  896.583875] amdgpu 0000:0a:00.0: swiotlb buffer is full (sz: 524288 bytes), total 32768 (slots), used 0 (slots)
[  896.583966] amdgpu 0000:0a:00.0: swiotlb buffer is full (sz: 524288 bytes), total 32768 (slots), used 0 (slots)
[  896.584672] amdgpu 0000:0a:00.0: swiotlb buffer is full (sz: 524288 bytes), total 32768 (slots), used 0 (slots)
[  896.584877] amdgpu 0000:0a:00.0: swiotlb buffer is full (sz: 524288 bytes), total 32768 (slots), used 0 (slots)
[  896.585007] amdgpu 0000:0a:00.0: swiotlb buffer is full (sz: 524288 bytes), total 32768 (slots), used 0 (slots)
[  896.585066] amdgpu 0000:0a:00.0: swiotlb buffer is full (sz: 524288 bytes), total 32768 (slots), used 0 (slots)

(See #5670)

On boot, I have this line:

[    0.507304] software IO TLB: mapped [mem 0x000000011ea00000-0x0000000122a00000] (64MB)
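
As a cross-check, the size of the mapped range can be computed directly from the two addresses in that boot line. A minimal Python sketch (the function name `swiotlb_mapped_mib` is made up for illustration):

```python
import re

def swiotlb_mapped_mib(line):
    """Return the size in MiB of the range in a 'software IO TLB: mapped' line."""
    match = re.search(r"\[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)\]", line)
    if match is None:
        return None
    start, end = (int(value, 16) for value in match.groups())
    return (end - start) / (1024 * 1024)

line = ("[    0.507304] software IO TLB: mapped "
        "[mem 0x000000011ea00000-0x0000000122a00000] (64MB)")
print(swiotlb_mapped_mib(line))  # prints 64.0
```

The range spans 0x4000000 bytes, which matches the reported 64MB and, at 2 KiB per slot, the 32768 total slots shown in the `swiotlb buffer is full` messages.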

From time to time X.org crashes with an OOM error but comes back. I have not had any hard reboots yet, but lots of graphical glitches. On Qubes 4.0 with Xen 4.8, I was not able to get a GUI at all.

tweidinger commented 3 years ago

I'm on 4.1 with 5.10.28-1.1 and am also getting amdgpu 0000:0d:00.0: swiotlb buffer is full (sz: 2097152 bytes), total 32768 (slots), used 0 (slots) with an RX 5500 XT. I can reproduce it by opening multiple videos. I tried increasing min-videoram for the qubes, which had no impact.

Edit: Qubes kernel-latest (5.11.4-1) and a short script run after boot to set min-videoram according to my display seem to fix this. I am not 100% sure whether it is the combination of both or only the kernel upgrade that fixes it for me, but I will test further.

Edit2: Sadly it still occurs, but this time it crashes the X server pretty fast, and only a hard reboot restores a working system.

Edit3: Installing the X11 driver for amdgpu partially solves my issues. I no longer get the full-buffer messages, but some black artifacts remain after some usage. See #5416 for reference.

Edit5: Sadly no real fix, as the X server still crashes under heavy load. I tried passing multiple kernel parameters to force the IOMMU into either hardware or software mode, with no observable change.

What information could I provide for further analysis?

JarrahG commented 1 year ago

This issue seems resolved now. I've been using my 5700 for a few days without issue. Is anyone else in this thread still having the issue?