Kernel Panic on linux-hardened 5.13.13.hardened1-1

KileXt commented 2 years ago

Hi, I have been running Arch Linux with linux-hardened for a few months now. Since yesterday's update I am getting a kernel panic; it goes away when I boot another kernel. Here is what I can gather from journalctl:

Aug 28 16:10:33 ryzen kernel: #PF: error_code(0x0000) - not-present page
Aug 28 16:10:33 ryzen kernel: #PF: supervisor read access in kernel mode
Aug 28 16:10:33 ryzen kernel: scsi host14: usb-storage 1-4:1.0
Aug 28 16:10:33 ryzen kernel: BUG: kernel NULL pointer dereference, address: 000000000000002f
Aug 28 16:10:33 ryzen kernel: [drm] Initialized amdgpu 3.41.0 20150101 for 0000:24:00.0 on minor 0
Aug 28 16:10:33 ryzen kernel: usb-storage 1-4:1.0: USB Mass Storage device detected
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: Using BACO for runtime pm
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring gfx0.0.0 uses VM inv eng 0 on hub 0
Aug 28 16:10:33 ryzen mtp-probe[992]: bus: 1, device: 3 was not an MTP device
Aug 28 16:10:33 ryzen mtp-probe[991]: bus: 1, device: 4 was not an MTP device
Aug 28 16:10:33 ryzen mtp-probe[990]: bus: 1, device: 2 was not an MTP device
Aug 28 16:10:33 ryzen mtp-probe[992]: checking bus 1, device 3: "/sys/devices/pci0000:00/0000:00:01.3/0000:03:00.0/usb1/1-5"
Aug 28 16:10:33 ryzen mtp-probe[991]: checking bus 1, device 4: "/sys/devices/pci0000:00/0000:00:01.3/0000:03:00.0/usb1/1-6"
Aug 28 16:10:33 ryzen mtp-probe[990]: checking bus 1, device 2: "/sys/devices/pci0000:00/0000:00:01.3/0000:03:00.0/usb1/1-4"
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: [drm] fb0: amdgpudrmfb frame buffer device
Aug 28 16:10:33 ryzen kernel: usb 1-9: New USB device strings: Mfr=0, Product=0, SerialNumber=0
Aug 28 16:10:33 ryzen kernel: usb 1-9: New USB device found, idVendor=8087, idProduct=0aa7, bcdDevice= 0.01
Aug 28 16:10:33 ryzen kernel: Console: switching to colour frame buffer device 160x45

I am not sure what other debug information would be useful. Here is my hardware:

Asrock Taichi X370
Ryzen 1700
AMD RX 5700XT

Thanks for your time.

Stephen-Seo commented 2 years ago

Hi, I'm also experiencing a kernel panic with linux-hardened 5.13.13.hardened1-1 on Arch Linux.
System info:

CPU: AMD Ryzen 7 5800X
GPU: AMD ATI Radeon RX 6700/6700 XT / 6800M

Unfortunately, when the kernel panic occurs, I don't get anything in the systemd logs (journalctl -b -1 has no mention of kernel panic after booting up with a different kernel). I do have an image of my screen when the kernel panic has occurred:
[image: screen when the kernel panic occurs]

I hope this is enough information to help debug the issue, but if more info is needed, do please let me know.

alexdub37 commented 2 years ago

Hi, I can confirm the issue on Arch. Seeing nothing in the systemd logs makes sense: the kernel panicked before handing control to PID 1. You can see it from the "Kernel panic - not syncing" message.

I switched back to 5.12.19-hardened1-1-hardened and it works like a charm. AMD CPU & GPU here too.

alexdub37 commented 2 years ago

So I just tested the new kernel in a VM and it booted without problems. I also tried to boot it with CPU set to host-passthrough, with an AMD CPU on the host. It still boots.

Edit: tested on another computer (Intel CPU) and got no kernel panic. Maybe it is related to AMD hardware?

KileXt commented 2 years ago

If I had to guess, I'd say it is related to the AMD GPU. I could try to confirm that by swapping my GPU for an NVIDIA GTX 970 I have lying around. I will try that ASAP.

I have reinstalled my entire Arch system since I opened the issue and I still get the kernel panic.

KileXt commented 2 years ago

If I had to guess, I'd say it is related to the AMD GPU. I could try to confirm that by swapping my GPU for an NVIDIA GTX 970 I have lying around. I will try that ASAP.

So I tried that, and indeed it booted with the NVIDIA GTX 970 but not with the AMD RX 5700 XT. So AMD GPUs seem to be the culprit here.

alexdub37 commented 2 years ago

So I compiled the kernel by hand. On reboot, it still panics. From part 1 of the logs we can see the culprit is amdgpu. Here is the second part of the kernel logs.

I would like to help more, but my knowledge kinda stops here. I will try downgrading kernels until I find one that works. Then a diff between the two sources might help with troubleshooting.

Edit: I just learned about bisecting bugs. That's neat!
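For anyone else new to bisecting: the idea is a binary search between a kernel known to boot and one known to panic. Here is a minimal sketch in C, assuming a linear history indexed by number. `first_bad_commit`, `is_bad_fn`, and `toy_is_bad` are hypothetical names made up for illustration; in practice `git bisect` does the bookkeeping, and the "test" is building and booting the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical test hook: returns true if the build at this index panics. */
typedef bool (*is_bad_fn)(size_t index);

/*
 * Binary search over a linear history.
 * Precondition: index `good` is known to boot, index `bad` is known to
 * panic, and good < bad. Returns the index of the first bad commit.
 */
size_t first_bad_commit(size_t good, size_t bad, is_bad_fn is_bad)
{
    assert(good < bad);

    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (is_bad(mid))
            bad = mid;   /* the breakage happened at or before mid */
        else
            good = mid;  /* the breakage happened after mid */
    }
    return bad;
}

/* Toy stand-in for "build, boot, look for a panic": commit 37 broke things. */
bool toy_is_bad(size_t index)
{
    return index >= 37;
}
```

Calling `first_bad_commit(0, 100, toy_is_bad)` narrows 100 candidate commits down to one in about seven test boots, which is why bisecting beats downgrading one kernel at a time.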

alexdub37 commented 2 years ago

I just disabled amdgpu (modprobe.blacklist=amdgpu) and the kernel started; I was given a login prompt. So now we can be sure it IS related to that driver in some way.

alexdub37 commented 2 years ago

I might be wrong but this looks like the exact same issue.

alexdub37 commented 2 years ago

I just read your logs @Stephen-Seo and it looks like we are having two different bugs. Stack traces are different.

beaglesnuf commented 2 years ago

FWIW, we have required amdgpu firmware built into the kernel and have not experienced this.

Stephen-Seo commented 2 years ago

I just read your logs @Stephen-Seo and it looks like we are having two different bugs. Stack traces are different.

Hmm, maybe it's because all the filesystems I use on this system are btrfs instead of ext4? It does mention btrfs in the call stack. This system currently uses btrfs on top of encrypted LUKS on top of an NVMe SSD, if that info helps.

alexdub37 commented 2 years ago

I just read your logs @Stephen-Seo and it looks like we are having two different bugs. Stack traces are different.

Hmm, maybe it's because all the filesystems I use on this system are btrfs instead of ext4? It does mention btrfs in the call stack. This system currently uses btrfs on top of encrypted LUKS on top of an NVMe SSD, if that info helps.

Well my disk is 100% btrfs. I still think we are having two different software bugs.

alexdub37 commented 2 years ago

So my problem is also present on the regular Linux kernel. After bisecting, I found the faulty commit. linux-hardened has nothing to do with my issue.

By the way, thank you for maintaining that kernel branch.

nirmoy commented 2 years ago

@KileXt Can you please share your kernel .config (zcat /proc/config.gz)? I have a 5700 XT and compiled linux-hardened 5.13.13.hardened1, but I can't reproduce it.

nirmoy commented 2 years ago

This [https://github.com/nirmoy/linux/commit/302d8a72e43f16505335a75396d77bc6f9705b3b] should fix the issue. The problem is that debugfs_create_file_size() is trying to set the file size on a NULL dentry.

Stephen-Seo commented 2 years ago

This [nirmoy/linux@302d8a7 should fix the issue. The problem is that debugfs_create_file_size() is trying to set the file size on a NULL dentry.

Your patch fixed the kernel panic for me, thanks for the patch.

FYI: Your link currently has an extra square-bracket appended to the url. I think this link is what you wanted to link to.

Stephen-Seo commented 2 years ago

This [nirmoy/linux@302d8a7 should fix the issue. The problem is that debugfs_create_file_size() is trying to set the file size on a NULL dentry.

It seems more correct to return a non-zero value in amdgpu_ring.c, as a return value of zero indicates success. When I changed it to a non-zero value, I got logs about "[drm:amdgpu_debugfs_init [amdgpu]] *ERROR* Failed to register debugfs file for rings !", but there was still no kernel panic, and things appear to work fine.
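To illustrate the point about return values, here is a minimal userspace sketch, not the actual amdgpu or debugfs code. All `mock_*` names and structs are hypothetical stand-ins, and `-ENOMEM` is just an example error code chosen for illustration. It assumes the file-creation helper hands back NULL when the file cannot be created.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's inode and dentry; not the real structs. */
struct mock_inode  { long long size; };
struct mock_dentry { struct mock_inode *inode; };

/* Mock creation helper: yields NULL when debugfs is unavailable (assumption). */
struct mock_dentry *mock_create_file(struct mock_dentry *slot, int debugfs_available)
{
    return debugfs_available ? slot : NULL;
}

/* Returns 0 on success and a negative errno-style value on failure. */
int mock_ring_debugfs_init(struct mock_dentry *slot, int debugfs_available,
                           long long size)
{
    struct mock_dentry *ent;

    assert(slot != NULL && slot->inode != NULL); /* the fixture itself must be valid */

    ent = mock_create_file(slot, debugfs_available);
    if (ent == NULL)
        return -ENOMEM;      /* non-zero: tell the caller nothing was registered */

    ent->inode->size = size; /* only reached with a valid dentry */
    return 0;
}

/* The caller logs the failure and carries on: a missing debug file is not fatal. */
int mock_driver_init(struct mock_dentry *slot, int debugfs_available)
{
    if (mock_ring_debugfs_init(slot, debugfs_available, 4096) != 0)
        fprintf(stderr, "failed to register debugfs file for rings\n");
    return 0;
}
```

With the check in place the size is never written through a NULL pointer, and the non-zero return lets the caller log the failure instead of silently treating it as success, which matches the behaviour I saw after my change.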

tsautereau-anssi commented 2 years ago

Thank you all for helping to debug this issue!

Do you know whether it has been reported to upstream maintainers?

anthraxx commented 2 years ago

@tsautereau-anssi yes, upstream is aware and working on it: https://gitlab.freedesktop.org/drm/amd/-/issues/1686#note_1052168

hardfalcon commented 2 years ago

@anthraxx: This was the same issue that I talked to you about, and it's fixed in kernel 5.14.8 (this comment suggests that it was fixed in 5.14.7, but I didn't get around to testing 5.14.7 on the affected machine).

alexdub37 commented 2 years ago

I can confirm the bug is fixed, I'm running 5.14.9-hardened1-1-hardened without any problem.

Stephen-Seo commented 2 years ago

Just like 0xFunKy, I am also running 5.14.9.hardened1-1 with no issues. Perhaps it's safe to close this issue?

anthraxx / linux-hardened

Kernel Panic on linux-hardened 5.13.13.hardened1-1 #63