Closed by KileXt 2 years ago
Hi, I'm also experiencing a kernel panic with linux-hardened 5.13.13.hardened1-1 on Arch Linux.
System info:
Unfortunately, when the kernel panic occurs, I don't get anything in the systemd logs (journalctl -b -1 has no mention of a kernel panic after booting up with a different kernel). I do have an image of my screen from when the kernel panic occurred:
image of screen when kernel panic occurs
I hope this is enough information to help debug the issue, but if more info is needed, do please let me know.
Hi, I can confirm the issue on Arch. Seeing nothing in the systemd logs makes sense: the kernel panicked before handing control to PID 1. You can see it from the "Kernel panic - not syncing" message.
Switched back to 5.12.19-hardened1-1-hardened and works like a charm. AMD CPU & GPU too.
So I just tested the new kernel in a VM and it booted without problems. I also tried to boot it with CPU set to host-passthrough, with an AMD CPU on the host. It still boots.
Edit: tested on another computer (Intel CPU) and no kernel panic. Maybe it's related to AMD hardware?
If I had to guess, I'd say it is related to the AMD GPU. I could try to confirm that by swapping my GPU for an NVIDIA GTX 970 I have lying around. I will try that ASAP.
I reinstalled my entire arch system since I opened the issue and I still got the kernel panic.
So I tried that, and indeed it booted with the NVIDIA GTX and not with the AMD RX 5700XT. So AMD GPUs seem to be the culprit here.
So I compiled the kernel by hand. On reboot, still a panic. From part 1 of the logs, we can see the culprit is amdgpu. Here is the second part of the kernel logs.
I would like to help more, but my knowledge kind of stops here. I will try downgrading kernels until I find one that works. Then a diff between the two sources might help with troubleshooting.
Edit: I just learned about bisecting bugs. That's neat!
I just disabled amdgpu (modprobe.blacklist=amdgpu) and the kernel started; I was given a login prompt. So now we can be sure it IS related to that driver in some way.
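For anyone else new to bisecting: the idea is a binary search over the commit history between a known-good and a known-bad kernel. Below is a minimal Python sketch of that search, not of git itself. The function name `first_bad_commit` and the `is_bad` callback are hypothetical, and the sketch assumes the first commit is good, the last one is bad, and every commit after the faulty one is also bad.

```python
def first_bad_commit(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is known good, commits[-1] is known bad,
    and that once a commit is bad, all later commits are bad too.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # the faulty commit is at mid or earlier
        else:
            good = mid   # the faulty commit is after mid
    return commits[bad]


# Toy history: commits are numbered 0..99 and the bug appears at commit 37.
history = list(range(100))
print(first_bad_commit(history, lambda commit: commit >= 37))
```

In a real bisect, each `is_bad` call corresponds to building and booting one kernel, so halving the range each time is what keeps the number of test boots small.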
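To double-check that the blacklist parameter was actually passed to the running kernel, the command line can be inspected. A small sketch follows; the helper name `blacklisted_modules` is hypothetical, and it only parses a command-line string in the `modprobe.blacklist=mod1,mod2` form.

```python
def blacklisted_modules(cmdline):
    """Collect module names from modprobe.blacklist= entries in a kernel command line."""
    modules = []
    for token in cmdline.split():
        if token.startswith("modprobe.blacklist="):
            value = token.split("=", 1)[1]
            modules.extend(name for name in value.split(",") if name)
    return modules


# Example command line (illustrative, not taken from the affected machine).
example = "root=/dev/mapper/root rw modprobe.blacklist=amdgpu quiet"
print(blacklisted_modules(example))
```

On a running system, the string to pass in would be the contents of /proc/cmdline.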
I just read your logs @Stephen-Seo and it looks like we are having two different bugs. Stack traces are different.
FWIW, we have required amdgpu firmware built into the kernel and have not experienced this.
Hmm, maybe it's because all the filesystems I use on this system are btrfs instead of ext4? It does mention btrfs in the call stack. This system is currently using btrfs on top of encrypted LUKS on top of an NVMe SSD, if that info helps.
Well my disk is 100% btrfs. I still think we are having two different software bugs.
So my problem is also present on the regular Linux kernel. After bisecting, the faulty commit has been found. linux-hardened has nothing to do with my issue.
btw, thank you for maintaining that kernel branch
@KileXt Can you please share your kernel .config (zcat /proc/config.gz)? I have a 5700XT and compiled linux-hardened 5.13.13.hardened1 but can't reproduce it.
This [https://github.com/nirmoy/linux/commit/302d8a72e43f16505335a75396d77bc6f9705b3b] should fix the issue. The problem is that debugfs_create_file_size() tries to set the file size on a NULL dentry.
Your patch fixed the kernel panic for me, thanks for the patch.
FYI: Your link currently has an extra square bracket appended to the URL. I think this link is what you wanted to link to.
It seems more correct to return a non-zero value in amdgpu_ring.c, as a return value of zero indicates success. When I changed it to a non-zero value, I do get logs about [drm:amdgpu_debugfs_init [amdgpu]] *ERROR* Failed to register debugfs file for rings !, but still without a kernel panic, and things appear to work fine.
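To illustrate the two points above (dereferencing a missing dentry, and returning non-zero on failure), here is a toy Python model. It is not kernel code: `Dentry`, `create_file`, `ring_debugfs_init_unchecked`, and `ring_debugfs_init_checked` are hypothetical stand-ins, `None` plays the role of a NULL dentry, and the error value -1 is arbitrary.

```python
class Dentry:
    """Toy stand-in for a debugfs entry with a file size."""
    def __init__(self, name):
        self.name = name
        self.size = 0


def create_file(name, debugfs_available):
    # Hypothetical helper: returns None (our "NULL") when no entry is created.
    return Dentry(name) if debugfs_available else None


def ring_debugfs_init_unchecked(name, size, debugfs_available):
    entry = create_file(name, debugfs_available)
    entry.size = size  # fails if entry is None, like a NULL dereference
    return 0


def ring_debugfs_init_checked(name, size, debugfs_available):
    entry = create_file(name, debugfs_available)
    if entry is None:
        return -1      # non-zero: report the failure instead of crashing
    entry.size = size
    return 0           # zero means success


print(ring_debugfs_init_checked("ring_gfx", 4096, debugfs_available=False))
```

With the check in place, the caller sees a non-zero return value and can log an error, which matches the behaviour described above: an error message in the logs but no panic.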
Thank you all for helping to debug this issue!
Do you know whether it has been reported to upstream maintainers?
@tsautereau-anssi yes, upstream is aware and working on it: https://gitlab.freedesktop.org/drm/amd/-/issues/1686#note_1052168
@anthraxx: This was the same issue that I talked to you about, and it's fixed in kernel 5.14.8 (this comment suggests that it was fixed in 5.14.7, but I didn't get around to testing 5.14.7 on the affected machine).
I can confirm the bug is fixed, I'm running 5.14.9-hardened1-1-hardened without any problem.
Just like 0xFunKy, I also am running 5.14.9.hardened1-1 with no issues. Perhaps it's safe to close this issue?
Hi, I have been running Arch Linux with linux-hardened for a few months now. Since yesterday's update I am getting a kernel panic; it is solved by using another kernel. Here is what I can gather from journalctl:
Aug 28 16:10:33 ryzen kernel: #PF: error_code(0x0000) - not-present page
Aug 28 16:10:33 ryzen kernel: #PF: supervisor read access in kernel mode
Aug 28 16:10:33 ryzen kernel: scsi host14: usb-storage 1-4:1.0
Aug 28 16:10:33 ryzen kernel: BUG: kernel NULL pointer dereference, address: 000000000000002f
Aug 28 16:10:33 ryzen kernel: [drm] Initialized amdgpu 3.41.0 20150101 for 0000:24:00.0 on minor 0
Aug 28 16:10:33 ryzen kernel: usb-storage 1-4:1.0: USB Mass Storage device detected
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: Using BACO for runtime pm
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: amdgpu: ring gfx0.0.0 uses VM inv eng 0 on hub 0
Aug 28 16:10:33 ryzen mtp-probe[992]: bus: 1, device: 3 was not an MTP device
Aug 28 16:10:33 ryzen mtp-probe[991]: bus: 1, device: 4 was not an MTP device
Aug 28 16:10:33 ryzen mtp-probe[990]: bus: 1, device: 2 was not an MTP device
Aug 28 16:10:33 ryzen mtp-probe[992]: checking bus 1, device 3: "/sys/devices/pci0000:00/0000:00:01.3/0000:03:00.0/usb1/1-5"
Aug 28 16:10:33 ryzen mtp-probe[991]: checking bus 1, device 4: "/sys/devices/pci0000:00/0000:00:01.3/0000:03:00.0/usb1/1-6"
Aug 28 16:10:33 ryzen mtp-probe[990]: checking bus 1, device 2: "/sys/devices/pci0000:00/0000:00:01.3/0000:03:00.0/usb1/1-4"
Aug 28 16:10:33 ryzen kernel: amdgpu 0000:24:00.0: [drm] fb0: amdgpudrmfb frame buffer device
Aug 28 16:10:33 ryzen kernel: usb 1-9: New USB device strings: Mfr=0, Product=0, SerialNumber=0
Aug 28 16:10:33 ryzen kernel: usb 1-9: New USB device found, idVendor=8087, idProduct=0aa7, bcdDevice= 0.01
Aug 28 16:10:33 ryzen kernel: Console: switching to colour frame buffer device 160x45
I am not sure what debug code can be useful. Here is my hardware:
Thanks for your time.
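Since the interesting lines in a journal excerpt like the one above are easy to miss among the driver and USB messages, a small filter can help when preparing a report. This is only a sketch: the function name `kernel_fault_lines` and the marker list are assumptions, and the sample text is a few lines copied from the excerpt.

```python
# Substrings assumed to indicate a kernel fault; extend as needed.
FAULT_MARKERS = ("BUG:", "#PF:", "Kernel panic")


def kernel_fault_lines(journal_text):
    """Return kernel log lines that contain one of the fault markers."""
    return [
        line
        for line in journal_text.splitlines()
        if " kernel: " in line and any(marker in line for marker in FAULT_MARKERS)
    ]


sample = """\
Aug 28 16:10:33 ryzen kernel: #PF: error_code(0x0000) - not-present page
Aug 28 16:10:33 ryzen kernel: scsi host14: usb-storage 1-4:1.0
Aug 28 16:10:33 ryzen kernel: BUG: kernel NULL pointer dereference, address: 000000000000002f
Aug 28 16:10:33 ryzen mtp-probe[992]: bus: 1, device: 3 was not an MTP device
"""

for line in kernel_fault_lines(sample):
    print(line)
```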