freebsd / drm-kmod

drm driver for FreeBSD

5.15.25: radeonkms: ring 0 stalled, GPU lockup, GPU reset succeeded, blackout persisted, no response to keyboard input #239

Open grahamperrin opened 1 year ago

grahamperrin commented 1 year ago

Description

Around twenty-five minutes after a crash of Firefox (https://reviews.freebsd.org/P557), I reopened Firefox.

Both displays blacked out.

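If I recall correctly, the blackout began:

  • whilst Firefox started (around 719 tabs, most hidden, across three windows on the display to the left of the notebook)
  • maybe also whilst I used a trackball to move the pointer from the display on the left, to the right.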

I waited for a minute or so; there was no response to keyboard input.

Hard disk drive activity was visible, so I pressed the power button for a graceful shutdown.

I started the computer, viewed logs. The result of a probe whilst drafting this issue: https://bsd-hardware.info/?probe=1a4897cb53.

An extract from /var/log/messages:

2023-02-26 09.00 messages extract.txt

DRM-related lines:

drmn0: ring 0 stalled for more than 10119msec
drmn0: GPU lockup (current fence id 0x0000000000048f10 last fence id 0x0000000000048f23 on ring 0)
drmn0: Saved 610 dwords of commands on ring 0.
drmn0: GPU softreset: 0x00000019
drmn0:   GRBM_STATUS               = 0xA2701CA0
drmn0:   GRBM_STATUS_SE0           = 0x1C000003
drmn0:   GRBM_STATUS_SE1           = 0x00000007
drmn0:   SRBM_STATUS               = 0x200000C0
drmn0:   SRBM_STATUS2              = 0x00000000
drmn0:   R_008674_CP_STALLED_STAT1 = 0x01000000
drmn0:   R_008678_CP_STALLED_STAT2 = 0x00011000
drmn0:   R_00867C_CP_BUSY_STAT     = 0x00068406
drmn0:   R_008680_CP_STAT          = 0x80878647
drmn0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
drmn0: GRBM_SOFT_RESET=0x00007F6B
drmn0: SRBM_SOFT_RESET=0x00000100
drmn0:   GRBM_STATUS               = 0x00003828
drmn0:   GRBM_STATUS_SE0           = 0x00000007
drmn0:   GRBM_STATUS_SE1           = 0x00000007
drmn0:   SRBM_STATUS               = 0x200000C0
drmn0:   SRBM_STATUS2              = 0x00000000
drmn0:   R_008674_CP_STALLED_STAT1 = 0x00000000
drmn0:   R_008678_CP_STALLED_STAT2 = 0x00000000
drmn0:   R_00867C_CP_BUSY_STAT     = 0x00000000
drmn0:   R_008680_CP_STAT          = 0x00000000
drmn0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
drmn0: GPU reset succeeded, trying to resume
[drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[drm] PCIE GART of 1024M enabled (table at 0x0000000000162000).
drmn0: WB enabled
drmn0: fence driver on ring 0 use gpu addr 0x0000000040000c00
drmn0: fence driver on ring 3 use gpu addr 0x0000000040000c0c
drmn0: fence driver on ring 5 use gpu addr 0x0000000000072118
[drm] ring test on 0 succeeded in 1 usecs
[drm] ring test on 3 succeeded in 4 usecs
[drm] ring test on 5 succeeded in 2 usecs
[drm] UVD initialized successfully.
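
For triage, the two key lines above (the stall duration and the fence ids) can be pulled out of a log mechanically. The following is a minimal sketch, not part of the driver or any FreeBSD tool. The function name `summarize_lockups` and the field `pending_fences` are hypothetical, and it assumes that "last fence id" is the most recently emitted fence and "current fence id" the most recently completed one, so their difference approximates the work still queued on the ring.

```python
import re

# Patterns for the two radeon messages quoted in the log extract.
STALL_RE = re.compile(r"ring (\d+) stalled for more than (\d+)msec")
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def summarize_lockups(lines):
    """Return one dict per 'GPU lockup' line found in `lines`."""
    events = []
    stall_ms = None  # most recent stall duration seen, if any
    for line in lines:
        m = STALL_RE.search(line)
        if m:
            stall_ms = int(m.group(2))
            continue
        m = LOCKUP_RE.search(line)
        if m:
            current = int(m.group(1), 16)
            last = int(m.group(2), 16)
            events.append({
                "ring": int(m.group(3)),
                "stall_ms": stall_ms,
                # Assumption: emitted minus completed fences.
                "pending_fences": last - current,
            })
            stall_ms = None
    return events


sample = [
    "drmn0: ring 0 stalled for more than 10119msec",
    "drmn0: GPU lockup (current fence id 0x0000000000048f10 "
    "last fence id 0x0000000000048f23 on ring 0)",
]
events = summarize_lockups(sample)
print(events)
```

Under that assumption, the event above had 19 fences outstanding on ring 0 when the lockup was detected.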

FreeBSD version

% uname -a
FreeBSD mowa219-gjp4-8570p-freebsd 14.0-CURRENT FreeBSD 14.0-CURRENT #33 main-n261014-cd406ac94d8b: Sun Feb 19 01:35:14 GMT 2023     grahamperrin@mowa219-gjp4-8570p-freebsd:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-NODEBUG amd64
% uname -KU
1400081 1400081
% 
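
For readers unfamiliar with the two numbers: `uname -K` reports the kernel's FreeBSD version and `uname -U` the userland's, so equal values mean the two are in sync. A minimal sketch of how such a value can be split, assuming the layout used for recent releases (major version, two-digit minor version, three-digit counter); the function name `decode_freebsd_version` is hypothetical:

```python
def decode_freebsd_version(osreldate):
    """Split a __FreeBSD_version value into (major, minor, counter).

    Assumes the MMmmxxx layout used by recent FreeBSD versions.
    """
    major = osreldate // 100000
    minor = (osreldate // 1000) % 100
    counter = osreldate % 1000
    return major, minor, counter


print(decode_freebsd_version(1400081))  # (14, 0, 81)
```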

PCI info

pciconf -lv

hostb0@pci0:0:0:0: class=0x060000 rev=0x09 hdr=0x00 vendor=0x8086 device=0x0154 subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '3rd Gen Core processor DRAM Controller'
    class      = bridge
    subclass   = HOST-PCI
pcib1@pci0:0:1:0: class=0x060400 rev=0x09 hdr=0x01 vendor=0x8086 device=0x0151 subvendor=0x8086 subdevice=0x2010
    vendor     = 'Intel Corporation'
    device     = 'Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port'
    class      = bridge
    subclass   = PCI-PCI
xhci0@pci0:0:20:0: class=0x0c0330 rev=0x04 hdr=0x00 vendor=0x8086 device=0x1e31 subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '7 Series/C210 Series Chipset Family USB xHCI Host Controller'
    class      = serial bus
    subclass   = USB
none0@pci0:0:22:0: class=0x078000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x1e3a subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '7 Series/C216 Chipset Family MEI Controller'
    class      = simple comms
uart2@pci0:0:22:3: class=0x070002 rev=0x04 hdr=0x00 vendor=0x8086 device=0x1e3d subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '7 Series/C210 Series Chipset Family KT Controller'
    class      = simple comms
    subclass   = UART
em0@pci0:0:25:0: class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x1502 subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '82579LM Gigabit Network Connection (Lewisville)'
    class      = network
    subclass   = ethernet
ehci0@pci0:0:26:0: class=0x0c0320 rev=0x04 hdr=0x00 vendor=0x8086 device=0x1e2d subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '7 Series/C216 Chipset Family USB Enhanced Host Controller'
    class      = serial bus
    subclass   = USB
hdac1@pci0:0:27:0: class=0x040300 rev=0x04 hdr=0x00 vendor=0x8086 device=0x1e20 subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '7 Series/C216 Chipset Family High Definition Audio Controller'
    class      = multimedia
    subclass   = HDA
pcib2@pci0:0:28:0: class=0x060400 rev=0xc4 hdr=0x01 vendor=0x8086 device=0x1e10 subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '7 Series/C216 Chipset Family PCI Express Root Port 1'
    class      = bridge
    subclass   = PCI-PCI
pcib3@pci0:0:28:2: class=0x060400 rev=0xc4 hdr=0x01 vendor=0x8086 device=0x1e14 subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '7 Series/C210 Series Chipset Family PCI Express Root Port 3'
    class      = bridge
    subclass   = PCI-PCI
pcib4@pci0:0:28:3: class=0x060400 rev=0xc4 hdr=0x01 vendor=0x8086 device=0x1e16 subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '7 Series/C216 Chipset Family PCI Express Root Port 4'
    class      = bridge
    subclass   = PCI-PCI
ehci1@pci0:0:29:0: class=0x0c0320 rev=0x04 hdr=0x00 vendor=0x8086 device=0x1e26 subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '7 Series/C216 Chipset Family USB Enhanced Host Controller'
    class      = serial bus
    subclass   = USB
isab0@pci0:0:31:0: class=0x060100 rev=0x04 hdr=0x00 vendor=0x8086 device=0x1e55 subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = 'QM77 Express Chipset LPC Controller'
    class      = bridge
    subclass   = PCI-ISA
ahci0@pci0:0:31:2: class=0x010601 rev=0x04 hdr=0x00 vendor=0x8086 device=0x1e03 subvendor=0x103c subdevice=0x17a7
    vendor     = 'Intel Corporation'
    device     = '7 Series Chipset Family 6-port SATA Controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA
vgapci0@pci0:1:0:0: class=0x030000 rev=0x00 hdr=0x00 vendor=0x1002 device=0x6841 subvendor=0x103c subdevice=0x17a9
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Thames [Radeon HD 7550M/7570M/7650M]'
    class      = display
    subclass   = VGA
hdac0@pci0:1:0:1: class=0x040300 rev=0x00 hdr=0x00 vendor=0x1002 device=0xaa90 subvendor=0x103c subdevice=0x17a9
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Turks HDMI Audio [Radeon HD 6500/6600 / 6700M Series]'
    class      = multimedia
    subclass   = HDA
iwn0@pci0:4:0:0: class=0x028000 rev=0x34 hdr=0x00 vendor=0x8086 device=0x0082 subvendor=0x8086 subdevice=0x1301
    vendor     = 'Intel Corporation'
    device     = 'Centrino Advanced-N 6205 [Taylor Peak]'
    class      = network
%

DRM KMOD version

% pkg query -x '%n %v' 'drm.*kmod'
drm-515-kmod 5.15.25
% pkg info drm-515-kmod | grep -e Installed -e repository
Installed on   : Sun Feb 19 15:51:53 2023 GMT
        repository     : poudriere
% 

To reproduce

The issue has not occurred frequently enough for me to make it reproducible, sorry.

This morning's blackout is, maybe, the third since I began testing drm-515-kmod.

If I recall correctly, the previous blackout was very soon after wake from sleep (moments after the SDDM lock screen appeared). At the time I was too busy/lazy to check logs, so I can't be certain that the cause was the same.

Screenshots

Not applicable.

Additional context

Firmware built from source, https://github.com/freebsd/drm-kmod-firmware/commit/d21284bf7970e87313a9aee4b39142585e0721ca (2023-02-17).

% pwd
/usr/home/grahamperrin/dev/drm-kmod-firmware
% git branch
* master
% git rev-list --max-count=1 HEAD
d21284bf7970e87313a9aee4b39142585e0721ca
% git pull --ff-only
Already up to date.
% zgrep firmware /var/log/messages.0.bz2 | tail -n 8
Feb 26 02:14:52 mowa219-gjp4-8570p-freebsd kernel: iwn0: iwn_read_firmware: ucode rev=0x12a80601
Feb 26 09:28:52 mowa219-gjp4-8570p-freebsd kernel: drmn0: successfully loaded firmware image 'radeon/TURKS_pfp.bin'
Feb 26 09:28:52 mowa219-gjp4-8570p-freebsd kernel: drmn0: successfully loaded firmware image 'radeon/TURKS_me.bin'
Feb 26 09:28:52 mowa219-gjp4-8570p-freebsd kernel: drmn0: successfully loaded firmware image 'radeon/BTC_rlc.bin'
Feb 26 09:28:52 mowa219-gjp4-8570p-freebsd kernel: drmn0: successfully loaded firmware image 'radeon/TURKS_mc.bin'
Feb 26 09:28:52 mowa219-gjp4-8570p-freebsd kernel: drmn0: successfully loaded firmware image 'radeon/TURKS_smc.bin'
Feb 26 09:28:52 mowa219-gjp4-8570p-freebsd kernel: drmn0: successfully loaded firmware image 'radeon/SUMO_uvd.bin'
Feb 26 09:28:52 mowa219-gjp4-8570p-freebsd kernel: iwn0: iwn_read_firmware: ucode rev=0x12a80601
%

grahamperrin commented 1 year ago

… If I recall correctly, the blackout began:

  • whilst Firefox started (around 719 tabs, most hidden, across three windows on the display to the left of the notebook)
  • maybe also whilst I used a trackball to move the pointer from the display on the left, to the right.

Now, reviewing what's in the three windows, I think it more likely that:

– and for copy purposes, I typically aim for something near the pointer that will respond neatly to a double-click, so I guess I moved the pointer towards the address bar and maybe the blackout occurred before I could double-click the 14a267f652a6164d1d8c453ce19424ad7f324b49 part of the URL.

evadot commented 1 year ago

Could you ssh to the machine ?

grahamperrin commented 1 year ago

Good thinking. I didn't try ssh at the time, but given that disk activity was visible, I do strongly suspect that ssh would have worked.

After another blackout occurred, a few weeks ago I reverted to drm-510-kmod.


If I step forward again, what will be most useful (to you) for me to retry/try:

If I can ssh in when symptoms recur, what would you like me to run?

thesunexpress commented 1 year ago

719 tabs?

If on drm-515-kmod, you are on 14-CURRENT, so best to stick to drm-515 instead of master.

grahamperrin commented 1 year ago

Thanks,

… you are on 14-CURRENT …

I already mentioned 14.0-CURRENT, more specifically 1400081, in the opening post.

@evadot will feedback from (packaged) drm-515-kmod be good enough to progress this issue? Or would you prefer me to build from source (master)?

grahamperrin commented 1 year ago

https://github.com/FreeBSD/freebsd-ports/commit/231fddc24bd7780d2d08b63ef16a823e27385002 looks interesting, I'll build from ports.

grahamperrin commented 1 year ago

With drm-515-kmod-5.15.25_3, yesterday at 08:57:

…
drmn0: ring 0 stalled for more than 10276msec
drmn0: GPU lockup (current fence id 0x00000000000769c3 last fence id 0x00000000000769fe on ring 0)
drmn0: failed to get a new IB (-11)
[drm ERROR :radeon_cs_ib_fill] Failed to get ib !
drmn0: Saved 1874 dwords of commands on ring 0.
drmn0: GPU softreset: 0x00000019
drmn0:   GRBM_STATUS               = 0xA2703CA0
drmn0:   GRBM_STATUS_SE0           = 0x1C000007
drmn0:   GRBM_STATUS_SE1           = 0x00000007
drmn0:   SRBM_STATUS               = 0x200000C0
drmn0:   SRBM_STATUS2              = 0x00000000
drmn0:   R_008674_CP_STALLED_STAT1 = 0x01000000
drmn0:   R_008678_CP_STALLED_STAT2 = 0x00011000
drmn0:   R_00867C_CP_BUSY_STAT     = 0x00068406
drmn0:   R_008680_CP_STAT          = 0x80878647
drmn0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
drmn0: GRBM_SOFT_RESET=0x00007F6B
drmn0: SRBM_SOFT_RESET=0x00000100
drmn0:   GRBM_STATUS               = 0x00003828
drmn0:   GRBM_STATUS_SE0           = 0x00000007
drmn0:   GRBM_STATUS_SE1           = 0x00000007
drmn0:   SRBM_STATUS               = 0x200000C0
drmn0:   SRBM_STATUS2              = 0x00000000
drmn0:   R_008674_CP_STALLED_STAT1 = 0x00000000
drmn0:   R_008678_CP_STALLED_STAT2 = 0x00000000
drmn0:   R_00867C_CP_BUSY_STAT     = 0x00000000
drmn0:   R_008680_CP_STAT          = 0x00000000
drmn0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
drmn0: GPU reset succeeded, trying to resume
[drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[drm] PCIE GART of 1024M enabled (table at 0x0000000000162000).
drmn0: WB enabled
drmn0: fence driver on ring 0 use gpu addr 0x0000000040000c00
drmn0: fence driver on ring 3 use gpu addr 0x0000000040000c0c
drmn0: fence driver on ring 5 use gpu addr 0x0000000000072118
[drm] ring test on 0 succeeded in 1 usecs
[drm] ring test on 3 succeeded in 4 usecs
[drm] ring test on 5 succeeded in 2 usecs
[drm] UVD initialized successfully.
[drm] ib test on ring 0 succeeded in 0 usecs
[drm] ib test on ring 3 succeeded in 0 usecs
[drm] ib test on ring 5 succeeded
…
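
The log prints each status register twice: once before the soft reset and once after. A minimal sketch for comparing the two dumps follows. It is an illustration, not a driver facility. The names `split_dumps` and `changed_registers` are hypothetical, and the sketch assumes that the second appearance of a register name belongs to the post-reset dump. The sample uses a subset of the lines quoted above.

```python
import re

# Matches lines such as "drmn0:   GRBM_STATUS               = 0xA2703CA0".
# Lines without spaces around "=" (e.g. GRBM_SOFT_RESET=...) are skipped.
REG_RE = re.compile(r"^\S+:\s+(\w+)\s+= (0x[0-9A-Fa-f]+)$")


def split_dumps(lines):
    """Return (before_reset, after_reset) dicts of register values."""
    before, after = {}, {}
    for line in lines:
        m = REG_RE.match(line.strip())
        if not m:
            continue
        name, value = m.group(1), int(m.group(2), 16)
        # Assumption: a repeated name is the post-reset reading.
        target = after if name in before else before
        target[name] = value
    return before, after


def changed_registers(before, after):
    """Map register name to (before, after) where the value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


sample = [
    "drmn0:   GRBM_STATUS               = 0xA2703CA0",
    "drmn0:   SRBM_STATUS               = 0x200000C0",
    "drmn0:   R_008680_CP_STAT          = 0x80878647",
    "drmn0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57",
    "drmn0: GRBM_SOFT_RESET=0x00007F6B",
    "drmn0:   GRBM_STATUS               = 0x00003828",
    "drmn0:   SRBM_STATUS               = 0x200000C0",
    "drmn0:   R_008680_CP_STAT          = 0x00000000",
    "drmn0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57",
]
before, after = split_dumps(sample)
changed = changed_registers(before, after)
for name, (old, new) in changed.items():
    print(f"{name}: 0x{old:08X} -> 0x{new:08X}")
```

In this subset, the GRBM and CP status values change across the reset, whereas SRBM_STATUS and the DMA status register read the same before and after.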

After the event, 09:02, the result of a probe: https://bsd-hardware.info/?probe=95f2b2f9d6.

09:04:

2023-05-15 09 04

I might have run plasmashell --replace; instead, I chose to restart the computer.

Context (08:00:00–09:07):

messages.txt

mfoacs commented 8 months ago

I'm experiencing the same issue, and managed to reproduce it somewhat consistently.

To Reproduce

Freshly started session on Sway or Hyprland, with swayidle/swaylock in the background:

swayidle -w timeout 300 'swaylock -f -c 000000' timeout 600 'swaymsg "output * power off"' resume 'swaymsg "output * power on"' before-sleep 'swaylock -f -c 000000 --effect-blur 7x5'

FreeBSD version

FreeBSD hawkeye.stormriders.local 14.0-RELEASE-p3 FreeBSD 14.0-RELEASE-p3 #0: Mon Dec 11 04:56:01 UTC 2023     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
❯ uname -KU
1400097 1400097

PCI Info

❯ pciconf -lv
hostb0@pci0:0:0:0:  class=0x060000 rev=0x06 hdr=0x00 vendor=0x8086 device=0x0c00 subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = '4th Gen Core Processor DRAM Controller'
    class      = bridge
    subclass   = HOST-PCI
pcib1@pci0:0:1:0:   class=0x060400 rev=0x06 hdr=0x01 vendor=0x8086 device=0x0c01 subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = 'Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller'
    class      = bridge
    subclass   = PCI-PCI
pcib4@pci0:0:1:1:   class=0x060400 rev=0x06 hdr=0x01 vendor=0x8086 device=0x0c05 subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = 'Xeon E3-1200 v3/4th Gen Core Processor PCI Express x8 Controller'
    class      = bridge
    subclass   = PCI-PCI
xhci0@pci0:0:20:0:  class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x8086 device=0x8cb1 subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = '9 Series Chipset Family USB xHCI Controller'
    class      = serial bus
    subclass   = USB
none0@pci0:0:22:0:  class=0x078000 rev=0x00 hdr=0x00 vendor=0x8086 device=0x8cba subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = '9 Series Chipset Family ME Interface'
    class      = simple comms
em0@pci0:0:25:0:    class=0x020000 rev=0x00 hdr=0x00 vendor=0x8086 device=0x15a1 subvendor=0x1043 subdevice=0x85c4
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection (2) I218-V'
    class      = network
    subclass   = ethernet
ehci0@pci0:0:26:0:  class=0x0c0320 rev=0x00 hdr=0x00 vendor=0x8086 device=0x8cad subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = '9 Series Chipset Family USB EHCI Controller'
    class      = serial bus
    subclass   = USB
hdac1@pci0:0:27:0:  class=0x040300 rev=0x00 hdr=0x00 vendor=0x8086 device=0x8ca0 subvendor=0x1043 subdevice=0x860b
    vendor     = 'Intel Corporation'
    device     = '9 Series Chipset Family HD Audio Controller'
    class      = multimedia
    subclass   = HDA
pcib5@pci0:0:28:0:  class=0x060400 rev=0xd0 hdr=0x01 vendor=0x8086 device=0x8c90 subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = '9 Series Chipset Family PCI Express Root Port 1'
    class      = bridge
    subclass   = PCI-PCI
pcib6@pci0:0:28:3:  class=0x060401 rev=0xd0 hdr=0x01 vendor=0x8086 device=0x244e subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = '82801 PCI Bridge'
    class      = bridge
    subclass   = PCI-PCI
ehci1@pci0:0:29:0:  class=0x0c0320 rev=0x00 hdr=0x00 vendor=0x8086 device=0x8ca6 subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = '9 Series Chipset Family USB EHCI Controller'
    class      = serial bus
    subclass   = USB
isab0@pci0:0:31:0:  class=0x060100 rev=0x00 hdr=0x00 vendor=0x8086 device=0x8cc4 subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = 'Z97 Chipset LPC Controller'
    class      = bridge
    subclass   = PCI-ISA
ahci0@pci0:0:31:2:  class=0x010601 rev=0x00 hdr=0x00 vendor=0x8086 device=0x8c82 subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = '9 Series Chipset Family SATA Controller [AHCI Mode]'
    class      = mass storage
    subclass   = SATA
ichsmb0@pci0:0:31:3:    class=0x0c0500 rev=0x00 hdr=0x00 vendor=0x8086 device=0x8ca2 subvendor=0x1043 subdevice=0x8534
    vendor     = 'Intel Corporation'
    device     = '9 Series Chipset Family SMBus Controller'
    class      = serial bus
    subclass   = SMBus
pcib2@pci0:1:0:0:   class=0x060400 rev=0xc7 hdr=0x01 vendor=0x1002 device=0x1478 subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Navi 10 XL Upstream Port of PCI Express Switch'
    class      = bridge
    subclass   = PCI-PCI
pcib3@pci0:2:0:0:   class=0x060400 rev=0x00 hdr=0x01 vendor=0x1002 device=0x1479 subvendor=0x1002 subdevice=0x1479
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Navi 10 XL Downstream Port of PCI Express Switch'
    class      = bridge
    subclass   = PCI-PCI
vgapci0@pci0:3:0:0: class=0x030000 rev=0xc7 hdr=0x00 vendor=0x1002 device=0x73ff subvendor=0x1043 subdevice=0x05d5
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Navi 23 [Radeon RX 6600/6600 XT/6600M]'
    class      = display
    subclass   = VGA
hdac0@pci0:3:0:1:   class=0x040300 rev=0x00 hdr=0x00 vendor=0x1002 device=0xab28 subvendor=0x1002 subdevice=0xab28
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Navi 21/23 HDMI/DP Audio Controller'
    class      = multimedia
    subclass   = HDA
rtwn0@pci0:4:0:0:   class=0x028000 rev=0x01 hdr=0x00 vendor=0x10ec device=0x8179 subvendor=0x10ec subdevice=0x8197
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8188EE Wireless Network Adapter'
    class      = network
pcib7@pci0:6:0:0:   class=0x060401 rev=0x04 hdr=0x01 vendor=0x1b21 device=0x1080 subvendor=0x1043 subdevice=0x8489
    vendor     = 'ASMedia Technology Inc.'
    device     = 'ASM1083/1085 PCIe to PCI Bridge'
    class      = bridge
    subclass   = PCI-PCI

DRM Kmod

❯ sudo pkg query -x '%n %v' 'drm.*kmod'
drm-515-kmod 5.15.118_3
drm-kmod 20220907_1