FreeBSDDesktop / DEPRECATED-freebsd-base-graphics

Fork of FreeBSD's base repository to work on graphics-stack-related projects

Panic after loading i915kms: Hang on render ring -> page fault in reset_common_ring #163

Closed cperciva closed 5 years ago

cperciva commented 7 years ago

Repeatably, about 5 seconds after kldload i915kms (transcribing, apologies for eliding unhelpful boilerplate):

<6>[drm] GPU HANG: ecode 9:0:0xfffffffe, reason: Hang on render ring, action: reset
... blah blah blah ...

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0xa8
...
Stopped at reset_common_ring+0xc9
...
reset_common_ring() at reset_common_ring+0xc9
i915_gem_reset() at i915_gem_reset+0x17e
i915_reset() at i915_reset+0xa8
i915_handle_error() at i915_handle_error+0x7ae
i915_hangcheck_elapsed() at i915_hangcheck_elapsed+0x475
...

This is on a Core i5-7200U laptop using the latest drm-next code. Any ideas?

markjdb commented 7 years ago

Is this a regression?

cperciva commented 7 years ago

I can neither confirm nor deny. This is the first time I've tried drm-next on this hardware.

cperciva commented 7 years ago

Disabling hangcheck by setting i915.enable_hangcheck=0 in i915_hangcheck_elapsed fixes the panic, with no apparent adverse effects; so maybe there are actually two bugs here:

  1. Resetting the GPU driver doesn't work.

  2. The hangcheck fires when the GPU didn't actually hang.

vishwin commented 7 years ago

Exactly this happens on my i7-5500U (Broadwell) laptop with the latest drm-next as of this writing. It drops me into ddb after a successful modeset from the UEFI framebuffer.

@cperciva , where exactly did you pass that parameter? Certainly not in loader.conf or kenv as loading the module still results in the same panic, and I'm not sure where you got the i915_hangcheck_elapsed from besides the printed output when module loading panics.

cperciva commented 7 years ago

@vishwin That wasn't a parameter, that was me adding a line of code and recompiling.

vishwin commented 7 years ago

@cperciva got it. For anyone else who may be searching and wondering, change the variable in i915_params.c.

cperciva commented 7 years ago

Well, this is weird. This panic and the hang in #165 went away when I switched from building a kernel from the drm-next branch to building a kernel from HEAD and building drm-next-kmod from the ports tree. And I can't see anything at all in the tree which would explain this.

So, uhh... good work guys?

cperciva commented 7 years ago

Ok, I figured out why this problem comes and goes. On my laptop, with the code in the tree, I consistently get this panic when I load i915kms if the laptop isn't plugged in. Running on AC power, no panic.

mattmacy commented 7 years ago

0.o

vishwin commented 7 years ago

I get panics randomly when I am on AC as well, but definitely consistent panics (so far) on battery.

mattmacy commented 7 years ago

What is your hardware? I haven't hit any issues myself, trying to figure out what it corresponds to.

vishwin commented 7 years ago

ThinkPad W550s, Intel i7-5500U (Broadwell) with a headless (Optimus) Nvidia Quadro. The Nvidia card is not used at all, nor is the driver for such loaded.

mattmacy commented 7 years ago

@vishwin Can you paste the output of running 'pciconf -lbVc' as root?

vishwin commented 7 years ago

hostb0@pci0:0:0:0:  class=0x060000 card=0x222317aa chip=0x16048086 rev=0x09 hdr=0x00
    cap 09[e0] = vendor (length 12) Intel cap 0 version 1
vgapci0@pci0:0:2:0: class=0x030000 card=0x222517aa chip=0x16168086 rev=0x09 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xf2000000, size 16777216, enabled
    bar   [18] = type Prefetchable Memory, range 64, base 0xc0000000, size 536870912, enabled
    bar   [20] = type I/O Port, range 32, base 0x4000, size 64, enabled
    cap 05[90] = MSI supports 1 message 
    cap 01[d0] = powerspec 2  supports D0 D3  current D0
    cap 13[a4] = PCI Advanced Features: FLR TP
hdac0@pci0:0:3:0:   class=0x040300 card=0x222317aa chip=0x160c8086 rev=0x09 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xf4230000, size 16384, enabled
    cap 01[50] = powerspec 2  supports D0 D3  current D0
    cap 05[60] = MSI supports 1 message enabled with 1 message
    cap 10[70] = PCI-Express 1 root endpoint max data 128(128) FLR NS
xhci0@pci0:0:20:0:  class=0x0c0330 card=0x222317aa chip=0x9cb18086 rev=0x03 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xf4220000, size 65536, enabled
    cap 01[70] = powerspec 2  supports D0 D3  current D0
    cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message
none0@pci0:0:22:0:  class=0x078000 card=0x222317aa chip=0x9cba8086 rev=0x03 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xf4239000, size 32, enabled
    cap 01[50] = powerspec 3  supports D0 D3  current D0
    cap 05[8c] = MSI supports 1 message, 64 bit 
em0@pci0:0:25:0:    class=0x020000 card=0x222617aa chip=0x15a38086 rev=0x03 hdr=0x00
    bar   [10] = type Memory, range 32, base 0xf4200000, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xf423e000, size 4096, enabled
    bar   [18] = type I/O Port, range 32, base 0x4080, size 32, enabled
    cap 01[c8] = powerspec 2  supports D0 D3  current D0
    cap 05[d0] = MSI supports 1 message, 64 bit enabled with 1 message
    cap 13[e0] = PCI Advanced Features: FLR TP
hdac1@pci0:0:27:0:  class=0x040300 card=0x222317aa chip=0x9ca08086 rev=0x03 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xf4234000, size 16384, enabled
    cap 01[50] = powerspec 3  supports D0 D3  current D0
    cap 05[60] = MSI supports 1 message, 64 bit enabled with 1 message
pcib1@pci0:0:28:0:  class=0x060400 card=0x222317aa chip=0x9c9a8086 rev=0xe3 hdr=0x01
    cap 10[40] = PCI-Express 2 root port max data 128(128)
                 link x1(x1) speed 2.5(5.0) ASPM L0s/L1(L0s/L1)
                 slot 5 power limit 100 mW
    cap 05[80] = MSI supports 1 message 
    cap 0d[90] = PCI Bridge card=0x222317aa
    cap 01[a0] = powerspec 3  supports D0 D3  current D0
    ecap 0000[100] = unknown 0
    ecap 001e[200] = unknown 1
pcib2@pci0:0:28:1:  class=0x060400 card=0x222317aa chip=0x9c948086 rev=0xe3 hdr=0x01
    cap 10[40] = PCI-Express 2 root port max data 128(128)
                 link x1(x1) speed 2.5(5.0) ASPM L1(L0s/L1)
                 slot 2 power limit 100 mW
    cap 05[80] = MSI supports 1 message 
    cap 0d[90] = PCI Bridge card=0x222317aa
    cap 01[a0] = powerspec 3  supports D0 D3  current D0
    ecap 0000[100] = unknown 0
    ecap 001e[200] = unknown 1
pcib3@pci0:0:28:4:  class=0x060400 card=0x222317aa chip=0x9c988086 rev=0xe3 hdr=0x01
    cap 10[40] = PCI-Express 2 root port max data 128(128)
                 link x4(x4) speed 5.0(5.0) ASPM L0s/L1(L0s/L1)
                 slot 4 power limit 250 mW
    cap 05[80] = MSI supports 1 message 
    cap 0d[90] = PCI Bridge card=0x222317aa
    cap 01[a0] = powerspec 3  supports D0 D3  current D0
    ecap 0000[100] = unknown 0
    ecap 001e[200] = unknown 1
ehci0@pci0:0:29:0:  class=0x0c0320 card=0x222317aa chip=0x9ca68086 rev=0x03 hdr=0x00
    bar   [10] = type Memory, range 32, base 0xf423d000, size 1024, enabled
    cap 01[50] = powerspec 3  supports D0 D3  current D0
    cap 0a[58] = EHCI Debug Port at offset 0xa0 in map 0x14
    cap 13[98] = PCI Advanced Features: FLR TP
isab0@pci0:0:31:0:  class=0x060100 card=0x222317aa chip=0x9cc38086 rev=0x03 hdr=0x00
    cap 09[e0] = vendor (length 12) Intel cap 1 version 0
         features: AMT, 4 PCI-e x1 slots
ahci0@pci0:0:31:2:  class=0x010601 card=0x222317aa chip=0x9c838086 rev=0x03 hdr=0x00
    bar   [10] = type I/O Port, range 32, base 0x40a8, size 8, enabled
    bar   [14] = type I/O Port, range 32, base 0x40b4, size 4, enabled
    bar   [18] = type I/O Port, range 32, base 0x40a0, size 8, enabled
    bar   [1c] = type I/O Port, range 32, base 0x40b0, size 4, enabled
    bar   [20] = type I/O Port, range 32, base 0x4060, size 32, enabled
    bar   [24] = type Memory, range 32, base 0xf423c000, size 2048, enabled
    cap 05[80] = MSI supports 1 message enabled with 1 message
    cap 01[70] = powerspec 3  supports D0 D3  current D0
    cap 12[a8] = SATA Index-Data Pair
none1@pci0:0:31:3:  class=0x0c0500 card=0x222317aa chip=0x9ca28086 rev=0x03 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xf4238000, size 256, enabled
    bar   [20] = type I/O Port, range 32, base 0xefa0, size 32, enabled
none2@pci0:0:31:6:  class=0x118000 card=0x222317aa chip=0x9ca48086 rev=0x03 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xf423b000, size 4096, enabled
    cap 01[50] = powerspec 3  supports D0 D3  current D0
    cap 05[80] = MSI supports 1 message 
none3@pci0:2:0:0:   class=0xff0000 card=0x222317aa chip=0x522710ec rev=0x01 hdr=0x00
    bar   [10] = type Memory, range 32, base 0xf4100000, size 4096, enabled
    cap 01[40] = powerspec 3  supports D0 D1 D2 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit 
    cap 10[70] = PCI-Express 2 endpoint max data 128(128) RO
                 link x1(x1) speed 2.5(2.5) ASPM L0s/L1(L0s/L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
    ecap 0003[140] = Serial 1 00000001004ce000
    ecap 0018[150] = LTR 1
    ecap 001e[158] = unknown 1
iwm0@pci0:3:0:0:    class=0x028000 card=0x52108086 chip=0x095b8086 rev=0x59 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xf4000000, size 8192, enabled
    cap 01[c8] = powerspec 3  supports D0 D3  current D0
    cap 05[d0] = MSI supports 1 message, 64 bit enabled with 1 message
    cap 10[40] = PCI-Express 2 endpoint max data 128(128) FLR RO NS
                 link x1(x1) speed 2.5(2.5) ASPM L1(L1)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 0 corrected
    ecap 0003[140] = Serial 1 340286ffff030d90
    ecap 0018[14c] = LTR 1
    ecap 001e[154] = unknown 1
vgapci1@pci0:8:0:0: class=0x030200 card=0x222517aa chip=0x137a10de rev=0xa2 hdr=0x00
    bar   [10] = type Memory, range 32, base 0xf3000000, size 16777216, enabled
    bar   [14] = type Prefetchable Memory, range 64, base 0xe0000000, size 268435456, enabled
    bar   [1c] = type Prefetchable Memory, range 64, base 0xf0000000, size 33554432, enabled
    bar   [24] = type I/O Port, range 32, base 0x3000, size 128, enabled
    cap 01[60] = powerspec 3  supports D0 D3  current D0
    cap 05[68] = MSI supports 1 message, 64 bit 
    cap 10[78] = PCI-Express 2 endpoint max data 128(256) RO NS
                 link x4(x4) speed 5.0(8.0) ASPM L0s/L1(L0s/L1)
    ecap 0002[100] = VC 1 max VC0
    ecap 0018[250] = LTR 1
    ecap 001e[258] = unknown 1
    ecap 0004[128] = Power Budgeting 1
    ecap 000b[600] = Vendor 1 ID 1
    ecap 0019[900] = PCIe Sec 1 lane errors 0

vishwin commented 7 years ago

For reference, this is what the panic looks like now (as of the latest drm-next-kmod in ports):

ardmore dumped core - see /var/crash/vmcore.0

Mon Oct 23 04:08:09 EDT 2017

FreeBSD ardmore 12.0-CURRENT FreeBSD 12.0-CURRENT #1 fcca5326804(master): Mon Oct 23 03:38:35 EDT 2017     root@ardmore:/usr/local/obj/usr/local/src/sys/GENERIC  amd64

panic: page fault

GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd12.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...done.
done.

Unread portion of the kernel message buffer:
<6>[drm] GPU HANG: ecode 8:0:0xfffffffe, reason: Hang on render ring, action: reset
<6>[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<6>[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<6>[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<6>[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<6>[drm] GPU crash dump saved to /sys/class/drm/card0/error
<5>drm/i915: Resetting chip after gpu hang

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0xa8
fault code      = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff847f552c
stack pointer           = 0x28:0xfffffe0233ee9580
frame pointer           = 0x28:0xfffffe0233ee95e0
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process     = 0 (linuxkpi_long_wq_3)
trap number     = 12
panic: page fault
cpuid = 1
time = 1508746026
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0233ee9160
vpanic() at vpanic+0x19c/frame 0xfffffe0233ee91e0
panic() at panic+0x43/frame 0xfffffe0233ee9240
trap_fatal() at trap_fatal+0x352/frame 0xfffffe0233ee9290
trap_pfault() at trap_pfault+0x62/frame 0xfffffe0233ee92f0
trap() at trap+0x2c5/frame 0xfffffe0233ee94b0
calltrap() at calltrap+0x8/frame 0xfffffe0233ee94b0
--- trap 0xc, rip = 0xffffffff847f552c, rsp = 0xfffffe0233ee9580, rbp = 0xfffffe0233ee95e0 ---
reset_common_ring() at reset_common_ring+0x12c/frame 0xfffffe0233ee95e0
i915_gem_reset_engine() at i915_gem_reset_engine+0xef/frame 0xfffffe0233ee9640
i915_gem_reset() at i915_gem_reset+0x62/frame 0xfffffe0233ee9670
i915_reset() at i915_reset+0x162/frame 0xfffffe0233ee96d0
i915_reset_and_wakeup() at i915_reset_and_wakeup+0xc9/frame 0xfffffe0233ee9730
i915_handle_error() at i915_handle_error+0x154/frame 0xfffffe0233ee9830
i915_hangcheck_elapsed() at i915_hangcheck_elapsed+0x654/frame 0xfffffe0233ee9970
linux_work_fn() at linux_work_fn+0xf1/frame 0xfffffe0233ee99e0
taskqueue_run_locked() at taskqueue_run_locked+0x15d/frame 0xfffffe0233ee9a40
taskqueue_thread_loop() at taskqueue_thread_loop+0x88/frame 0xfffffe0233ee9a70
fork_exit() at fork_exit+0x84/frame 0xfffffe0233ee9ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0233ee9ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 12s
Dumping 482 out of 8058 MB:..4%..14%..24%..34%..44%..54%..63%..73%..83%..93%

__curthread () at ./machine/pcpu.h:232
232     __asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) #0  __curthread () at ./machine/pcpu.h:232
#1  doadump (textdump=1) at /usr/local/src/sys/kern/kern_shutdown.c:317
#2  0xffffffff80a6b8d5 in kern_reboot (howto=260)
    at /usr/local/src/sys/kern/kern_shutdown.c:385
#3  0xffffffff80a6bec6 in vpanic (fmt=<optimized out>, ap=0xfffffe0233ee9220)
    at /usr/local/src/sys/kern/kern_shutdown.c:778
#4  0xffffffff80a6bf13 in panic (fmt=<unavailable>)
    at /usr/local/src/sys/kern/kern_shutdown.c:709
#5  0xffffffff80f14d92 in trap_fatal (frame=0xfffffe0233ee94c0, eva=168)
    at /usr/local/src/sys/amd64/amd64/trap.c:799
#6  0xffffffff80f14e02 in trap_pfault (frame=0xfffffe0233ee94c0, usermode=0)
    at /usr/local/src/sys/amd64/amd64/trap.c:653
#7  0xffffffff80f145c5 in trap (frame=0xfffffe0233ee94c0)
    at /usr/local/src/sys/amd64/amd64/trap.c:420
#8  <signal handler called>
#9  0xffffffff847f552c in ?? ()
#10 0xfffffe0233ee95b0 in ?? ()
#11 0xffffffff846f0e29 in intel_crtc_cursor_set (crtc=0xfffffe0002edce20, 
    file=<optimized out>, handle=<optimized out>, width=49139184, 
    height=4294901760)
    at /usr/local/src/sys/dev/drm2/i915/intel_display.c:6479
#12 0xffffffff846e83ff in assert_panel_unlocked (dev_priv=<optimized out>, 
    pipe=<optimized out>)
    at /usr/local/src/sys/dev/drm2/i915/intel_display.c:1192
#13 ironlake_pch_enable (crtc=<optimized out>)
    at /usr/local/src/sys/dev/drm2/i915/intel_display.c:3144
#14 ironlake_crtc_enable (crtc=0xfffffe0002ed4d38)
    at /usr/local/src/sys/dev/drm2/i915/intel_display.c:3388
#15 0xffffffff846e8262 in ironlake_enable_pch_pll (intel_crtc=<optimized out>)
    at /usr/local/src/sys/dev/drm2/i915/intel_display.c:1603
#16 ironlake_pch_enable (crtc=<optimized out>)
    at /usr/local/src/sys/dev/drm2/i915/intel_display.c:3115
#17 ironlake_crtc_enable (crtc=0x1fffe0002ed3000)
    at /usr/local/src/sys/dev/drm2/i915/intel_display.c:3388
#18 0xffffffff846dba52 in intel_crt_set_dpms (
    encoder=0xffffffff846e8262 <ironlake_crtc_enable+1554>, mode=3)
    at /usr/local/src/sys/dev/drm2/i915/intel_crt.c:115
#19 intel_disable_crt (encoder=0xffffffff846e8262 <ironlake_crtc_enable+1554>)
    at /usr/local/src/sys/dev/drm2/i915/intel_crt.c:120
#20 0xffffffff8472e5c9 in ?? () from /boot/kernel/i915kms.ko
#21 0xfffffe0002edcdf0 in ?? ()
#22 0x0000000000000003 in ?? ()
#23 0xfffff80007a353e0 in ?? ()
#24 0x0000000000000002 in ?? ()
#25 0xfffffe0233ee9730 in ?? ()
#26 0x00ffffff00000001 in ?? ()
#27 0x0000000000000003 in ?? ()
#28 0x0000000000000000 in ?? ()
(kgdb) 

If it's going to panic, it will always do so after exactly 12 seconds of uptime.

mattmacy commented 7 years ago

Ok, it helps to know where the null pointer dereference is. If @markjdb and @hselasky don't have time I'll take a look on the weekend.

cperciva commented 7 years ago

I have a System76 Galago Pro (https://wiki.freebsd.org/Laptops/System76%20Galago%20Pro) and see exactly the same warning and panic.

hselasky commented 7 years ago

I remember there was a tunable you could set to not do that GPU hang check: sysctl compat.linuxkpi.enable_hangcheck=0

vishwin commented 7 years ago

I'm currently having consistent panics after updating -CURRENT last night. Trying to disable hangcheck via sysctl, but alas, sysctl: unknown oid 'compat.linuxkpi.enable_hangcheck'

cperciva commented 7 years ago

@vishwin You need to set it in /boot/loader.conf.

vishwin commented 7 years ago

Don't remember that working either. The panics have stopped for now so I will try it when it decides to repeatedly panic again.

vishwin commented 7 years ago

Okay, so the loader.conf tunable works. However, suspend via acpiconf -s 3 is borked with the tunable set to disable hangcheck. Sometimes the screen shuts off, sometimes it hangs on whatever kernel messages are scrolling as the ACPI state changes, but both result in a vegetative state (for lack of a better term) and only a hard reset or power cycle will cure things (albeit having to boot again).

cperciva commented 5 years ago

Closing this since it has long since been fixed.