Open trombonehero opened 7 years ago
Here's what the screen looks like:
If I press the power button, the machine does shut down after a minute or two.
I have had this issue for a few months now. It did work at some point in the past, but I haven't gone through building old kernels to figure out which commit broke it. I recall it was some time between late January and early March. Is this a regression of #68? The visual symptoms are identical, but the cause may not be the same since that bug seemed to be fixed.
/var/log/messages contains many repeats of
Jun 11 15:28:48 theron-xps kernel: : 0x00000080
Jun 11 15:28:48 theron-xps kernel: [drm:gen8_de_irq_handler] Fault errors on pipe A
Jun 11 15:28:48 theron-xps kernel: : 0x00000080[drm:gen8_de_irq_handler] Fault errors on pipe A
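To put a number on "many repeats", something like the following could be used. It is only a sketch: the function name `count_pipe_faults` is made up for illustration, and the message text is copied from the log lines above.

```sh
# count_pipe_faults FILE
# Print how many lines of FILE contain the pipe A fault message.
# (Hypothetical helper; the message text is taken from the log above.)
count_pipe_faults() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps the function safe under "set -e".
    grep -c 'Fault errors on pipe A' "$1" || true
}
```

Typical use would be `count_pipe_faults /var/log/messages`. Note that syslogd may collapse repeats into a single "last message repeated N times" line, so the result is a lower bound rather than an exact count.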
The source seems to have been refactored since https://bugs.freedesktop.org/show_bug.cgi?id=91697#c21, but I monkey-merged it with e0c82f4 of drm-next. Upon resuming, I now see the desktop instead of a screen full of lines, but the system remains graphically unresponsive and can be rebooted by pressing the power button.
messages.txt intel-resume.diff.txt
I haven't tried to understand the driver code so I have very little idea what the change is doing or should be doing, but symptomatically it appears to be a step in the right direction.
@therontarigo Upon resume, are you able to ssh into the system? If so, could you please collect the output of the following?
$ top -SHza -d 1 | cat
$ procstat -kka
$ dmesg
I managed to get the top output via SSH before the system became unresponsive:
$ top -SHza -d 1 | cat 8:53:14
last pid: 81305; load averages: 0.65, 0.71, 0.72 up 0+02:32:15 08:53:15
739 processes: 6 running, 716 sleeping, 1 zombie, 16 waiting
Mem: 1189M Active, 1083M Inact, 30M Laundry, 1909M Wired, 3605M Free
ARC: 1103M Total, 354M MFU, 702M MRU, 1899K Anon, 9447K Header, 35M Other
908M Compressed, 1642M Uncompressed, 1.49:1 Ratio, 150M Overhead
Swap: 4096M Total, 4096M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
47533 jon 90 0 2298M 875M CPU3 3 1:51 62.35% thunderbird{thunderbird}
12 root -92 - 0K 384K WAIT 2 0:53 6.79% [intr{irq16: vgapci0}]
47533 jon 24 0 2298M 875M uwait 1 0:00 4.20% thunderbird{StreamTrans #20}
1222 jon 21 0 249M 73892K select 2 6:17 3.08% X :0 (Xorg){Xorg}
47533 jon 22 0 2298M 875M uwait 2 0:04 2.69% thunderbird{mozStorage #1}
686 root 20 0 10568K 2148K select 3 0:03 1.76% /usr/sbin/syslogd -s
0 root -12 - 0K 6864K - 3 0:19 1.17% [kernel{zio_write_issue_1}]
0 root -12 - 0K 6864K - 3 0:19 1.17% [kernel{zio_write_issue_2}]
0 root -12 - 0K 6864K - 0 0:19 1.07% [kernel{zio_write_issue_0}]
47533 jon 20 0 2298M 875M uwait 3 0:00 0.78% thunderbird{thunderbird}
47533 jon 21 0 2298M 875M uwait 2 0:00 0.59% thunderbird{thunderbird}
80920 jon 23 0 27124K 5964K pause 0 0:00 0.29% -zsh (zsh)
80439 jon 30 10 79520K 8768K select 3 0:00 0.29% bsod -root -atari -bsd -sparclinux
5 root -8 - 0K 176K dbuf_e 2 0:01 0.20% [zfskern{dbuf_evict_thread}]
47533 jon 20 0 2298M 875M uwait 3 0:01 0.20% thunderbird{JS Helper}
1253 jon 20 0 91936K 16976K usem 2 1:19 0.10% conky{conky}
5 root -8 - 0K 176K tx->tx 2 0:13 0.10% [zfskern{txg_thread_enter}]
4070 jon 20 0 731M 274M select 0 3:37 0.00% chrome: TaskSchedulerForegroundWorker0 (chrome){chrome}
I notice that my swap partition is full, which is perhaps contributing to the loss of responsiveness. Anyhow, I'll try to get the procstat and dmesg output later today.
Hm, no, the swap partition isn't used at all here. I don't see anything particularly odd in the top output - I think the dmesg would be more helpful.
I also get screen corruption on Kaby Lake + intel/SNA ddx. It is not as bad as what the originator of this ticket sees, though. I am also not swapping, and the procstat command does not show any processes:
$ procstat -kka
PID TID COMM TDNAME KSTACK
Also, there are no interesting events in my dmesg buffer since the corruption started. What may be interesting, though, is that I have two displays connected to this laptop. The corruption always happens on the laptop's eDP display, not the HDMI-connected display.
I've also got debugfs mounted, but am not seeing anything of interest in there. Let me know if there are other data points that may be of help.
On the off chance that the info in i915_context_status from debugfs is helpful, here it is:
cat i915_context_status
HW context 0 (kernel) r
render ring: I0xfffff80009cc0780K: gM 92KiB 41 00 [ 0 0 0 0 ] 0 LLC dirty (pinned x 1) (ggtt offset: 007e9000, size: 00017000, type: 0) (ringbuffer, space: 14904, head: 0, tail: 40, last head: 0)
blitter ring: I0xfffff80009cc0280K: gM 12KiB 41 00 [ 0 0 0 0 ] 0 LLC dirty (pinned x 1) (ggtt offset: 00804000, size: 00003000, type: 0) (ringbuffer, space: 16192, head: 0, tail: 0, last head: 0)
bsd ring: I0xfffff80009cbfc80K: gM 12KiB 41 00 [ 0 0 0 0 ] 0 LLC dirty (pinned x 1) (ggtt offset: 0080b000, size: 00003000, type: 0) (ringbuffer, space: 16320, head: 0, tail: 0, last head: -1)
video enhancement ring: I0xfffff80009cbf780K: gM 12KiB 41 00 [ 0 0 0 0 ] 0 LLC dirty (pinned x 1) (ggtt offset: 00812000, size: 00003000, type: 0) (ringbuffer, space: 16320, head: 0, tail: 0, last head: -1)
HW context 1 (Xorg [16434]) r
render ring: I0xfffff8006d8a0780K: gM 92KiB 01 01 [ 0 0 0 0 ] 0 LLC dirty (pinned x 1) (ggtt offset: 00819000, size: 00017000, type: 0) (ringbuffer, space: 8856, head: 14552, tail: 5632, last head: 5592)
blitter ring: I0xfffff8006d508000K: gM 12KiB 01 01 [ 0 0 0 0 ] 0 LLC dirty (pinned x 1) (ggtt offset: 01829000, size: 00003000, type: 0) (ringbuffer, space: 9072, head: 11328, tail: 2192, last head: 2160)
bsd ring: i
video enhancement ring: i
HW context 2 (xfce4-terminal [35765]) r
render ring: i
blitter ring: i
bsd ring: i
video enhancement ring: i
HW context 3 (wrapper-2.0 [35906]) r
render ring: i
blitter ring: i
bsd ring: i
video enhancement ring: i
HW context 4 (Web Content [86909]) r
render ring: i
blitter ring: I0xfffff8022f349280K: gM 12KiB 01 01 [ 0 0 0 0 ] 0 LLC dirty (pinned x 0) (ringbuffer, space: 16240, head: 0, tail: 80, last head: 48)
bsd ring: i
video enhancement ring: i
HW context 5 (Web Content [86909]) r
render ring: I0xfffff80131c09a00K: gM 92KiB 01 01 [ 0 0 0 0 ] 0 LLC dirty (pinned x 0) (ringbuffer, space: 15176, head: 0, tail: 1144, last head: 1104)
blitter ring: i
bsd ring: i
video enhancement ring: i
HW context 7 (Web Content [86909]) r
render ring: I0xfffff801a633e280K: gM 92KiB 01 01 [ 0 0 0 0 ] 0 LLC dirty (pinned x 0) (ggtt offset: 04952000, size: 00017000, type: 0) (ringbuffer, space: 15048, head: 0, tail: 1272, last head: 1232)
blitter ring: i
bsd ring: i
video enhancement ring: i
HW context 9 (Web Content [86909]) r
render ring: I0xfffff80200e80c80K: gM 92KiB 01 01 [ 0 0 0 0 ] 0 LLC dirty (pinned x 0) (ggtt offset: 049c2000, size: 00017000, type: 0) (ringbuffer, space: 15048, head: 0, tail: 1272, last head: 1232)
blitter ring: i
bsd ring: i
video enhancement ring: i
HW context 6 (Web Content [86909]) r
render ring: I0xfffff80107907500K: gM 92KiB 01 01 [ 0 0 0 0 ] 0 LLC dirty (pinned x 0) (ggtt offset: 03952000, size: 00017000, type: 0) (ringbuffer, space: 15048, head: 0, tail: 1272, last head: 1232)
blitter ring: i
bsd ring: i
video enhancement ring: i
Oops, that was 4096M free, not 4096M used! :)
The output of procstat -kka is:
PID TID COMM TDNAME KSTACK
1090 100672 tmux - mi_switch+0x18b sleepq_switch+0x10f sleepq_catch_signals+0x308 sleepq_wait_sig+0xf _cv_wait_sig+0x1fd seltdwait+0x8d kern_poll+0x3f8 sys_poll+0x50 amd64_syscall+0x589 Xfast_syscall+0xfb
1100 100643 tmux - mi_switch+0x18b sleepq_switch+0x10f sleepq_catch_signals+0x308 sleepq_timedwait_sig+0x14 _cv_timedwait_sig_sbt+0x220 seltdwait+0x6b kern_poll+0x3f8 sys_poll+0x50 amd64_syscall+0x589 Xfast_syscall+0xfb
1106 100670 zsh - mi_switch+0x18b sleepq_switch+0x10f sleepq_catch_signals+0x308 sleepq_wait_sig+0xf _cv_wait_sig+0x1fd tty_wait+0x42 ttydisc_read+0x233 ttydev_read+0x4b devfs_read_f+0xe0 dofileread+0xba kern_readv+0x68 sys_read+0x86 amd64_syscall+0x589 Xfast_syscall+0xfb
1630 100711 zsh - mi_switch+0x18b sleepq_switch+0x10f sleepq_catch_signals+0x308 sleepq_wait_sig+0xf _sleep+0x32a kern_sigsuspend+0xb4 sys_sigsuspend+0x31 amd64_syscall+0x589 Xfast_syscall+0xfb
1794 100571 procstat - sysctl_kern_proc_kstack+0x1df sysctl_root_handler_locked+0x90 sysctl_root+0x1c4 userland_sysctl+0x148 sys___sysctl+0x5f amd64_syscall+0x589 Xfast_syscall+0xfb
1795 100622 tee - mi_switch+0x18b sleepq_switch+0x10f sleepq_catch_signals+0x308 sleepq_wait_sig+0xf _sleep+0x32a pipe_read+0x318 dofileread+0xba kern_readv+0x68 sys_read+0x86 amd64_syscall+0x589 Xfast_syscall+0xfb
The output of dmesg is around 1,600 lines of:
: 0x00000080[drm:gen8_de_irq_handler] Fault errors on pipe A
Hopefully that's a bit more illuminating.
Just FYI, when the machine (running da5f90154f123ea316971de3e096f29b528a8c28) is plugged into an external display (via either DP or HDMI), I see the same pattern on both displays:
Even more interestingly, when the computer is in this state it starts making an unusual sound a few seconds after resuming: http://www.engr.mun.ca/~anderson/spectre-sound.m4a
Okay, I found that the exact commit that broke this was 41b97ee, which in fact only updated the firmware. By reverting it I now have working resume.
diff --git a/sys/modules/drm/i915/i915kmsfw/skldmc/Makefile b/sys/modules/drm/i915/i915kmsfw/skldmc/Makefile
index d6c389c24d8..c066a25daae 100644
--- a/sys/modules/drm/i915/i915kmsfw/skldmc/Makefile
+++ b/sys/modules/drm/i915/i915kmsfw/skldmc/Makefile
@@ -1,7 +1,7 @@
# $FreeBSD$
-KMOD = i915_skl_dmc_ver1_26_bin
-NAME = i915/skl_dmc_ver1_26.bin
-IMG = skl_dmc_ver1_26
+KMOD = i915_skl_dmc_ver1_bin
+NAME = i915/skl_dmc_ver1.bin
+IMG = skl_dmc_ver1
.include <bsd.kmod.mk>
diff --git a/sys/modules/drm/i915/i915kmsfw/sklguc/Makefile b/sys/modules/drm/i915/i915kmsfw/sklguc/Makefile
index 509b6e2b795..b787d2d9540 100644
--- a/sys/modules/drm/i915/i915kmsfw/sklguc/Makefile
+++ b/sys/modules/drm/i915/i915kmsfw/sklguc/Makefile
@@ -1,6 +1,6 @@
# $FreeBSD$
-KMOD = i915_skl_guc_ver6_1_bin
-NAME = i915/skl_guc_ver6_1.bin
-IMG = skl_guc_ver6_1
+KMOD = i915_skl_guc_ver4_bin
+NAME = i915/skl_guc_ver4.bin
+IMG = skl_guc_ver4
.include <bsd.kmod.mk>
Is this Intel's bug, or has the kernel code elsewhere failed to make a change needed for compatibility with the newer firmware? For now, using the older firmware seems to be the solution to go with as a user.
@therontarigo: thank you! Applying this workaround also fixes my resume corruption.
@therontarigo nice find! @hselasky do you see what the problem is here?
@therontarigo @markjdb : Reverting the installed firmware version will break other setups, because it basically means the default firmware will be used instead of the shipped one. I.e., applying this patch will prevent the firmware from being loaded, because the kernel always asks for ver6_1? Can you check this in dmesg?
@hselasky Indeed,
[drm:intel_guc_setup] GuC fw status: path i915/skl_guc_ver6_1.bin, fetch NONE, load NONE
[drm] GuC firmware load skipped
It seems likely then that the workaround for some reason depends on the absence of the firmware and the problem would come back if the kernel were to load the ver4 guc and ver1 dmc.
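For anyone else checking this, the status line can be pulled out of dmesg with a small filter. This is a sketch that assumes the message format is exactly as quoted above; `guc_fw_status` is a made-up name.

```sh
# guc_fw_status: read kernel messages on stdin and print the fetch and load
# states from the last "GuC fw status" line, in the form
# "fetch=<STATE> load=<STATE>". Prints nothing if no such line is present.
# (Hypothetical helper; assumes the line format quoted in this thread.)
guc_fw_status() {
    sed -n 's/.*GuC fw status: path [^,]*, fetch \([A-Z]*\), load \([A-Z]*\).*/fetch=\1 load=\2/p' |
        tail -n 1
}
```

Run it as `dmesg | guc_fw_status`. With the output quoted above it would print `fetch=NONE load=NONE`, which matches the "GuC firmware load skipped" message.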
Why is the kernel still trying to load the newer file? There are absolutely no instances of the string "ver6_1" in my source tree!
@therontarigo : Look for this:
#define SKL_FW_MAJOR 6
#define SKL_FW_MINOR 1
They define the firmware versions to be loaded. Can you check if a newer firmware revision is available?
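As an illustration of why a search for "ver6_1" comes up empty: the path seen in dmesg is consistent with a name assembled from the platform prefix and those two numbers. The sketch below assumes that scheme (it is inferred from the observed path `i915/skl_guc_ver6_1.bin`, not taken from the driver source), and `guc_fw_path` is a hypothetical name.

```sh
# guc_fw_path PLATFORM MAJOR MINOR
# Print the firmware path that would result from joining the platform prefix
# with the major and minor version numbers.
# (Assumed naming scheme, inferred from the path shown in dmesg.)
guc_fw_path() {
    printf 'i915/%s_guc_ver%s_%s.bin\n' "$1" "$2" "$3"
}
```

`guc_fw_path skl 6 1` gives the path from the dmesg output above. To find where the numbers are set, searching the tree for the macro name (for example `grep -rn SKL_FW_MAJOR .`) is likely to be more productive than searching for the file name.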
6.1 is the latest as per this Intel site:
https://01.org/linuxgraphics/downloads/firmware https://01.org/linuxgraphics/downloads/skylake-guc-6.1
Also wanted to mention I enabled loading GuC on my Skylake system and the firmware loads as expected:
[drm:guc_fw_fetch] fetch GuC fw from i915/skl_guc_ver6_1.bin succeeded, fw 0xfffff8001232d6c0
[drm:guc_fw_fetch] firmware version 6.1 OK (minimum 6.1)
[drm:guc_fw_fetch] GuC fw fetch status SUCCESS, obj 0xfffff80012325c80
@hselasky Thanks, I figured it might be something like that. Now I see that the Makefile that was modified only controls what is copied upon kernel install.
By not loading these firmwares, are scheduling and certain sleep tasks falling back on kernel code as the documentation at https://01.org/linuxgraphics/downloads/firmware seems to imply, or is there a "default firmware" somewhere in ROM or elsewhere in the source tree?
@therontarigo : Did you try searching for similar issues? Also, did you try the very latest drm-next branch? There is also an effort to upgrade to Linux 4.10. Maybe this issue is already known and fixed.
Just FYI, I still see this corruption on resume with ad07fe7f5b4.
@pewright-tronc: when you say that you "enabled loading GuC"... how do you do that? I've tried kldload i915_skl_skl_guc_ver6_1_bin.ko, but I still see GuC fw status: path i915/skl_guc_ver6_1.bin, fetch NONE, load NONE in the output from dmesg. Do I need to copy the firmware file into some well-known location in the filesystem?
From what I've found, loading i915-related firmware has been hit or miss. On my Skylake box, when I load the i915kms driver, it automatically finds the skl GuC and DMC firmware and loads it. Yet on a Kaby Lake box, when I build the appropriate firmware modules, the DMC does load, but I get the same error that you see for the GuC bits.
My guess is that the "fetch NONE, load NONE" error is a red herring, although unfortunately I don't have anything to back that up at the moment :(
Do you happen to see the DMC firmware getting loaded on your system? It's named "i915_skl_dmc_ver1_26_bin.ko".
Ok, so it sounds like perhaps I shouldn't worry about the GuC thing...
Also, yes, the DMC firmware gets loaded automatically when I kldload i915kms.
It looks like @cperciva has now gotten to the point of seeing the same coloured bars as me... progress of a kind? :)
It is GuC that always fails to load. DMC loads without issue, but the video corruption occurs when it is loaded - deleting the module containing the firmware works around the issue, but likely at the cost of greater power consumption.
I have heard that Skylake video resume is not an issue on OpenBSD - could this be as simple as they do not load the firmware in the first place, as it cannot be audited?
Thanks to everyone for the digging and discussion. Just wanted to add a note regarding my positive experience using the workaround. Running with top-of-tree drm-next GENERIC (rev 300ce5decf0) + current FreeBSD.org generic pkgs on a Dell Latitude E7470 with an i7-6600U, I've never had problems suspending, and the following gets me functional resume without the graphical stripe corruption:
mv /boot/kernel/i915_skl_dmc_ver1_26_bin.ko /boot/kernel/noload_i915_skl_dmc_ver1_26_bin.ko
On this machine, only i915_skl_dmc_ver1_26_bin.ko is loaded automatically by i915kms.ko, and i915_skl_guc_ver6_1_bin.ko is not.
I tried explicitly loading both dmc and guc modules from rc.conf like so:
kld_list="i915_skl_guc_ver6_1_bin i915_skl_dmc_ver1_26_bin i915kms if_iwm iwm8000Cfw"
It made no difference...
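A reversible form of the rename above might look like the following. It is a sketch only: `disable_dmc`, `enable_dmc` and `KMOD_DIR` are names made up for illustration, and the module file name is the one from this thread.

```sh
# Directory holding the firmware module; defaults to /boot/kernel as in the
# mv command above. (KMOD_DIR is a hypothetical override.)
KMOD_DIR=${KMOD_DIR:-/boot/kernel}
DMC_KO=i915_skl_dmc_ver1_26_bin.ko

# Rename the DMC firmware module so that i915kms can no longer find it.
# Does nothing if the module is not there.
disable_dmc() {
    if [ -e "$KMOD_DIR/$DMC_KO" ]; then
        mv "$KMOD_DIR/$DMC_KO" "$KMOD_DIR/noload_$DMC_KO"
    fi
}

# Undo disable_dmc by restoring the original file name.
enable_dmc() {
    if [ -e "$KMOD_DIR/noload_$DMC_KO" ]; then
        mv "$KMOD_DIR/noload_$DMC_KO" "$KMOD_DIR/$DMC_KO"
    fi
}
```

Both functions need to run as root, and the rename only matters the next time i915kms is loaded. If the firmware modules were installed somewhere other than /boot/kernel, set KMOD_DIR to that directory first.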
@lastewart : thanks very much! This workaround works for me too, and lets me run with GENERIC base packages (+ mv) instead of munging a Makefile in the build.
Just a quick update: my (Skylake) notebook is still experiencing this issue as of f304e52bd7a, though the previously proposed workaround continues to help. I wonder: are there examples of Skylake machines that do not experience this video corruption on resume? If not, might it be worth disabling the skl_dmc firmware in the relevant Makefile of the drm-next branch?
Hi,
Can you try this patch: https://github.com/FreeBSDDesktop/freebsd-base-graphics/commit/06c44265a187d60a462e5735278eee6c3566e6bc
--HPS
Have you tried setting "sysctl hw.acpi.reset_video=1" before suspending?
It seems that the system doesn't resume with that sysctl enabled: I get a black screen and an unresponsive machine (pressing the power button again doesn't shut the system down, for example).
I have this exact same issue with a Lenovo X1. Setting "sysctl hw.acpi.reset_video=1" prevents the system from resuming altogether. Otherwise I have the same pinstripe-looking video corruption as shown in the screenshot. This is running stock CURRENT r330606 with 4.11.g20180224 installed fresh today. Let me know if I can provide any additional debugging to resolve this.
@lastewart
To load the guc you need to force it, since it's disabled by default in the Intel driver. For me it fails to load about half the time. I think my Skylake GPU hangs with the new firmware, which might be why Intel has not enabled it yet (the system continues to boot as normal with guc disabled after an automatic GPU reset).
Add to /boot/loader.conf
compat.linuxkpi.enable_guc_submission=2
compat.linuxkpi.enable_guc_loading=2
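If you want to script that edit so it can be run more than once without duplicating lines, a sketch along these lines should do. `set_tunable` is a hypothetical helper, not an existing tool, and it only appends; it does not change a value that is already set.

```sh
# set_tunable FILE NAME VALUE
# Append "NAME=VALUE" to FILE unless some line already starts with "NAME=".
# (Hypothetical helper; an existing setting is left untouched.)
set_tunable() {
    file=$1 name=$2 value=$3
    # index(...) == 1 is a literal prefix match, so the dots in tunable
    # names are not treated as regex wildcards. A missing file counts as
    # "not set" and is created by the append below.
    if ! awk -v n="$name=" 'index($0, n) == 1 { found = 1 } END { exit !found }' \
        "$file" 2>/dev/null; then
        printf '%s=%s\n' "$name" "$value" >> "$file"
    fi
}
```

For the two lines above that would be `set_tunable /boot/loader.conf compat.linuxkpi.enable_guc_submission 2` and `set_tunable /boot/loader.conf compat.linuxkpi.enable_guc_loading 2`, run as root. Given the load failures described above, treat this as an experiment rather than a recommended setting.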
scatterlist.h is identical to what is in the drm-next branch in freebsd-base-graphics. It has my fixes from August:
commit a7dcabca4964e1f8fc146265fd98bacfcceca754 Author: Hans Petter Selasky hps@selasky.org Date: Wed Aug 16 16:31:35 2017 +0200
scatterlist.h is identical to what is in the drm-next branch in freebsd-base-graphics. It has my fixes from August
Yes I realized that the comment was very old :)
@johalun
To load the guc you need to force it
Dumb question here, but why do we want the guc? Does it do anything aside from giving us mangled video output on resume?
(I mean, I assume it's supposed to do something, but...)
@cperciva I think Intel's plan is to replace execlists with guc as a means to submit commands to the gpu. Right now we don't have to care but we need to be ready when they hit the switch!
What's the plan to overcome this issue? Remove dmc firmware from /boot/modules ?
@abishai If that solves your problem, yes do that for now until we can locate the source of the problem.
@johalun Yes, it does. During the boot sequence, the driver warns me that runtime power management is disabled, but I failed to measure any difference, at least with acpiconf -i 0. The numbers seem to be the same.
@johalun On an HP EliteBook 1040 G3 (Skylake i7-6600U), moving the dmc-named modules helps me suspend and resume. Not that I really do that (I prefer hard shutdowns, old habit, don't care to break it because it saves battery).
Issue persists (unless firmware modules are deleted) on FreeBSD 12.0-CURRENT r335560, drm-stable-kmod-g20180606 (from ports).
Linux had this problem, but it was fixed: https://bugs.freedesktop.org/show_bug.cgi?id=91697 A question to try to answer is whether the fix referenced in #68 solved this problem for only some but not all hardware or instead the problem is somehow unrelated.
Kernel output during the problem remains as always:
kernel: [drm:gen8_de_irq_handler] Fault errors on pipe A
kernel: : 0x00000080[drm:gen8_de_irq_handler] Fault errors on pipe A
syslogd: last message repeated 556 times
Issues #149 and #68 seem to be exact duplicates.
For what it is worth, the symptoms have been encountered even on Windows: https://software.intel.com/en-us/forums/graphics-driver-bug-reporting/topic/747997 https://communities.intel.com/thread/118447
Why is the dmc firmware needed, and why is it packaged if it causes resume issues?
@pkgdemon Newer firmware brings updates and fixes to the GPU microcode; you should be OK without it for basic functionality. The resume issue is most likely not related to the firmware itself but to how the firmware is loaded in FreeBSD / LinuxKPI. Investigating this more closely is on the todo list.
@pkgdemon I have a strong suspicion that without dmc, the plasma5 compositor becomes unstable (producing artifacts from time to time) with the OpenGL backend when doing vsync.
@abishai Does the combination of absent dmc with drm-stable-kmod produce these artifacts? When I used drm-next-kmod in combination with no dmc (never tried with dmc for comparison) I encountered severe artifacts in Xorg after switching consoles or toggling display outputs; drm-stable-kmod does not show this problem.
@therontarigo Yes, I run drm-stable-kmod. Visually, these are screen redraw glitches when vsync is in use. They occur very infrequently and go away if I toggle the backend in the compositor settings (OpenGL 2 to 3.1 or back, or to xrandr). However, I can't be sure they are directly caused by the missing dmc. Maybe they are a result of suspend/resume. For obvious reasons, I don't use it with dmc present.
This seems resolved:
12.0-ALPHA8 FreeBSD 12.0-ALPHA8 #3 r339095M
% kldstat | grep i915
9 1 0xffffffff83118000 11d800 i915kms.ko
15 1 0xffffffff832cc000 245d i915_skl_dmc_ver1_26_bin.ko
Suspend and resume with no (graphics-related) problems...
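When retesting, it is worth confirming that the DMC firmware really is loaded, since the earlier workaround was to keep it from loading. A minimal check, assuming module names follow the `i915_..._dmc_....ko` pattern seen in this thread (`dmc_loaded` is a made-up name):

```sh
# dmc_loaded: read kldstat output on stdin and succeed (exit 0) if any listed
# module looks like an i915 DMC firmware module, otherwise fail (exit 1).
# (Hypothetical helper; the name pattern is inferred from this thread.)
dmc_loaded() {
    grep -q 'i915_.*_dmc_.*\.ko'
}
```

For example: `kldstat | dmc_loaded && echo "DMC firmware is loaded"`. If nothing is printed, the resume test was effectively run without the firmware, and a clean resume says nothing about whether the bug is fixed.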
Confirmed, the latest version works fine. IIRC someone at BSDCam said that it was an issue with reloading the firmware after resume.
I have no problems either on the HP 1040 G3. However, I don't suspend and resume other than to test the laptop. I turn off my laptop completely. It's an old habit from the days when the best CPU my laptop had was a Celeron M, and one that I refuse to break (my current laptop has an i7).
When I resume from suspend, I see the following corruption. This has been going on for a couple of months; @markjdb asked me to retest this week, but alas, the corruption continues.