Laptop freezes when starting X11 and discrete graphics are OFF

jgkamat commented 8 years ago

[edit by @Lekensteyn] This issue affects newer laptops (from about 2015-2016) with Skylake and GTX 9xxM/10xx cards. A workaround exists for some laptops, see https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238 [/edit]

I'm having a weird issue, and I'm not sure what kind of debug information is necessary, but let me know what to give and I'll supply anything you need.

When I start my graphics (lxdm), I get a freeze (keyboard stops working, no response on monitor at all, even log files stop working), but I can work around this by enabling the graphics card before starting graphics.

System (installed with bumblebee-nvidia in debian testing repos):

Debian Testing
GTX 965M
Nvidia Proprietary Driver: 352.79 
Laptop: SAGER NP7258

Optirun --version:

optirun (Bumblebee) 3.2.1
Copyright (C) 2011 The Bumblebee Project
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

My laptop does not seem to work without optimus: the intel drivers work fine, but trying to run without them (nvidia only) results in a frozen screen. The workaround works perfectly for me, however.

Steps to Reproduce:

systemctl start bumblebeed
systemctl start lxdm
Freeze occurs

Workaround:

systemctl start bumblebeed
echo "ON" >/proc/acpi/bbswitch
systemctl start lxdm
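
The workaround above could be wrapped in a small helper so it does not have to be typed on every boot. This is only a sketch based on the commands in this report: the function names and the BBSWITCH override variable are made up for illustration, and it has to run as root.

```shell
#!/bin/sh
# Sketch of the workaround: power the discrete GPU on through bbswitch
# before the display manager starts. BBSWITCH is a hypothetical override
# so the helper can be dry-run against an ordinary file.
BBSWITCH="${BBSWITCH:-/proc/acpi/bbswitch}"

dgpu_on() {
    # bbswitch takes the literal string ON (or OFF) written to its proc file.
    echo ON > "$BBSWITCH"
}

start_dm_with_dgpu_on() {
    # $1: display manager unit to start, e.g. lxdm as in the steps above.
    systemctl start bumblebeed &&
    dgpu_on &&
    systemctl start "$1"
}

# Usage (as root): start_dm_with_dgpu_on lxdm
```

Each step only runs if the previous one succeeded, so the display manager is never started while the card is still off.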

Unfortunately, the X11 log files don't seem to survive the freeze (they show everything completing successfully, probably from the previous successful boot). If you know any way of retrieving them I'd be happy to supply them though! (When the system freezes, even my shell history file gets corrupted.)

I did have to make some changes to my config files to get things to work in my situation, though; I'll post anything I remember changing below. Let me know if you need any more information, I am happy to supply it! Without bumblebee, my laptop would be unusable :+1:

bumblebee.conf

# Configuration file for Bumblebee. Values should **not** be put between quotes

## Server options. Any change made in this section will need a server restart
# to take effect.
[bumblebeed]
# The secondary Xorg server DISPLAY number
VirtualDisplay=:8
# Should the unused Xorg server be kept running? Set this to true if waiting
# for X to be ready is too long and don't need power management at all.
KeepUnusedXServer=false
# The name of the Bumbleblee server group name (GID name)
ServerGroup=bumblebee
# Card power state at exit. Set to false if the card shoud be ON when Bumblebee
# server exits.
TurnCardOffAtExit=false
# The default behavior of '-f' option on optirun. If set to "true", '-f' will
# be ignored.
NoEcoModeOverride=false
# The Driver used by Bumblebee server. If this value is not set (or empty),
# auto-detection is performed. The available drivers are nvidia and nouveau
# (See also the driver-specific sections below)
Driver=nvidia
# Directory with a dummy config file to pass as a -configdir to secondary X
XorgConfDir=/etc/bumblebee/xorg.conf.d

## Client options. Will take effect on the next optirun executed.
[optirun]
# Acceleration/ rendering bridge, possible values are auto, virtualgl and
# primus.
Bridge=auto
# The method used for VirtualGL to transport frames between X servers.
# Possible values are proxy, jpeg, rgb, xv and yuv.
VGLTransport=proxy
# List of paths which are searched for the primus libGL.so.1 when using
# the primus bridge
PrimusLibraryPath=/usr/lib/x86_64-linux-gnu/primus:/usr/lib/i386-linux-gnu/primus:/usr/lib/primus:/usr/lib32/primus
# Should the program run under optirun even if Bumblebee server or nvidia card
# is not available?
AllowFallbackToIGC=false

# Driver-specific settings are grouped under [driver-NAME]. The sections are
# parsed if the Driver setting in [bumblebeed] is set to NAME (or if auto-
# detection resolves to NAME).
# PMMethod: method to use for saving power by disabling the nvidia card, valid
# values are: auto - automatically detect which PM method to use
#         bbswitch - new in BB 3, recommended if available
#       switcheroo - vga_switcheroo method, use at your own risk
#             none - disable PM completely
# https://github.com/Bumblebee-Project/Bumblebee/wiki/Comparison-of-PM-methods

## Section with nvidia driver specific options, only parsed if Driver=nvidia
[driver-nvidia]
# Module name to load, defaults to Driver if empty or unset
KernelDriver=nvidia-current
PMMethod=bbswitch
# colon-separated path to the nvidia libraries
LibraryPath=/usr/lib/x86_64-linux-gnu/nvidia:/usr/lib/i386-linux-gnu/nvidia:/usr/lib/nvidia
# comma-separated path of the directory containing nvidia_drv.so and the
# default Xorg modules path
XorgModulePath=/usr/lib/nvidia,/usr/lib/xorg/modules
XorgConfFile=/etc/bumblebee/xorg.conf.nvidia

## Section with nouveau driver specific options, only parsed if Driver=nouveau
[driver-nouveau]
KernelDriver=nouveau
PMMethod=auto
XorgConfFile=/etc/bumblebee/xorg.conf.nouveau

xorg.conf.nvidia

Section "ServerLayout"
Identifier  "Layout0"
Option      "AutoAddDevices" "false"
Option      "AutoAddGPU" "false"
EndSection

Section "Device"
Identifier  "DiscreteNvidiaj"
Driver      "nvidia"
VendorName  "NVIDIA Corporation"

#   If the X server does not automatically detect your VGA device,
#   you can manually set it here.
#   To get the BusID prop, run `lspci | egrep 'VGA|3D'` and input the data
#   as you see in the commented example.
#   This Setting may be needed in some platforms with more than one
#   nvidia card, which may confuse the proprietary driver (e.g.,
#   trying to take ownership of the wrong device). Also needed on Ubuntu 13.04.
BusID "PCI:01:00:0"

#   Setting ProbeAllGpus to false prevents the new proprietary driver
#   instance spawned to try to control the integrated graphics card,
#   which is already being managed outside bumblebee.
#   This option doesn't hurt and it is required on platforms running
#   more than one nvidia graphics card with the proprietary driver.
#   (E.g. Macbook Pro pre-2010 with nVidia 9400M + 9600M GT).
#   If this option is not set, the new Xorg may blacken the screen and
#   render it unusable (unless you have some way to run killall Xorg).
Option "ProbeAllGpus" "false"

Option "NoLogo" "true"
Option "UseEDID" "false"
Option "UseDisplayDevice" "none"
EndSection

# Section "Screen"
#     Identifier "Default Screen"
#   Device "DiscreteNvidia"
# EndSection

bluca commented 8 years ago

If you run:

sudo update-glx --config glx

What is the selected config? It should be /usr/lib/nvidia/bumblebee.

Does the same problem happen if you choose /usr/lib/mesa-diverted instead?

Finally, do you have another DE to try (Gnome would be best) to help narrow it down?

jgkamat commented 8 years ago

I've been using /usr/lib/nvidia/bumblebee so far. I tried out mesa-diverted and got the same result. I've tried this by starting lxdm, by manually running startx to start xfce, and with sddm (kde), and all show the same behavior. If you think gdm would help I'll try that out, but I would rather not install all of gnome.

bluca commented 8 years ago

/usr/lib/nvidia/bumblebee is the right one (default) when having bumblebee, I wanted to see if removing all traces of nvidia from the path helped.

It is really strange that X is affected by bumblebee when not running through it. Can you get to another TTY when the screen is frozen?

Don't bother with GDM for now if it's a hassle, was just trying to narrow it down. I'll install xfce on my sid partition and see what happens.

jgkamat commented 8 years ago

I think this is an issue specific to my hardware setup (as discrete graphics cannot be forced on, optimus must be used). When I say 'the screen is frozen', the TTY I am in (I'm manually starting a display manager) stops responding (the cursor stops blinking). I can't switch to another TTY. Even the keyboard caps lock/numlock lights no longer change when I press them, and the SysRq keys no longer work either. The system has to be force powered off.

jgkamat commented 8 years ago

I just double checked: ssh sessions freeze too when this occurs.

bluca commented 8 years ago

A kernel hard-lock then, that's a pain. Have you tried nouveau?

karolherbst commented 8 years ago

Maybe nouveau is already loaded and causes the hang, because something doesn't work and Xorg freezes due to a messed up modesetting DDX?

bluca commented 8 years ago

With the bumblebee-nvidia package nouveau is blacklisted, so it can't be loaded.

karolherbst commented 8 years ago

and I hope nvidia is also blacklisted, but Xorg freezes and that usually happens for a bad reason.

My guess is: X loads the nvidia DDX, which autoloads the nvidia kernel driver.

bluca commented 8 years ago

Yes, all the kernel modules are blacklisted. And the nvidia libraries are out of the path (hence my question earlier about update-alternatives).

karolherbst commented 8 years ago

I dealt with so many users where something was messed up, that I wouldn't rely on anything here. And that nvidia gets loaded also explains why turning the GPU off helps.

In fact, the nvidia libraries don't need to be in the path for that, because the nvidia DDX alone is enough, and different paths are used for it.

Anyhow, without logs it will be painful to debug this.

jgkamat commented 8 years ago

I've tried with nouveau and I still see the same issue (the workaround also worked under nouveau, but I started to see some weird behavior, like some CPU cores sticking at 100%). Also, when running optirun I got some permission denied errors with nouveau. I'm not sure if this will help though.

Just to clarify, simply turning the discrete video card ON with bbswitch before starting X11 fixes my issue (but it is a hassle to deal with every time). I'm not sure if there are any ways for me to get logs with this situation, but if there are let me know. When I run startx, the screen freezes before any errors come up, so I'm not sure if there is much I can do.

bumblebee blacklists all the nvidia/nouveau modules by default, and I have nvidia set in bumblebee.conf, so I think nouveau isn't conflicting? If there is any way to test this I would be happy to do so!

karolherbst commented 8 years ago

Well, you don't use bumblebee with nouveau, and that support should be removed from bumblebee.

karolherbst commented 8 years ago

@jgkamat what really would help would be the dmesg output. Maybe you can do "dmesg -w" through ssh while you start X and see if you get enough useful output this way.

bluca commented 8 years ago

If dmesg can write it, so will journalctl. If you haven't, enable persistent journal (create /var/log/journal) and then after the freeze reboot and check the previous boot journal with journalctl -b -1
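
A minimal sketch of the persistent-journal suggestion above. It assumes journald is running with its default Storage=auto setting, under which it starts keeping logs on disk once the directory exists; the function name and its optional directory argument are hypothetical.

```shell
#!/bin/sh
# Sketch: create the directory journald uses for persistent storage.
# $1 is a hypothetical override of the journal directory, for dry runs.
enable_persistent_journal() {
    dir="${1:-/var/log/journal}"
    mkdir -p "$dir" &&
    echo "created $dir; reboot or restart systemd-journald to start persisting"
}

# Usage (as root):
#   enable_persistent_journal
# After the next freeze and reboot, kernel messages from the previous boot:
#   journalctl -k -b -1
```

Whether anything from the moment of the hang actually reaches the disk depends on how abruptly the kernel stops.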

karolherbst commented 8 years ago

@bluca His machine crashes completely. On a crash, error logs usually can't be written anymore, because the kernel has stopped doing anything. dmesg -w could help us because it displays messages immediately (even before they get written to disk), but if the network dies too fast he wouldn't get this either and would need to set up netconsole, although that also requires a working network.

@jgkamat maybe you have something inside pstore (/sys/fs/pstore)

check here for pstore information:

https://lwn.net/Articles/434821/ https://www.kernel.org/doc/Documentation/ABI/testing/pstore https://www.kernel.org/doc/Documentation/ramoops.txt
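
A quick way to check for leftover crash records after rebooting could look like the sketch below. The path is the one mentioned above; the function name and the PSTORE_DIR override are only illustrative, and whether anything gets stored there depends on the firmware and on a pstore backend being enabled in the kernel.

```shell
#!/bin/sh
# Sketch: report whether pstore holds any records from an earlier crash.
# PSTORE_DIR is a hypothetical override so the check can be tried on any directory.
PSTORE_DIR="${PSTORE_DIR:-/sys/fs/pstore}"

list_pstore() {
    if [ -d "$PSTORE_DIR" ] && [ -n "$(ls -A "$PSTORE_DIR" 2>/dev/null)" ]; then
        # Records exist: list them so they can be copied somewhere safe.
        ls -l "$PSTORE_DIR"
    else
        echo "no pstore records in $PSTORE_DIR"
    fi
}

# Usage (as root, after rebooting from a freeze): list_pstore
```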

jgkamat commented 8 years ago

I tried setting up a netconsole (and dmesg -w over ssh) and that doesn't seem to give me any logs before the freeze either. I don't have anything currently inside pstore as far as I can tell. I'm starting to think that this is some sort of race condition where bumblebee tries to turn on the nvidia driver before X starts, but X manages to start before the nvidia card comes online, leading to a lockout (or maybe my hardware can't deal with xorg starting without the nvidia card being on). Running modprobe nvidia before X also makes X start properly, as it also forces the nvidia card on.

karolherbst commented 8 years ago

@jgkamat could you add a xorg.conf file in /etc/X11 with this content and start X while the gpu is off? https://gist.github.com/karolherbst/1f1bdd1a3822df74097f

and check if your nvidia card also has the 01:00.0 address in lspci. If this works, that means something is loaded which makes your kernel unhappy.

jgkamat commented 8 years ago

Unfortunately, I'm still seeing the same issue with this config. Just to be sure, I created a new xorg.conf file (as the docs say that none should be present) with that config. My Nvidia card is on that bus. Here's the output of lspci, if that helps:

00:00.0 Host bridge: Intel Corporation Sky Lake Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Sky Lake PCIe Controller (x16) (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Skylake Integrated Graphics (rev 06)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA Controller [AHCI mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #3 (rev f1)
00:1c.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #4 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation Sunrise Point-H HD Audio (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GM206M [GeForce GTX 965M] (rev a1)
02:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 5287 (rev 01)
03:00.1 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 12)

Should that file have gone in /etc/bumblebee/xorg.conf.d instead?

Lekensteyn commented 8 years ago

I have a Clevo P650RA/P651RA (and also access to a Clevo P670RA/P671RA) which both have GTX 965M cards as well. This issue could be related to https://github.com/Bumblebee-Project/bbswitch/issues/115

In my case an infinite loop would occur in ACPI. See https://github.com/Bumblebee-Project/bbswitch/issues/115#issuecomment-218551781 for more details if you are interested.

jgkamat commented 8 years ago

I'm not seeing any issues with suspend to the best of my knowledge (the video card is off before/after a sleep, according to bbswitch, and that works fine for me). These issues could be related though.

I'm honestly pretty stoked at how well this performs (with this workaround in place), but I'm worried that a slight change could break it more. I'm happy to provide any more information if that would help!

EDIT: My laptop is a CLEVO N155RF (sager just rebrands them?)

jkehler commented 8 years ago

I've been having the exact same issue with my MSI GE62. If I start X11 with the 960M turned off it will do a hard lock. But if I turn it on first and then start X11 it works fine.

I should also note that with Gnome, GDM will start fine with the 960M turned off. But once I enter my password to log in to Gnome, it will do a hard lock. I presume this is because GDM is using Wayland?

Warpgamer commented 8 years ago

@jkehler : I'm having the exact same behavior with the same model, except I have a 970M. I created a script that runs after GDM login and starts bumblebee. However, when manually stopping the bumblebee service, half of the time it totally freezes the system, like it does when GDM attempts to log in with the discrete card off.

jkehler commented 8 years ago

Actually I had just realized I had never actually tried starting Gnome with Wayland instead of X11 to see if it hard freezes. I just tried it now and when using Wayland it worked fine with the 960M turned off. So it definitely appears to just be an issue with X11.

jgkamat commented 8 years ago

I've had a couple random freezes too. Most of the time, they are triggered by some 'low level' operations, or things involving the graphics card (e.g. starting steam, modprobes, even lspci once). This is usually accompanied by some audio garbling for some reason (before hard faulting). If I enable the discrete graphics card via bbswitch then I never have this issue, however.

This is my xorg version, if that helps. I've never tried out wayland, and I don't have the time to test this right now, but if I ever do, I'll post an update here. Isn't wayland supposed to eliminate the need for bumblebee? I'm still fuzzy on that topic though...

X.Org X Server 1.18.3
Release Date: 2016-04-04
X Protocol Version 11, Revision 0
Build Operating System: Linux 3.16.0-4-amd64 x86_64 Debian
Current Operating System: Linux laythe 4.5.0-2-amd64 #1 SMP Debian 4.5.3-2 (2016-05-08) x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.5.0-2-amd64 root=UUID=50a03efa-01f3-4e94-92a9-d4ad458845f0 ro acpi_enforce_resources=lax
Build Date: 05 April 2016  07:00:43AM
xorg-server 2:1.18.3-1 (http://www.debian.org/support) 
Current version of pixman: 0.33.6
    Before reporting problems, check http://wiki.x.org
    to make sure that you have the latest version.

Lekensteyn commented 8 years ago

I think X is not much of an issue, but a trigger.

Can you switch to a TTY (Ctrl-Alt-F2), log in and try to power off/on the card manually using bbswitch? Repeat this twice to see if it makes a difference.

sudo tee /proc/acpi/bbswitch <<<OFF
sudo tee /proc/acpi/bbswitch <<<ON
sudo tee /proc/acpi/bbswitch <<<OFF
sudo tee /proc/acpi/bbswitch <<<ON

If that still does not hang, try this (exact output does not matter, only whether it hangs or not):

sudo lspci -vvvs 00:01:0
sudo lspci -vvvs 01:00:0

My guess is that trying to access some PCI configuration registers too fast results in failure. Why exactly this happens is something I have been trying for a week to figure out on a Clevo P651RA/GTX965M. Current key words: PCIe link training failure.

Warpgamer commented 8 years ago

Hello @Lekensteyn, switching GPUs manually causes no issue. Neither of the two lspci commands hangs either, though the second one produces no output at all.

However, I've found that if I disable the discrete card at boot with bbswitch, the system won't boot properly: on loading gnome, it freezes; visual artifacts may appear in the console at the instant of the freeze, and nothing but the power button responds. All this while being on the integrated intel card.

Warp

jkehler commented 8 years ago

@Lekensteyn I finally got around to trying what you had suggested above. Switching to a TTY and repeatedly turning the GPU on and off did not result in any sort of hard lock for me.

But when I ran your second set of commands the first one outputted the following.

01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel modules: nouveau, nvidia_drm, nvidia

The 2nd command didn't output anything. But then I ran the first command a 2nd time and it resulted in a hard-lock for me.

Warpgamer commented 8 years ago

@jkehler Personally I boot with parameter rdblacklist=nouveau, and I don't have that issue. We don't have the exact same card model though.

carlinux commented 8 years ago

Same problems here. It won't boot with discrete graphics OFF (it totally hangs), and running lspci freezes the system in certain situations.
This is a Skylake MSI with a 970M.

I'll check the laptop DSDT/SSDT later to try to find the _OFF/ methods of the nvidia pci.

Is there anything I can do to help with this?

jgkamat commented 8 years ago

I personally think this is an issue in bbswitch, rather than optimus, but I'm not sure...

I'm also free to test if anyone has any way to debug hard faults 😢

Lekensteyn commented 8 years ago

@Warpgamer If you disable nouveau and bbswitch your battery life will drain 2-3x as fast, the fan might spin more often and the heat increases. Writing OFF to bbswitch has no effect if nouveau or nvidia are loaded, check the dmesg output for such events.
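
Before writing OFF, one could check whether a GPU driver is still loaded, roughly as sketched below. The sketch reads the standard /proc/modules list (module name in the first column); the function names and the MODULES_FILE override are hypothetical.

```shell
#!/bin/sh
# Sketch: detect loaded nouveau/nvidia modules, which prevent bbswitch
# from powering the card off. MODULES_FILE is a hypothetical override.
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

loaded_gpu_modules() {
    # Print the names of loaded nouveau/nvidia* modules, one per line.
    awk '$1 == "nouveau" || $1 ~ /^nvidia/ { print $1 }' "$MODULES_FILE"
}

can_switch_off() {
    # Succeeds only when no nouveau/nvidia module is loaded.
    [ -z "$(loaded_gpu_modules)" ]
}

# Usage:
#   if can_switch_off; then echo OFF > /proc/acpi/bbswitch; fi
#   dmesg | grep bbswitch    # see how bbswitch reacted to the request
```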

@carlinux Can you report the full MSI model and acpidump? If you are affected by the same issue as the Clevo P6xxRxx models, then you can try booting with acpi_osi="!Windows 2015" to disable the faulty firmware code path.

carlinux commented 8 years ago

@Lekensteyn I'll try that, thanks. The model is MSI GS40 6QE Phantom. As for the acpidump, I guess you're asking for the DSDT and the SSDT with the methods, right? Attached here: DLS.zip

As a workaround for having a working laptop, I already successfully modified a DSDT and injected it with Clover in a Hackintosh installation to deactivate the nvidia card for good.
Attached: DSDT_noNVidia.dsl.zip

As far as I know the same acpi methods and fixes should work on a Linux machine, but I don't know how I could inject/execute them. Is there a way to load my own DSDT on a linux machine? I'll investigate it, but I'm asking here anyway. Thanks

Lekensteyn commented 8 years ago

@carlinux Patching DSDT like that should not be needed for Linux. It is possible, but your kernel will be marked as tainted. I found all ACPI tables in the BIOS from https://www.msi.com/Laptop/support/GS40-6QE-Phantom.html (E14A1IMS.10D) and matched those against your DSDT/SSDT files. The methods look like https://github.com/Bumblebee-Project/bbswitch/issues/134.

Have you tried using nouveau instead of bbswitch? If the problem persists with nouveau, could you apply https://lekensteyn.nl/files/linux-v4.6-pcipm-nouveau-pm2.patch on top of Linux 4.6 and try nouveau again?

sylvio-neto commented 8 years ago

Same thing here:

1 - GDM + Wayland starts without any freeze
2 - Starting X with "tee /proc/acpi/bbswitch <<<ON" solves the problem
3 - Starting X without that command freezes with a hard lock
4 - No logs in any output, just hard reset
5 - Linux arch 4.6.3-1-ARCH #1 SMP PREEMPT Fri Jun 24 21:19:13 CEST 2016 x86_64 GNU/Linux
6 - SchenkerXMG P506 (clevo) = Nvidia GTX 970 + Intel Skylake
7 - Intel microcode loaded: revision 0x8a, date 06.04.2016

Zipristin commented 8 years ago

Thanks @Lekensteyn ! I have a Clevo P650RE6 with a 970m. Booting with acpi_osi="!Windows 2015" was the thing that fixed the freezes and hard-locks for me with bumblebee. After months of headaches now I'm able to use my optimus laptop without windows 10. My laptop has the latest bios.

I tried it and I've been playing Talos Principle with primusrun inside a Manjaro Live usb session (Manjaro comes with working bumblebee out of the box) without any issues.

jgkamat commented 8 years ago

@Zipristin That fixed it for me too! You are officially the best person on the internet! :smile:

I don't really have any idea how this works, but it would be nice if this could somehow be worked around within bumblebee (but I don't have high hopes, because this is a kernel option). As of now, the ubuntu 16.04 live disk hard locks for me due to this (in default mode).

(as a side note, running nvidia-smi without optirun hard locked for me too, and with acpi_osi="!Windows 2015", that errors properly, so this looks like it will solve my intermittent hard locks as well).

Zipristin commented 8 years ago

@jgkamat Booting default ubuntu 16.04 live freezes for me if I boot without nouveau.modeset=0. My boot options are nouveau.modeset=0 acpi_osi=Linux acpi_osi="!Windows 2015"

I would also like to know how this acpi_osi option really works, but anyway it looks like a bios/firmware bug in the laptop more than a bumblebee bug.

jkehler commented 8 years ago

I tried using the acpi_osi="!Windows 2015" option on my laptop and it did not fix the issue for me. I still get hard locks when starting X11 with the Nvidia turned OFF. I presume this fix only works for the Clevo laptops since mine is a MSI GE62 Skylake.

jkehler commented 8 years ago

Also, here are some interesting lines I found in my boot log. I'm not sure if they are helpful/relevant for isolating the source of the problem.

Jul 17 15:54:11 arch kernel: ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
Jul 17 15:54:11 arch kernel: input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input11
Jul 17 15:54:11 arch kernel: ACPI Exception: AE_NOT_FOUND, Evaluating _DOD (20160108/video-1241)
Jul 17 15:54:11 arch kernel: ACPI: Video Device [PEGP] (multi-head: no  rom: yes  post: no)
Jul 17 15:54:11 arch kernel: input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:12/LNXVIDEO:01/input/input12

...

Jul 17 15:54:12 arch kernel: bbswitch: version 0.8
Jul 17 15:54:12 arch kernel: bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.GFX0
Jul 17 15:54:12 arch kernel: bbswitch: Found discrete VGA device 0000:01:00.0: \_SB_.PCI0.PEG0.PEGP
Jul 17 15:54:12 arch kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95)
Jul 17 15:54:12 arch kernel: bbswitch: detected an Optimus _DSM function
Jul 17 15:54:12 arch kernel: pci 0000:01:00.0: enabling device (0006 -> 0007)
Jul 17 15:54:12 arch kernel: bbswitch: Succesfully loaded. Discrete card 0000:01:00.0 is on
Jul 17 15:54:12 arch bumblebeed[565]: [    4.778685] [INFO]/usr/bin/bumblebeed 3.2.1 started
Jul 17 15:54:12 arch kernel: bbswitch: disabling discrete graphics
Jul 17 15:54:12 arch kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95)

Lekensteyn commented 8 years ago

@Zipristin The Ubuntu 16.04 kernel might be a bit too old for nouveau and your new hardware. The acpi_osi="!Windows 2015" line works around a firmware incompatibility with Linux (still investigating how to solve this).

@jkehler All messages look normal (the type mismatch for DSM can be ignored). Since you mentioned a GTX 960M and MSI GE62, I take it you are referring to the MSI GE62 Apache Pro (6th gen, GTX 960M). If you have not already, can you post the tar.gz following the instructions on https://bugs.launchpad.net/lpbugreporter/+bug/752542?

jkehler commented 8 years ago

@Lekensteyn I've uploaded it to the launchpad page and I will also upload it here.

The exact model page is https://www.msi.com/Laptop/GE62-6QD-APACHE-PRO.html#hero-overview

Micro-Star_International_Co.,_Ltd.-GE62_6QD.tar.gz

george-harwell commented 8 years ago

@Lekensteyn @jkehler I also have a MSI ge62 with a 960M experiencing the exact same behaviour as jkehler. I have tried the same fixes with the same failed results as well and I get the same lockup behaviour running lspci twice (hard lock on 2nd time).

sylvio-neto commented 8 years ago

@Zipristin Thank you very much for your fix. It worked for me after I realized that the syntax should be different in Arch Linux. The system was refusing to build the grub configuration. My system is a Clevo, so maybe it only fixes things for this line of laptops. I would like to share my grub cmdline to show the syntax that worked for me:

GRUB_CMDLINE_LINUX_DEFAULT="acpi_osi=\"!Windows 2015\" rcutree.rcu_idle_gp_delay=1 intel_iommu=on"

The last part is just for VT-X processors and virtualization. I also enabled the latest revision of Intel microcode.

Thank you very much, All the best,

Lekensteyn commented 8 years ago

@jgkamat This issue gets a bit overloaded with different laptops... sorry for that. Can you also follow the instructions on https://bugs.launchpad.net/lpbugreporter/+bug/752542 for obtaining the required information?

@jkehler It looks like your MSI GE62 Apache Pro has a similar PGON function definition, except that it does not run into an infinite loop. You shouldn't be seeing AML_INFINITE_LOOP, but I expect that your card will stay off (causing lock ups in the nouveau and possibly nvidia drivers). Can you provide a full dmesg when this occurs? Unfortunately the Clevo workaround does not work for you, but the other code path could be triggered if you somehow force Windows 2009 (Win7) to be the highest reported value for acpi_osi, possibly with acpi_osi="!Windows 2012" acpi_osi="!Windows 2013" acpi_osi="!Windows 2015" (disabling Win 8, 8.1 and 10).

jgkamat commented 8 years ago

Here is the file output, let me know if you need anything else! Notebook-N15_17RF.tar.gz

Also regarding quotes, I used single quotes around double quotes: GRUB_CMDLINE_LINUX_DEFAULT='acpi_osi="!Windows 2015"'.
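
Since /etc/default/grub is read as a shell script when the GRUB configuration is generated, the two quoting styles mentioned in this thread (escaped double quotes, and single quotes around double quotes) should give the same value. A small sketch to confirm that, using made-up variable names in place of GRUB_CMDLINE_LINUX_DEFAULT:

```shell
#!/bin/sh
# Sketch: compare the two ways of quoting the acpi_osi value.
# "escaped" and "single" are illustrative names only.
escaped="acpi_osi=\"!Windows 2015\""
single='acpi_osi="!Windows 2015"'

# Both lines print the parameter with its inner double quotes intact.
printf '%s\n' "$escaped" "$single"

if [ "$escaped" = "$single" ]; then
    echo "both forms yield the same kernel parameter"
fi
```

After editing the real file, the configuration still has to be regenerated (for example with update-grub on Debian, or grub-mkconfig -o /boot/grub/grub.cfg on Arch) before the change takes effect.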

jkehler commented 8 years ago

@Lekensteyn I'm not exactly sure how I can provide you a full dmesg if the laptop is doing a hard lock.

However, I tried your suggestion of forcing acpi_osi to Windows 2009 only by using acpi_osi=! acpi_osi=Windows 2009 and I am no longer getting a hard lock when starting X11 with the Nvidia turned OFF.
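
To confirm which acpi_osi values the running kernel actually received, something like the sketch below may help. Note that a value containing a space, such as Windows 2009, has to be quoted on the kernel command line or it is split into two parameters. The function name and the CMDLINE_FILE override are hypothetical; the tokenizer only handles double quotes, which may wrap either the value or the whole parameter depending on the bootloader.

```shell
#!/bin/sh
# Sketch: list the acpi_osi parameters from the kernel command line.
# CMDLINE_FILE is a hypothetical override for testing.
CMDLINE_FILE="${CMDLINE_FILE:-/proc/cmdline}"

show_acpi_osi() {
    awk '{
        tok = ""; inq = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            # A double quote toggles quoting and is dropped from the token.
            if (c == "\"") { inq = !inq; continue }
            # An unquoted space ends the current token.
            if (c == " " && !inq) {
                if (tok ~ /^acpi_osi=/) print tok
                tok = ""
                continue
            }
            tok = tok c
        }
        if (tok ~ /^acpi_osi=/) print tok
    }' "$CMDLINE_FILE"
}

# Usage: show_acpi_osi
```

If the output shows acpi_osi=Windows without the year, the quotes were lost somewhere between the GRUB configuration and the kernel.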

Thanks a bunch for the suggestion! Are there any major caveats to this workaround that you know of, though?

george-harwell commented 8 years ago

@jkehler This just worked for me as well! I am so happy! Also @sylvio-neto ignore what I said earlier about the ! in grub I was completely misunderstanding the documentation. (english not my first language). ! means remove things but without it adds them or something.

Thank you @Lekensteyn !!!

Lekensteyn commented 8 years ago

@jgkamat Based on your acpidump, I can confirm that the same issue exists on your Clevo N155RF laptop and that the workaround acpi_osi="!Windows 2015" will also work for you (as you reported before). I took the liberty to post it to the DSDT bug as well.

@jkehler Do you also have a hard lockup when logging into a console with the nouveau driver (not the nvidia blob)? Can you try to reproduce it without X? X touches so many things in the stack that a hard lockup is more likely to happen. So, switch to a console, modprobe nouveau, and wait at least five seconds for runtime PM to kick in. Then execute lspci -d10de: (which might cause the command to hang, but executing, say, dmesg > dmesg.txt; sync should still be possible).
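
The console test described above could be scripted roughly as follows. This is a sketch rather than a tested procedure: the function name, the optional output-file argument and the extra two-second pause are assumptions, and lspci is started in the background so that a hanging lspci does not block saving the log.

```shell
#!/bin/sh
# Sketch: load nouveau, let runtime PM suspend the card, poke the NVIDIA
# device with lspci, then save the kernel log to disk.
nouveau_runtime_pm_test() {
    out="${1:-dmesg.txt}"      # hypothetical optional output file
    modprobe nouveau || return 1
    sleep 5                    # wait for runtime PM to kick in
    lspci -d 10de: &           # 10de is the NVIDIA PCI vendor ID; may hang
    sleep 2                    # give lspci a moment to touch the device
    dmesg > "$out"
    sync                       # flush the log to disk before a possible lockup
}

# Usage (as root, from a text console without X running):
#   nouveau_runtime_pm_test
```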

As for side-effects, maybe there are some other code paths that are less efficient, but normally it should not be too bad. This is really a workaround until the root cause is found (currently comparing PCI config space dumps from Windows with Linux, hopefully that yields something).

DewaldV commented 8 years ago

I'm having the exact same issue on a different make and model, a Gigabyte P35W V5.

It's a similar situation with Skylake integrated graphics and nVidia Geforce 970M dedicated and if I turn the nvidia card off I get a hard-lock on login via GDM. I'm running Fedora 24 with the latest Kernel, 4.6.4.

I've tried excluding Windows 10 with acpi_osi="!Windows 2015" but I got the exact same lockup with the nvidia card OFF at login so I suspect it's a slightly different issue.

I've included the generated dump from the launchpad bug page in hope that it helps shine more light on this issue.

GIGABYTE-P35V5.zip
