Bumblebee-Project / bbswitch

Disable discrete graphics (currently nvidia only)
GNU General Public License v2.0

T440s with kernel >= 3.15 doesn't power off properly (Analysis and possible solution included !) #112

Open smunaut opened 9 years ago

smunaut commented 9 years ago

Hi,

So I'm using a T440s with a modern kernel that reports "Windows 2013" compatibility (and soon Windows 2015 with kernel 4.2). This breaks bbswitch because of some changes this triggers in the ACPI table. Manually overriding acpi_osi does fix the issue and allows the current bbswitch to work, but what I'm looking at here is how to make it work with the "new" Win 8.1 method of shutting down the card.

So the main symptoms of the issue are :

This is the DSDT table from the T440s with the latest bios (which even has Windows 10 support) : http://pastebin.com/raw.php?i=C6Q3A8aa

The important thing to note is that when the "Windows 2013" string is found, OSYS is set to 0x07DD. This in turn causes VMSH to be set to 1, which causes SB.PCI0.PEG.VID._PS3 to NOT call GPOF ... and so the card is never really turned off completely.
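As a side note, the OSYS values in this DSDT look like the Windows "year" written in hex, which is easy to check from a shell. This is only a small sketch; the helper name osys_year is made up for illustration and is not part of bbswitch or ACPI tooling.

```shell
# Hypothetical helper: print an OSYS value from the DSDT as a decimal number.
osys_year() {
    printf '%d\n' "$1"
}

# 0x07DD is the value set when the "Windows 2013" string is found.
osys_year 0x07DD
```

The same helper can be pointed at any other OSYS constant found in a decompiled DSDT.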

Now if you look at how GPOF can be called, you can see it will be called as part of NVP3 power resource which is _PR3 ... but on the node SB.PCI0.PEG_ and not SB.PCI0.PEG.VID !

So basically you need to put the PCIe root port (parent pci device) in D3 and not just the card.

I tested this and it indeed triggered the expected power saving and seemed to behave exactly as if I had tweaked acpi_osi.

ArchangeGabriel commented 8 years ago

Yes, so this probably explains why we have a lot of people reporting issues after going from 3.14 to 3.16, and others after going to 4.1.

klebed commented 8 years ago

Does that mean that the Lenovo BIOS disables something if we don't tell it that we are booting Windows 8, or did something else happen?

ArchangeGabriel commented 8 years ago

Yes, the BIOS disables some functions if you don’t tell it that you’re booting the latest Windows.

klebed commented 8 years ago

And then the solution is to add acpi_osi="Windows 2013" to the kernel boot parameters? I've seen some reports of people getting fs corruption and crashes after doing that. Is it worth doing?

ArchangeGabriel commented 8 years ago

This is automatically added since kernel 3.15. So you should either add acpi_osi="!Windows 2013" (not recommended) or try @smunaut's patch (look at https://github.com/Bumblebee-Project/bbswitch/issues/112#issuecomment-124669413).

klebed commented 8 years ago

Ok... I've tested the hacked version of bbswitch. At least now it seems to be working.

Here are the logs from running glxgears with optirun and then exiting:

[   17.180425] thinkpad_ec: thinkpad_ec_request_row: arg0 rejected: (0x01:0x00)->0x00
[   17.180429] thinkpad_ec: thinkpad_ec_read_row: failed requesting row: (0x01:0x00)->0xfffffffb
[   17.180431] thinkpad_ec: initial ec test failed
[   17.215779] thinkpad_ec: thinkpad_ec_request_row: arg0 rejected: (0x01:0x00)->0x00
[   17.215782] thinkpad_ec: thinkpad_ec_read_row: failed requesting row: (0x01:0x00)->0xfffffffb
[   17.215784] thinkpad_ec: initial ec test failed
[  116.236037] bbswitch: enabling discrete graphics
[  116.508646] thinkpad_acpi: EC reports that Thermal Table has changed
[  116.585041] nvidia: module license 'NVIDIA' taints kernel.
[  116.585045] Disabling lock debugging due to kernel taint
[  116.594804] vgaarb: device changed decodes: PCI:0000:04:00.0,olddecodes=io+mem,decodes=none:owns=none
[  116.595109] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:04:00.0 on minor 1
[  116.595114] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  340.96  Sun Nov  8 22:33:28 PST 2015
[  116.858245] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[  116.858321] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[  116.858373] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[  116.858425] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[  116.858475] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[  116.858663] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[  116.858830] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[  116.858883] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[  116.883277] ACPI Error: Field [TBF3] at 524288 exceeds Buffer [NULL] size 262144 (bits) (20141107/dsopcode-236)
[  116.883280] ACPI Error: Method parse/execution failed [\_SB_.PCI0.PEG_.VID_.GETB] (Node ffff8803318b4be0), AE_AML_BUFFER_LIMIT (20141107/psparse-536)
[  116.883285] ACPI Error: Method parse/execution failed [\_SB_.PCI0.PEG_.VID_._ROM] (Node ffff8803318b4bb8), AE_AML_BUFFER_LIMIT (20141107/psparse-536)
[  116.890749] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[  117.064022] thinkpad_acpi: asked for hotkey mask 0x0070ffbf, but firmware forced it to 0x0070ffbb
[  124.680820] [drm] Module unloaded
[  124.726666] bbswitch: disabling discrete graphics
[  124.726675] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[  124.741243] pci_raw_set_power_state: 51 callbacks suppressed
[  124.741247] pci 0000:04:00.0: Refused to change power state, currently in D0
[  124.757917] thinkpad_acpi: EC reports that Thermal Table has changed
ArchangeGabriel commented 8 years ago

@smunaut What can tell us whether it works here? Does /proc/acpi/bbswitch correctly see the results (i.e. OFF with the hack, ON without)?

klebed commented 8 years ago
user@lenovo-T440s:~$ cat /proc/acpi/bbswitch
0000:04:00.0 OFF

*here I've started glxgears in another terminal

user@lenovo-T440s:~$ cat /proc/acpi/bbswitch
0000:04:00.0 ON
user@lenovo-T440s:~$ cat /proc/acpi/bbswitch
0000:04:00.0 ON

*here I've stopped glxgears in another terminal

user@lenovo-T440s:~$ cat /proc/acpi/bbswitch
0000:04:00.0 OFF

And then having some tee:

user@lenovo-T440s:~$ sudo tee /proc/acpi/bbswitch <<<ON
ON
user@lenovo-T440s:~$ cat /proc/acpi/bbswitch
0000:04:00.0 ON
user@lenovo-T440s:~$ sudo tee /proc/acpi/bbswitch <<<OFF
OFF
user@lenovo-T440s:~$ cat /proc/acpi/bbswitch
0000:04:00.0 OFF
smunaut commented 8 years ago

Having lspci show "rev ff" for the card has been the best indicator of a true OFF state for me.
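A small sketch of that check, for anyone who wants to script it. The function name is_rev_ff is invented for this example, and 04:00.0 is just the address the NVIDIA card has in the logs in this thread; yours may differ.

```shell
# Hypothetical helper: succeed if an lspci output line reports "(rev ff)".
is_rev_ff() {
    case "$1" in
        *"(rev ff)"*) return 0 ;;
        *)            return 1 ;;
    esac
}

# 04:00.0 is the card's address on the machines in this thread; adjust as needed.
line=$(lspci -s 04:00.0 2>/dev/null)
if is_rev_ff "$line"; then
    echo "card reports rev ff (looks powered off)"
else
    echo "card does not report rev ff (still on, or not found)"
fi
```

Note that this only looks at what lspci prints; it does not measure power draw.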

ArchangeGabriel commented 8 years ago

OK, nice to see. So at least there is some progress for you here.

ArchangeGabriel commented 8 years ago

@smunaut That’s what bbswitch is looking at normally.

klebed commented 8 years ago

Yep, and idle temperature dropped to +44.0°C approx. Seems like we reduced some carbon dioxide emissions here. :)

ruffe972 commented 8 years ago

Sorry, I don't understand a thing, but is my problem related to this bug? The problem: optirun enables nvidia card, but fails to disable it later. Enabling/disabling nvidia manually works, though. Arch linux, latest kernel. I have acpi_osi= (it's blank after the '=') for my laptop brightness keys to work.

emanuil-tolev commented 8 years ago

And you have a Lenovo Thinkpad?

ruffe972 commented 8 years ago

@emanuil-tolev: I have ASUS N550JV.

emanuil-tolev commented 8 years ago

Hmm. It is possible you are affected by the same ACPI update problems which plagued the Thinkpad T440s and T450s (and presumably their cousins T440, T450 and who knows how many more Lenovo series). If you don't have the time to get into whether your machine has the same problem as these Thinkpads, I would attempt following the instructions on the forked branch: https://github.com/smunaut/bbswitch/tree/hack-t440s and checking dmesg | grep bbswitch.

Actually, you're saying optirun fails to disable your card "later" - do you see evidence of the turning off process failing if you do dmesg | grep bbswitch after using optirun?

Notice this part of the original post:

Using vanilla bbswitch : Switching OFF does put it to D3, but it's not entirely off. The power usage is lower but not as low as it should be (in my case, about 1W lower than full power while it should be more like 2.5W). Also, bbswitch reports the card as "ON" because lspci still works fine and you can still access the config space. There is no way to turn the card back ON and so loading nvidia driver will actually crash trying to access a card that's only half powered.

If you are using vanilla bbswitch (e.g. you installed bumblebee from your Arch repos and the package maintainer is not using @smunaut's fork to compile the Arch version, but rather just using Bumblebee-Project/bbswitch like Ubuntu does), then you should see those symptoms specifically. It sounds like it's possibly the same problem as "your card stays on" - I presume you mean cat /proc/acpi/bbswitch says "ON" after optirun has finished. Trying it out and seeing it go "OFF" would confirm this.

ruffe972 commented 8 years ago

"do you see evidence of the turning off process failing if you do dmesg | grep bbswitch after using optirun?" - Yes. I mean before running "optirun firefox" dmesg | grep bbswitch says "disabling graphics card". When this instance of Firefox is running, dmesg says "enabling graphics card". I close Firefox, but the last line of dmesg is "enabling..." and /proc/acpi/bbswitch prints ON, not OFF. "Trying it out and seeing it go "OFF" would confirm this." - By trying it out do you mean trying smunaut's hack?

morj commented 8 years ago

@smunaut hello! Any chance to have your patched repository back until we have this fix?

smunaut commented 8 years ago

!?!? WTF ... I never deleted that repo.

smunaut commented 8 years ago

Does anyone have a copy ? I don't have a local checkout at all ...

doudou commented 8 years ago

'disable_root_port' branch at https://github.com/doudou/bbswitch

smunaut commented 8 years ago

tx !

Valentin-N commented 8 years ago

Ubuntu 14.04.4 now ships with Kernel 4.2, which means the previous fix (acpi_osi='!Windows 2013') no longer works. Are we any closer to releasing a fix for this issue?

[later edit] Replying to myself so others don't have to search. For Kernel 4.2 all I had to do was to replace acpi_osi='!Windows 2013' with acpi_osi=Linux in the GRUB boot parameters.

NikolausDemmel commented 8 years ago

@valneacsu: Could you quickly summarize what exactly "works" for you with acpi_osi=Linux and what else you needed to do? Do you use the patched bbswitch from https://github.com/doudou/bbswitch/tree/disable_root_port ?

Valentin-N commented 8 years ago

Sorry, I did not mention that I am using bbswitch v0.8 from the Bumblebee PPA. Other than that I did not have to do anything special, but I've also been using nvidia-355 for some time.

smunaut commented 8 years ago

If you make Linux not advertise any "Windows" string to ACPI (using acpi_osi=Linux) then the disable_root_port shouldn't be necessary. It basically uses the "old" switch off method path in the ACPI bios code.

NikolausDemmel commented 8 years ago

Ah ok, I didn't get that yet. And with that you can turn the card ON and OFF and it also turns off completely?

Valentin-N commented 8 years ago

I haven't measured the power consumption, but both /proc/acpi/bbswitch and 'lspci' show the card as off.

NikolausDemmel commented 8 years ago

Thanks for the explanations!

wuqso commented 8 years ago

I have a Lenovo T450s with Fedora 23.

#uname -r
4.4.6-300.fc23.x86_64

I downloaded doudou's bbswitch code from https://github.com/doudou/bbswitch, built it with 'make' and loaded it with 'make load' with no error. But the Nvidia card is still on.

# cat /proc/acpi/bbswitch
0000:04:00.0 ON
# tee /proc/acpi/bbswitch <<< OFF
OFF
# cat /proc/acpi/bbswitch
0000:04:00.0 ON
# lspci -vnn | grep '\''[030[02]\]'
00:02.0 VGA compatible controller [0300]: Intel Corporation Broadwell-U Integrated Graphics [8086:1616] (rev 09) (prog-if 00 [VGA controller])
04:00.0 3D controller [0302]: NVIDIA Corporation GM108M [GeForce 940M] [10de:1347] (rev a2)

What's the problem? I'm a user rather than a programmer. I just want a long battery life in linux system.

wuqso commented 8 years ago

@valneacsu, using acpi_osi=Linux solves my problem. Thank you!

Lekensteyn commented 8 years ago

For those who are wondering why the methods are different depending on the kernel version (or the setting of acpi_osi), apparently Windows 10 (Windows 2015) drops the use of DSM completely. For that OS you have to toggle the power resources (basically the parent PCIe port).

David Airlie posted a patch to handle the power resources for nouveau/vgaswitcheroo, but I don't know what happened to those patches (https://lkml.org/lkml/2016/3/9/65). You could probably check the power resources of the parent device, but that smells hacky if you call it directly (shouldn't the Linux PM take care of this?). See also the analysis at https://github.com/Bumblebee-Project/bbswitch/issues/115#issuecomment-218622306. While turning off has no DSM methods, surprisingly there are some involved with turning it on. Can you reproduce/match this with your machines?

(Will comment later on the patch, my battery is dying)

doudou commented 8 years ago

David Airlie's patch definitely looks interesting. Even more so because the PM handler is exported, and could therefore be used outside of vgaswitcheroo-enabled drivers.

doudou commented 8 years ago

If you look at the LKML thread, it seems that the kernel devs are tackling the root problem ... that is adding runtime PM to the PCIe root ports. What it means for the future of bbswitch, I'm not sure, since it will work only if the card itself is put in D3 first.

While turning off has no DSM methods, surprisingly there are some involved with turning it on. Can you reproduce/match this with your machines?

On mine, the NVidia suspend/resume with pcie-root-port in D3 works without any calls to any DSM.

Lekensteyn commented 8 years ago

I read the full thread (and some of the linked patchwork entries). David's patch adds a function to enable PM operations that power off/on the parent (PCIe port) device and hooks it into nouveau. Rafael commented that the PM patches for PCIe ports are still under discussion. Especially note:

I'm guessing on Windows this all happens automatically.

PCIe ports are power-managed by (newer) Windows AFAICS, but we know for a fact that this simply doesn't work reliably on some older hardware which is why we don't do that. I suppose that the Windows in question uses a cut-off date or similar to decide what to do with PCIe ports PM.

Edit: latest version of PCIe patches are scheduled for v4.7, see http://article.gmane.org/gmane.linux.power-management.general/75997 and https://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/log/?h=pci/pm

As for bbswitch's future, it is currently broken for newer devices so I'll probably do this:

In case upstream Linux adds proper PM to the PCIe root port, then bbswitch should not interfere with that, so there also needs to be checks for that (PM domains?). Some reading material: https://www.kernel.org/doc/Documentation/power/devices.txt

bbswitch differentiates itself from nouveau in that it will always keep the device powered off unless explicitly asked by the user (via /proc/acpi/bbswitch). Currently nouveau will flip the card on when you execute lspci for example or open /dev/dri/.... Note that I will probably use nouveau though since external screens are connected via the Nvidia card on my laptop.

On mine, the NVidia suspend/resume with pcie-root-port in D3 works without any calls to any DSM.

Did you observe this on Windows? Is Windows doing any DSM calls after D0?

doudou commented 8 years ago

Currently nouveau will flip the card on when you execute lspci for example or open /dev/dri/....

This is due to the usage of runtime PM. lspci triggers a wakeup. I had a version of bbswitch that was purely relying on runtime PM and noticed that.

Once all the PCIe root port work is in the kernel, I'm planning to try to use nouveau instead of bbswitch just for the PM work. I still rely on the nvidia proprietary driver for proper opengl support.

Did you observe this on Windows?

I would not even begin to know how I could check that.

shadoxx commented 8 years ago

Just wanted to confirm this issue on a Thinkpad W550s running BIOS version 1.14. bbswitch worked fine on the BIOS that was shipped with this laptop which was version 1.06. I would downgrade the BIOS, but Lenovo has disabled that functionality as well, citing security concerns.

Lekensteyn commented 8 years ago

@shadoxx https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1452979 lists acpidump for a W550s (Quadro K620M, BIOS 1.02). The SSDT changes are not relevant. The DSDT has changed a bit, mainly wrt USB 3.0, plus some TPM changes. These are the hunks relevant for graphics (GPON, GPOF and Win10 detection):

diff --git a/acpi-1.02/dsdt.dsl b/acpi-1.14/dsdt.dsl
index 53b2c77..9717867 100644
--- a/acpi-1.02/dsdt.dsl
+++ b/acpi-1.14/dsdt.dsl
@@ -1051,24 +1051,30 @@ DefinitionBlock ("dsdt.aml", "DSDT", 1, "LENOVO", "TP-N11  ", 0x00001020)
                 If (\_OSI ("Windows 2012"))
                 {
                     \WIN8 = 0x01
                     OSYS = 0x07DC
                 }

                 If (\_OSI ("Windows 2013"))
                 {
                     \WIN8 = 0x01
                     OSYS = 0x07DD
                 }

+                If (\_OSI ("Windows 2015"))
+                {
+                    \WIN8 = 0x01
+                    OSYS = 0x07DF
+                }
+
                 If (\_OSI ("Linux"))
                 {
                     \LNUX = 0x01
                     OSYS = 0x03E8
                 }

                 If (\_OSI ("FreeBSD"))
                 {
                     \LNUX = 0x01
                     OSYS = 0x03E8
                 }
             }
@@ -6774,32 +6770,36 @@ DefinitionBlock ("dsdt.aml", "DSDT", 1, "LENOVO", "TP-N11  ", 0x00001020)
                         {
                             GPOF (0x00)
                         }
                     }

                     Method (GPON, 1, NotSerialized)
                     {
                         If (ISOP ())
                         {
                             If (DGOS)
                             {
                                 \VHYB (0x02, 0x00)
-                                Sleep (0x64)
+                                Sleep (0x14)
                                 If ((ToInteger (Arg0) == 0x00)) {}
                                 \VHYB (0x00, 0x01)
-                                Local0 = 0x00
-                                While ((Local0 < 0x5A))
+                                Sleep (0x14)
+                                Local2 = \VHYB (0x0E, 0x00)
+                                While ((Local2 != 0x0F))
                                 {
-                                    Local0 += One
-                                    Stall (0x64)
+                                    \VHYB (0x00, 0x00)
+                                    Sleep (0x14)
+                                    \VHYB (0x00, 0x01)
+                                    Sleep (0x0A)
+                                    Local2 = \VHYB (0x0E, 0x00)
                                 }

                                 \VHYB (0x02, 0x01)
                                 Sleep (0x01)
                                 \VHYB (0x08, 0x01)
                                 Local0 = 0x0A
                                 Local1 = 0x32
                                 LREN = LTRS /* \_SB_.PCI0.PEG_.LTRS */
                                 CEDR = One
                                 While (Local1)
                                 {
                                     Sleep (Local0)
@@ -6851,31 +6851,25 @@ DefinitionBlock ("dsdt.aml", "DSDT", 1, "LENOVO", "TP-N11  ", 0x00001020)

                     Method (GPOF, 1, NotSerialized)
                     {
                         If (ISOP ())
                         {
                             If ((VMSH || (\_SB.PCI0.PEG.VID.OMPR == 0x03)))
                             {
                                 LTRS = LREN /* \_SB_.PCI0.PEG_.LREN */
                                 \SWTT (0x00)
                                 \VHYB (0x08, 0x00)
                                 \VHYB (0x08, 0x02)
                                 \VHYB (0x02, 0x00)
-                                Local0 = 0x00
-                                While ((Local0 < 0x1E))
-                                {
-                                    Local0 += One
-                                    Stall (0x64)
-                                }
-
+                                Sleep (0x09)
                                 \VHYB (0x00, 0x00)
                                 If ((ToInteger (Arg0) == 0x00)) {}
                                 DGOS = One
                                 \_SB.PCI0.PEG.VID.OMPR = 0x02
                             }
                         }
                     }

                     Method (_STA, 0, NotSerialized)  // _STA: Status
                     {
                         Return (0x0F)
                     }

The ACPI changes do not look very significant; it is mainly the timings that changed. GPOF went from 30×100µs (=3ms) to 9ms. GPON is slightly more interesting: it changes the timings and adds an SMI call (unknown function). Maybe there were other non-ACPI tunings.
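For reference, the GPOF numbers can be reproduced from the hex constants in the diff, assuming the usual ACPI semantics where Stall takes microseconds and Sleep takes milliseconds. A quick shell sketch (the variable names are only for illustration):

```shell
# Old GPOF: a loop that runs 0x1E times around Stall (0x64); Stall is in microseconds.
old_gpof_us=$(( 0x1E * 0x64 ))
# New GPOF: a single Sleep (0x09); Sleep is in milliseconds, converted to microseconds here.
new_gpof_us=$(( 0x09 * 1000 ))

echo "GPOF wait, old: ${old_gpof_us} us, new: ${new_gpof_us} us"
```

This only accounts for the explicit delays, not for the time spent inside the \VHYB calls themselves.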

Anyway, maybe things get better when the power resources are used instead of DSM.

Lekensteyn commented 8 years ago

Could someone try some patch series on top of v4.7 with nouveau? See https://github.com/Bumblebee-Project/bbswitch/issues/78#issuecomment-223072012

Lekensteyn commented 8 years ago

FYI, this has been fixed for nouveau in Linux v4.8-rc1; bbswitch still needs an update though.

BernardoGO commented 8 years ago

@Lekensteyn Does it fix the suspension problem for Clevos? BBswitch currently does not support graphic switch using nouveau, right?

Lekensteyn commented 8 years ago

@BernardoGO What suspend problem? System sleep or runtime suspend? With nouveau, you do not need bbswitch as it is able to handle PM.

klebed commented 7 years ago

Just to leave another footprint in this epic issue.

If somebody is using Ubuntu 16.04 (and probably other distros with kernels newer than 4.4.x) with nvidia-361: it has been force-replaced with nvidia-367 (you install 361, but apt gives you 367 alongside 361), so you could experience all the problems again. First of all, use up-to-date bumblebee and primus from the bumblebee/testing PPA and install nvidia-367. Then, once that works, check bbswitch, since it has probably stopped working properly. The solution is to edit /etc/default/grub, adding acpi_osi=Linux to the GRUB_CMDLINE_LINUX parameter, and then run: sudo update-grub.

Worked with Lenovo T440s.
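For illustration, the relevant line in /etc/default/grub could look like the fragment below. This is a sketch that assumes the variable was empty before; if yours already holds other parameters, keep them and append acpi_osi=Linux to the list.

```shell
# /etc/default/grub (fragment)
# Assumption: no other parameters were set here before; otherwise append to them.
GRUB_CMDLINE_LINUX="acpi_osi=Linux"
```

After running sudo update-grub and rebooting, cat /proc/cmdline should show acpi_osi=Linux among the kernel parameters.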

RoadToDream commented 7 years ago

Thanks to @smunaut and @klebed . Worked for me.

Lacrymology commented 5 years ago

just an update, on a ThinkPad T450s, the card is constantly on and the grub fix above doesn't work (plasma never finishes loading and a bunch of error messages show up in the logs).

wuqso commented 5 years ago

Dear Tomas Neme,

Thank you for your email. I have already successfully set up bumblebee on the T450s.

Best, Jianghua Wu

-----Original Message----- From: "Tomas Neme" notifications@github.com Sent: 2019-04-25 19:55:10 (Thursday) To: Bumblebee-Project/bbswitch bbswitch@noreply.github.com Cc: wuqso jhwu@bnu.edu.cn, Comment comment@noreply.github.com Subject: [SPAM] Re: [Bumblebee-Project/bbswitch] T440s with kernel >= 3.15 doesn't power off properly (Analysis and possible solution included !) (#112)

just an update, on a ThinkPad T450s, the card is constantly on and the grub fix above doesn't work (plasma never finishes loading and a bunch of error messages show up in the logs).
