GTX970m still on and overheating while suspended

BernardoGO commented 9 years ago

I have a Clevo P650SE-A (Sager NP8651) with an NVIDIA GeForce GTX970m and Intel HD5600. The laptop has an LED that shows whether the discrete GPU is in use. I got Bumblebee working on both Ubuntu 14 and 15, and I have noticed that the dGPU LED lights up while the laptop is powering on, shutting down, or going to sleep. The problem is that the GeForce stays on during sleep, which leads to overheating with the fans off and drains 15-20% of the battery per hour. The laptop actually runs hotter and uses more battery while suspended than it does while on.

Hibernating is not an option because the laptop cannot resume from hibernation (black screen with cursor), and I don't really want to hibernate anyway since I suspend it many times per day.

This happens with Ubuntu 14 and 15. I'm using the 3.19 kernel because 4.2 does not seem to support my video card. The sleep problem does not happen on Windows using Optimus.

I'm not using UEFI. Could that have something to do with it?

[    6.003280] init: plymouth-upstart-bridge main process (266) terminated with status 1
[    6.003337] init: plymouth-upstart-bridge main process ended, respawning
[    6.004558] init: plymouth-upstart-bridge main process (267) terminated with status 1
[    6.004614] init: plymouth-upstart-bridge main process ended, respawning
[    6.005717] init: plymouth-upstart-bridge main process (269) terminated with status 1
[    6.005774] init: plymouth-upstart-bridge respawning too fast, stopped
[   13.951982] Adding 15904764k swap on /dev/sda5.  Priority:-1 extents:1 across:15904764k FS
[   14.077132] systemd-udevd[369]: starting version 204
[   14.167793] lp: driver loaded but no devices found
[   14.174461] [drm] Initialized drm 1.1.0 20060810
[   14.185978] bbswitch: module verification failed: signature and/or  required key missing - tainting kernel
[   14.186065] bbswitch: version 0.7
[   14.186068] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.GFX0
[   14.186073] bbswitch: Found discrete VGA device 0000:01:00.0: \_SB_.PCI0.PEG0.PEGP
[   14.186079] ACPI Warning: \_SB_.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[   14.186122] bbswitch: detected an Optimus _DSM function
[   14.186132] pci 0000:01:00.0: enabling device (0000 -> 0003)
[   14.186164] bbswitch: Succesfully loaded. Discrete card 0000:01:00.0 is on
[   14.188705] ppdev: user-space parallel port driver
[   14.255353] wmi: Mapper loaded
[   14.299229] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[   14.312440] Bluetooth: Core ver 2.20
[   14.312452] NET: Registered protocol family 31
[   14.312453] Bluetooth: HCI device and connection manager initialized
[   14.312455] Bluetooth: HCI socket layer initialized
[   14.312457] Bluetooth: L2CAP socket layer initialized
[   14.312462] Bluetooth: SCO socket layer initialized
[   14.314616] usbcore: registered new interface driver btusb
--
[   16.967846] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[   16.967848] Bluetooth: BNEP filters: protocol multicast
[   16.967851] Bluetooth: BNEP socket layer initialized
[   17.097357] init: cups main process (934) killed by HUP signal
[   17.097362] init: cups main process ended, respawning
[   18.053497] r8169 0000:03:00.1 eth0: link down
[   18.053521] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   18.054570] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
[   18.054757] iwlwifi 0000:04:00.0: L1 Enabled - LTR Enabled
[   18.068951] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   23.143307] bbswitch: disabling discrete graphics
[   23.143316] ACPI Warning: \_SB_.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20141107/nsarguments-95)
[   29.139937] systemd-hostnamed[1623]: Warning: nss-myhostname is not installed. Changing the local hostname might make it unresolveable. Please install nss-myhostname!
[   46.982117] audit_printk_skb: 168 callbacks suppressed
[   46.982119] audit: type=1400 audit(1445801247.879:68): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/cups/backend/cups-pdf" pid=1669 comm="apparmor_parser"
[   46.982123] audit: type=1400 audit(1445801247.879:69): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/cupsd" pid=1669 comm="apparmor_parser"
[   46.982343] audit: type=1400 audit(1445801247.879:70): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/cupsd" pid=1669 comm="apparmor_parser"
[  115.014906] wlan0: authenticate with 00:26:3e:52:11:02
[  115.018714] wlan0: direct probe to 00:26:3e:52:11:02 (try 1/3)
[  115.221157] wlan0: direct probe to 00:26:3e:52:11:02 (try 2/3)
[  115.425080] wlan0: direct probe to 00:26:3e:52:11:02 (try 3/3)

Linux bernardo-P650SE-A 3.19.0-31-generic #36~14.04.1-Ubuntu SMP Thu Oct 8 10:21:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu 14.10

https://bugs.launchpad.net/debian/+bug/752542/+attachment/4504886/+files/CLEVO-P65xSE-A.tar.gz

BernardoGO commented 8 years ago

Was it the black flickering in full screen Windows? I had it fixed by using Wayland or KDE.

It used to happen mostly when things changed on the screen.

On May 5, 2016 2:16 PM, "Jacob Mischka" notifications@github.com wrote:

4.6 fixed an unrelated issue I was having with a flickering/hanging display: https://bugs.freedesktop.org/show_bug.cgi?id=94161.


jacobmischka commented 8 years ago

No, it was overly aggressive power saving that would make the Skylake integrated graphics just stop working for a fraction of a second at random times.

BernardoGO commented 8 years ago

Do you have this black flickering as well?

jacobmischka commented 8 years ago

I did until 4.6, yes.

Edit: I mean I had the kind I mentioned. I don't have the kind you mentioned as far as I know.

BernardoGO commented 8 years ago

I'm trying it right now. Just installed it. Version: 4.6.0-1-ARCH. I can see that the problem is indeed not fixed for Broadwell either. But it seems like my flickering in Gnome is not happening anymore. I have to try it for a little longer before confirming it.

Have you tried it with nouveau? I saw somewhere that kernel 4.6 combined with the new nouveau will solve the suspend problem.

jacobmischka commented 8 years ago

I haven't tried nouveau. Isn't the performance still terrible?

jacobmischka commented 8 years ago

Wait, where did you find that kernel? 4.6.0-1-ARCH?

ArchangeGabriel commented 8 years ago

I’m not sure what linux-mainline from AUR outputs as its kernel version, but I think it’s that. ;)

jacobmischka commented 8 years ago

I'm using mainline and it didn't report that, so that's why I was asking if he found something better. Anyway, no big deal.

BernardoGO commented 8 years ago

@jacobmischka It is not mainline. I'm using linux-git from AUR. I don't know why it is not reporting itself as an RC version. It is 4.6-rc6.

@hundredyearslate Fixed what?

BenThompson22 commented 8 years ago

@jacobmischka I think that nouveau will probably perform better now since nvidia is providing the firmware for it. They released it in February and it was supposed to be available for us after the 4.6 release.

CykaBlyat22 commented 8 years ago

Are you guys also seeing graphical glitches after installing the Optimus setup with bbswitch? Especially in Cinnamon, I'm having many rendering problems.

Lekensteyn commented 8 years ago

I resumed debugging yesterday and obtained some traces from Windows 10. Files that can be analyzed are available at https://lekensteyn.nl/files/p651ra-acpi-debug/

The packet capture (containing Windbg/KD traffic for associating events with times or other information) can be dissected with https://github.com/Lekensteyn/kdnet (key 8.8.8.8). This capture file has acknowledgement packets and traffic other than UDP 51111 removed to reduce size.
kd-filtered.log is created with tr -d '\r' <kd.log | grep -vE 'ignore|[Aa]ssertion failure|^being terminated|^If you want to force|^$' > kd-filtered.log
amli.log and amli2.log are from a different session but have !amli set traceon enabled.
dmesg-4.4.0-3-ARCH.2.txt shows the infinite loop.
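
For anyone without tr/grep at hand (for example on the Windows side), the filter quoted above can be approximated in Python. This is only a sketch: the function name filter_kd_log is made up for illustration, and the regular expression is copied from the grep -vE pattern above.

```python
import re

# Same alternation as the grep -vE pattern quoted above.
NOISE = re.compile(
    r"ignore|[Aa]ssertion failure|^being terminated|^If you want to force|^$"
)


def filter_kd_log(text: str) -> str:
    """Strip carriage returns (tr -d '\\r'), then drop noise lines (grep -vE)."""
    lines = text.replace("\r", "").split("\n")
    kept = [line for line in lines if not NOISE.search(line)]
    # Re-join with a trailing newline, as grep would emit.
    return "\n".join(kept) + ("\n" if kept else "")
```

Reading kd.log, passing its contents through filter_kd_log and writing the result to kd-filtered.log should give roughly the same file as the shell pipeline.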

For an acpidump see https://github.com/Lekensteyn/acpi-stuff/tree/master/dsl/Clevo_P651RA

I have not fully analyzed it yet, but after a quick look it seems that Windows 10 first calls _PS3/_PS0 and only then it calls _DSM while bbswitch and nouveau do the opposite. To be continued...

Lekensteyn commented 8 years ago

@hundredyearslate It is probably not needed, I already got an interesting observation that matches what others have reported before (in #112, https://lkml.org/lkml/2016/3/9/65 and many other places). (nice name btw, hopefully it did not refer to my comment delays :p ;) )

(If you can easily retrieve it, then maybe having some comparison material would be nice; use !amli set traceon spewon verboseon once you have the kernel debugger attached. This requires a checked (=with debugging symbols) Windows build though, I was able to get one through my university's study association)

So the main problem is that bbswitch/nouveau still uses DSM calls which are unusable/untested for newer devices. Apparently you have to disable the parent device (likely the PCIe port) to put the Nvidia card in D3cold state. Information about this state can be found in the ACPI 6.1 specification, section 7.3.11 _PR3 (Power Resources for D3hot). Paraphrased/my interpretation: if there is a _PR3 for a device, the OS can turn off the power resources after executing _PS3 (by calling the _OFF method of those power resources, this will enter D3(cold) state).

(according to table 7-224 on page 401, D3cold is supported by providing _PR3)

My laptop for example has a \_SB.PCI0.PEG0._PR3 object evaluating to PG00 (\_SB.PCI0.PEG0.PG00). Thus I need to call \_SB.PCI0.PEG0.PG00._OFF on this device after calling \_SB.PCI0.PEG0.PEGP._PS3. (I have not found the ACPI spec line that says that _PR3 should be looked up in parent devices (PEGP) though)
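
To make the ordering concrete, here is a toy Python model of the lookup described above. It is a sketch under assumptions, not how any kernel driver does it: the PR3 dict is a hypothetical stand-in for the ACPI namespace, the helper names are invented, and walking up to the parent to find _PR3 is the observed behaviour rather than something confirmed from the spec.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the ACPI namespace: device path -> name of the
# power resource its _PR3 evaluates to. Only the PEG0 -> PG00 entry comes
# from the comment above.
PR3: Dict[str, str] = {r"\_SB.PCI0.PEG0": "PG00"}


def parent(path: str) -> Optional[str]:
    """Return the parent path, or None at the top of the namespace."""
    head, sep, _ = path.rpartition(".")
    return head if sep else None


def find_pr3_owner(device: str, pr3: Dict[str, str]) -> Optional[str]:
    """Walk from the device upwards until a node with _PR3 is found.

    Assumption: looking in parent devices is based on the Windows trace,
    not on a located spec requirement.
    """
    node: Optional[str] = device
    while node is not None:
        if node in pr3:
            return node
        node = parent(node)
    return None


def power_off_sequence(device: str, pr3: Dict[str, str]) -> List[str]:
    """List the ACPI methods to evaluate, in order, to reach D3cold."""
    calls = [f"{device}._PS3"]
    owner = find_pr3_owner(device, pr3)
    if owner is not None:
        calls.append(f"{owner}.{pr3[owner]}._OFF")
    return calls
```

For \_SB.PCI0.PEG0.PEGP this yields \_SB.PCI0.PEG0.PEGP._PS3 followed by \_SB.PCI0.PEG0.PG00._OFF, which is the order described above.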

From amli2.log I can see this sequence:

AMLI: ffffe001e8ec7040: AsyncEvalObject(\_SB.PCI0.PEG0.PEGP._PS3)
AMLI: FFFFE001E8EC7040: \_SB.PCI0.PEG0.PEGP._PS3()
ffffe001ef96f002: {
ffffe001ef96f002: If(LEqual(OPCE=0x2,0x3)=0x0)
ffffe001ef96f024: Store(0x3,_PSC)=0x3
ffffe001ef96f02b: }
AMLI: ffffe001e8ff3040: AsyncEvalObject(\_SB.PCI0.PEG0.PG00._OFF)

OPCE is initialized with 2 and is only possibly changed to 3 via the Optimus DSM method (which is apparently deprecated/not called in Windows 10). The corresponding _PS3 method is:

Method (_PS3, 0, NotSerialized) {
    If ((OPCE == 0x03)) { // <-- false (0x2 != 0x3)
        If ((DGPS == Zero)) {
            _OFF ()
            DGPS = One
        }
        OPCE = 0x02
    }
    _PSC = 0x03 // <-- executed
}

So it appears that Windows immediately turns off the power resource of the parent PCIe port after calling _PS3.
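
The branch taken in the trace can be checked with a small Python transliteration of the _PS3 method above. This is only an illustration: the class and attribute names are invented, and the initial value of DGPS is assumed to be zero since it is not shown in the excerpt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pegp:
    """Toy model of the PEGP device state used by _PS3."""
    opce: int = 0x02  # initial value as stated above
    dgps: int = 0     # assumption: initial value not shown in the excerpt
    psc: int = 0
    log: List[str] = field(default_factory=list)

    def _off(self) -> None:
        # Stand-in for PEGP._OFF; only records that it was called.
        self.log.append("PEGP._OFF")

    def ps3(self) -> None:
        """Mirror of Method (_PS3) from the DSDT excerpt."""
        if self.opce == 0x03:
            if self.dgps == 0:
                self._off()
                self.dgps = 1
            self.opce = 0x02
        self.psc = 0x03
```

With the default OPCE of 2, ps3() only sets psc to 3 and never reaches _off(), so the card is only powered down if something else (here the PG00 power resource) is switched off afterwards. Constructing Pegp(opce=0x03), as the Optimus DSM would arrange, makes ps3() call _off() itself.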

For powering on the graphics card, I see the following sequence:

AMLI: ffffe000c5ac7040: AsyncEvalObject(\_SB.PCI0.PEG0.PEGP._PS0)
AMLI: ffffe000ce289040: AsyncEvalObject(\_SB.PCI0.PEG0.PG00._ON)
AMLI: ffffe000c5ac7040: EvalNameSpaceObject(\_SB.PCI0.PEG0.PEGP._DSM)
String(:Str="------- GPS DSM --------")
String(:Str="GPS fun 2a")
AMLI: ffffe000c5ac7040: EvalNameSpaceObject(\_SB.PCI0.PEG0.PEGP._DSM)
AMLI: ffffe000c5ac7040: EvalNameSpaceObject(\_SB.PCI0.PEG0.PEGP._DSM)
AMLI: ffffe000c5ac7040: EvalNameSpaceObject(\_SB.PCI0.PEG0.PEGP._DSM)
String(:Str="------- GPS DSM --------")
String(:Str="GPS fun 19")

With the DSM parameters (in ssdt7.dsl for Clevo P651RA) being:

// calls "GPS DSM" and does some magic
AMLI: FFFFE001E8EC5040: \_SB.PCI0.PEG0.PEGP._DSM(Buffer(0x10){
    0x01,0x2d,0x13,0xa3,0xda,0x8c,0xba,0x49,0xa5,0x2e,0xbc,0x9d,0x46,0xdf
    0x6b,0x81},0x100,0x2a,Buffer(0x4){
    0x02,0x03,0x00,0x00})
// func 0x05, does more magic
AMLI: FFFFE001E8EC5040: \_SB.PCI0.PEG0.PEGP._DSM(Buffer(0x10){
    0xf8,0xd8,0x86,0xa4,0xda,0x0b,0x1b,0x47,0xa7,0x2b,0x60,0x42,0xa6,0xb5
    0xbe,0xe0},0x100,0x5,Buffer(0x4){
    0x00,0x00,0x00,0x00})
// func 0x1B, smaller magic
AMLI: FFFFE001E8EC5040: \_SB.PCI0.PEG0.PEGP._DSM(Buffer(0x10){
    0xf8,0xd8,0x86,0xa4,0xda,0x0b,0x1b,0x47,0xa7,0x2b,0x60,0x42,0xa6,0xb5
    0xbe,0xe0},0x100,0x1b,Buffer(0x4){
    0x00,0x00,0x00,0x00})
// smaller magic
AMLI: FFFFE001E8EC5040: \_SB.PCI0.PEG0.PEGP._DSM(Buffer(0x10){
    0x01,0x2d,0x13,0xa3,0xda,0x8c,0xba,0x49,0xa5,0x2e,0xbc,0x9d,0x46,0xdf
    0x6b,0x81},0x100,0x13,Buffer(0x4){
    0x04,0x00,0x00,0x00})

I wonder what those DSM methods are used for and whether these also occur on other laptop models.
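
To compare these with other models it helps to turn the 16-byte buffers into the usual GUID notation. A minimal sketch, assuming the buffers use the mixed-endian layout that ACPI's ToUUID produces (first three fields little-endian); the constant and function names are made up here.

```python
import uuid

# First _DSM argument from the "GPS DSM" calls in the trace above.
GPS_DSM_BUFFER = bytes([
    0x01, 0x2d, 0x13, 0xa3, 0xda, 0x8c, 0xba, 0x49,
    0xa5, 0x2e, 0xbc, 0x9d, 0x46, 0xdf, 0x6b, 0x81,
])

# First _DSM argument from the func 0x05 / 0x1B calls in the trace above.
SECOND_DSM_BUFFER = bytes([
    0xf8, 0xd8, 0x86, 0xa4, 0xda, 0x0b, 0x1b, 0x47,
    0xa7, 0x2b, 0x60, 0x42, 0xa6, 0xb5, 0xbe, 0xe0,
])


def acpi_buffer_to_guid(raw: bytes) -> str:
    """Decode a 16-byte _DSM UUID buffer into canonical GUID text.

    Assumes ToUUID byte order, which matches uuid.UUID(bytes_le=...).
    """
    return str(uuid.UUID(bytes_le=raw))
```

Under that assumption the first buffer decodes to a3132d01-8cda-49ba-a52e-bc9d46df6b81 and the second to a486d8f8-0bda-471b-a72b-6042a6b5bee0. The second appears to be the same byte sequence bbswitch uses for its Optimus _DSM call, so searching other acpidumps for these two strings should show whether the same methods exist elsewhere.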

CaballoSinNombre commented 8 years ago

Have you guys tried changing acpi_osi to report an older version of Windows? It works for me.

Lekensteyn commented 8 years ago

Setting acpi_osi="!Windows 2015" might work for older devices (those meant to work with Windows 7 or similar), but on newer devices it is increasingly likely not to work (because Windows 10 uses the new interface, and vendors are likely cheap and do not validate for older OSes).
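
When comparing reports it is useful to know which acpi_osi overrides a machine actually booted with. A rough Python sketch that pulls them out of a kernel command line string; the function name is hypothetical, and it assumes the parameter is written as acpi_osi=... with optional double quotes as in the example above.

```python
import shlex
from typing import List


def acpi_osi_overrides(cmdline: str) -> List[str]:
    """Return the value of every acpi_osi= parameter on a kernel command line."""
    return [
        token.split("=", 1)[1]
        for token in shlex.split(cmdline)  # shlex strips the double quotes
        if token.startswith("acpi_osi=")
    ]
```

On a running system the input would typically be the contents of /proc/cmdline. For a command line containing acpi_osi="!Windows 2015" the function returns a list with the single entry !Windows 2015, and an empty list if no override is set.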

jacobmischka commented 8 years ago

I'm sorry, I'm a bit lost with all of the recent talk in this project's issues and updates in bumblebee.

Should I be running the develop branch of Bumblebee? Is running a version that includes https://github.com/Bumblebee-Project/Bumblebee/pull/762 better than running the current stable build and blacklisting nvidia, nvidia-drm, nvidia-modeset, and nvidia-uvm in a modprobe conf file? Will any of these things make any difference to bbswitch?

Is there anything I should be doing differently, or is your suggestion essentially to just stick to rmmodding bbswitch before suspending until the PCIe changes come in 4.7, or until the DSM calls are straightened out?

Should I try using the disable_root_port fork mentioned in #112? Is that the same thing as pcie-root-port which is mentioned later? Is there a specific acpi_osi setting I should be using? I'm confused about which suggestions in #112 I should be considering, because although the thread initially wasn't about suspend, it's mentioned in there several times.

I apologize for all the questions, but there are so many things being mentioned across various issues that I don't really know what I'm supposed to be doing. Thank you for your help.

Lekensteyn commented 8 years ago

@jacobmischka The develop branch of Bumblebee is currently recommended over the master branch for compatibility with newer nvidia driver versions.

The hack from #112 should not be needed with Linux 4.7 and an appropriate version of bbswitch (not released yet). While it is not a problem with overheating during suspend, it is related to the fact that newer machines expect a different interface to be used (power resources _ON/_OFF instead of _DSM). About the disable_root_port fork, the root port is already controlled by the pcieport driver, I am not sure if it is a good idea to manage it in bbswitch too... that sounds risky (race conditions).

Could you open a new issue for your Acer E5-574G and include your BIOS version and the output of sudo acpidump > acpidump.txt? Edit: I checked the ACPI table from BIOS 1.14 for your model and found that your model indeed expects control of power resources.

jacobmischka commented 8 years ago

Done, I don't remember if referencing issues results in a notification or not. Thanks!

Evergreen1992 commented 8 years ago

:) I'm just passing by

Anti-Ultimate commented 8 years ago

Still no way to fix this ? :(

Lekensteyn commented 8 years ago

@Anti-Ultimate No fix is available in a stable version of bbswitch or the kernel. If you do not mind using the mainline kernel, try Linux 4.8-rc1 or newer with the nouveau module (and not bbswitch).

BernardoGO commented 8 years ago

@Lekensteyn does nouveau support Optimus without bbswitch?

Lekensteyn commented 8 years ago

@BernardoGO If you only need to save power, then both nouveau and bbswitch are functional. If you need to connect an external monitor, then dump bbswitch for nouveau. If you need to use the blob, then you can also try to use nouveau, but you would have to manually unload nouveau before loading nvidia.

firetech commented 7 years ago

I may be late discovering this, but it seems like upgrading my laptop (Clevo P651SE/XMG P505, GTX 970m) from Debian jessie to stretch (Linux 4.9, bbswitch 0.8, nvidia driver 375.82) fixed this issue. At least, nvidia-smi did not report any temperature difference after having the laptop suspended for 5-10 minutes. :)

Lekensteyn commented 7 years ago

@firetech Does bbswitch actually work? Can you see that in dmesg? If runtime PM is not enabled or if bbswitch is not activated, then the problem is not triggered.

firetech commented 7 years ago

@Lekensteyn I'm quite sure bbswitch works, but I haven't double checked. My laptop has a status LED for the dGPU and it is off unless I start an application with optirun/primusrun. Also, the LED comes on just before suspend (but is obviously off while suspended), just like in Windows. I'll double check the dmesg later, my laptop isn't with me at the moment.

EDIT: bbswitch is definitely working:

[ 5.916385] bbswitch: Succesfully loaded. Discrete card 0000:01:00.0 is on
[ 5.917197] bbswitch: disabling discrete graphics

Me running optirun nvidia-smi:

[82398.146560] bbswitch: enabling discrete graphics
[82400.369356] bbswitch: disabling discrete graphics
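
For anyone who wants to run the same check on their own machine, the bbswitch transitions can be pulled out of a saved dmesg dump with a few lines of Python. This is a sketch only: the function name is invented, and it matches just the two message texts visible in the logs above.

```python
import re
from typing import List

# Matches the two transition messages seen in the dmesg output above.
TRANSITION = re.compile(r"bbswitch: (enabling|disabling) discrete graphics")


def bbswitch_transitions(dmesg: str) -> List[str]:
    """Return the sequence of dGPU states ("ON"/"OFF") requested by bbswitch."""
    return [
        "ON" if match.group(1) == "enabling" else "OFF"
        for match in TRANSITION.finditer(dmesg)
    ]
```

Fed the lines quoted above, this gives OFF, ON, OFF: the card is disabled after boot, enabled for optirun and disabled again. An empty result would suggest bbswitch never toggled the card. The current state can also be read from /proc/acpi/bbswitch while the module is loaded.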

firetech commented 7 years ago

@Lekensteyn Also, at least on my laptop, I got the overheating as well when I tried an Ubuntu live USB (when it was new, so late 2014) without anything Optimus, nvidia or bbswitch related installed. The dGPU was left on without anything touching it. Since then, I haven't touched the BIOS/UEFI settings.
