Bumblebee-Project / bbswitch

Disable discrete graphics (currently nvidia only)
GNU General Public License v2.0

Asus UX501VW: Disabling GPU causes loss of fan control (with fan running at max speed) #134

Closed DRosky closed 8 years ago

DRosky commented 8 years ago

Hello, I have a new Asus UX501VW (Skylake version) with an Nvidia GTX 960M. When bbswitch is used to turn off the GPU, the cooling fans begin running at maximum speed after about 15-20 seconds. bbswitch reports that the GPU is off, and I believe the GPU may actually be off because the power consumption does drop somewhat, so the fans running at max speed is not a heat problem. In fact, the CPU temperature drops to below 30 deg. C because both chips are on the same heatsink. The fans can never be turned off again via any method without shutting down the system. Just rebooting does not shut them off. Here is a list of the symptoms and things that I've noticed or tried:

  1. Power consumption drops, so the GPU does appear to be turning off, at least partially.
  2. Although the power drops, it is still significantly higher than in Windows 10.
  3. Turning the card back on with bbswitch does not stop or slow the fan.
  4. Once the fans start, they can no longer be controlled with the pwm interface exposed by asus_nb_wmi. The asus_nb_wmi interface does, however, still report the fan speed. Before the GPU is turned off, the fans can be controlled via the asus_nb_wmi interface. After turning off the GPU, the fans cannot be controlled by any mechanism.
  5. Restarting does not turn off the fans; a complete shutdown is required.
  6. Everything runs fine in Windows 10, so this doesn't seem to be faulty hardware or BIOS. There are a few other reports of this on various machines.
  7. Everything else with bumblebee seems to work fine. The Nvidia driver loads (it now seems to be composed of two modules), and "optirun glxinfo" returns what you would expect.
  8. The following error messages appear in the log after running bbswitch:

[ 217.578972] bbswitch: disabling discrete graphics
[ 217.579000] ACPI Warning: _SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires Package (20160108/nsarguments-95)
[ 217.596320] pci_raw_set_power_state: 553 callbacks suppressed
[ 217.596330] pci 0000:01:00.0: Refused to change power state, currently in D0

  9. Distro: openSUSE Tumbleweed, Linux 4.6.2-1-default #1 SMP PREEMPT Fri Jun 10 08:12:44 UTC 2016 (2a68ef0) x86_64 x86_64 x86_64 GNU/Linux
  10. Attached acpidump file: acpidump.txt
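
The two key messages in the log above (the bbswitch request followed by the PCI core refusing to leave D0) can be checked for mechanically. This is a minimal sketch, assuming the kernel log has first been saved to a file (for example with `dmesg > dmesg.txt`); the function name `bbswitch_power_refused` is made up for illustration and is not part of bbswitch.

```shell
#!/bin/sh
# Hypothetical helper: report whether a saved kernel log shows the PCI core
# refusing to power down a device after bbswitch tried to disable it.
# Usage: bbswitch_power_refused LOGFILE   (exit status 0 = signature found)
bbswitch_power_refused() {
    log=$1
    # Both messages must be present: the bbswitch request and the refusal.
    grep -q 'bbswitch: disabling discrete graphics' "$log" &&
        grep -q 'Refused to change power state, currently in D0' "$log"
}
```

A match only says that the log contains this signature. It does not by itself prove that the fan problem has the same cause on another machine.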

I'm not sure if this is just another manifestation of a known problem, but I don't see these symptoms reported here, so I decided to open an issue.

Regards, David

DRosky commented 8 years ago

I noticed on the main page a request for some desired info that I did not provide. I am attaching that here (output of get-acpi-info.sh and the dump_info kernel module).

-David

Attachments: ASUSTeK_COMPUTER_INC.-N501VW.tar.gz, dump_info.txt

Lekensteyn commented 8 years ago

Hey David,

Thank you for the report and acpidump. For future reference, I think this is where BIOS can be downloaded for your laptop: https://www.asus.com/us/Notebooks/ASUS-ZenBook-Pro-UX501VW/HelpDesk_Download/

Your PGON ACPI method looks very similar to that of my Clevo P651RA, which hangs under some circumstances. Maybe you have the same issue. Can you still suspend/resume after the fans start spinning loudly?

Can you try bbswitch separately from Bumblebee and not load nvidia?

Peter

DRosky commented 8 years ago

Hi Peter, Thanks for the reply! A few things:

I have not tried suspend and resume after the fan starts. I will try that later and let you know the result.

-David

EDIT: Also, the problem is very consistent; it happens under all circumstances if I try to use bbswitch to turn off the card.

DRosky commented 8 years ago

After reading the issue you referenced (laptop freezes), I notice that I have not tried to use bbswitch after booting with text only (no X). I will try that as well and report the result.

So far I have never had a freeze, just the fan problem, with the fan controls (pwm) becoming locked until the machine is shut off.

Lekensteyn commented 8 years ago

Since you have already tried bbswitch, which showed the same problem, could you give nouveau a go? If the issue persists, please try this patch on top of Linux 4.6 for nouveau: https://lekensteyn.nl/files/linux-v4.6-pcipm-nouveau-pm2.patch

DRosky commented 8 years ago

Some results:

Lekensteyn commented 8 years ago

It might be possible to apply patches to the openSUSE version, but I would suggest using the vanilla kernel instead to exclude possible problems caused by openSUSE's patches.

openSUSE docs seem to be available at https://en.opensuse.org/openSUSE:Kernel_git#Building_kernel_packages

Normally you can grab a tarball or clone the repo, then apply the patches, make/copy a kernel config, build and install.

gunzip -c /proc/config.gz > .config  # use old kernel configuration
make oldconfig  # update the kernel configuration (just press Enter to accept new entries)
make  # build modules and image
sudo make modules_install  # install modules to /lib/modules/4.6.../

Then you have to install the kernel image (arch/x86/boot/bzImage) somewhere in /boot/ and (re)create an initial ramdisk (distro-dependent stuff).

DRosky commented 8 years ago

OK, thanks for the info. I'll start with the OpenSUSE page and go from there. I noticed that your patch patches a number of modules, not just Nouveau. I'll probably try it first with the OpenSUSE kernel and patches because if it works, I might end up with a fully functioning system ;)

DRosky commented 8 years ago

I had a thought along a different line. Do you know if there is any tool that can capture ACPI calls in Windows 10? If so, perhaps the correct call could be captured while running a 3D application that causes the GPU to be turned on and then back off again... Just trying to think out of the box..

Lekensteyn commented 8 years ago

Capturing ACPI calls in Windows 10? Who would do such a thin... oh hey, https://github.com/Bumblebee-Project/bbswitch/issues/115#issuecomment-218551781 :smiley:

The nouveau patches combined with some PCI core patches are supposed to perform these calls. These changes will likely end up in Linux 4.8.

DRosky commented 8 years ago

> Capturing ACPI calls in Windows 10? Who would do such a thin... oh hey, #115 (comment) :smiley:

Haha! I finally read those comments. In addition to the kernel patch (which I will do as soon as I have a few spare hours), I was thinking I could also try tracing Windows ACPI calls on this machine if it would help. I'm not sure what tool to use, but more importantly, it seems to need a special build of Windows, which is probably a bigger issue :(

Hopefully it won't be necessary if the newer ACPI interfaces are the same across newer machines.

Lekensteyn commented 8 years ago

I used a Checked/Debug build of Win10 and a remote WinDbg/KD (kernel debugger). I don't think that an additional trace is needed, the patches I mentioned should fix the issue.

verge-36 commented 8 years ago

Any progress? I have the exact same problem with an ASUS X550V (i7 6700HQ + Nvidia GTX 950M). I am running Arch Linux, btw.

Lekensteyn commented 8 years ago

@verge-36 You can try nouveau with Linux 4.8-rc1 kernel (or newer). Do not use bbswitch or the nvidia blob in that case.

zulucoda commented 8 years ago

Hi All,

I thought I'd help out by posting a comment regarding my experience with the ASUS UX501VW. I've got Ubuntu 15.10 installed, and the kernel version is 4.5.0 (for the touchpad to work). I've got the NVidia drivers installed, version 352.63. Within NVidia Settings there's a section called "PRIME Profiles" with 2 options to select:

I thought I would try the Intel Power Saving Mode. After selecting this option I restarted my machine. While the machine was booting up, the fan came on at max speed when it got to the login screen. I tried shutting down again and had the same problem, so I logged in, went to NVidia Settings, set the PRIME Profiles back to NVIDIA (Performance Mode), saved, and restarted the machine.

So I don't think bbswitch is the problem, since I don't have Bumblebee installed. I was thinking of installing Bumblebee, hoping it would solve the problem I had, but then I came across this thread.

So I think the problem here is the NVidia drivers, when the NVidia card is off it causes the fan to run at max speed.

hope this helps,

ta. muzi

(Attached image: screenshot from 2016-09-19 13-03-01)

DRosky commented 8 years ago

Some updates:

  1. I apologize; my life became very busy around the time I was describing these issues, and I never had time to learn the kernel hacking procedure and do the kernel rebuilds.
  2. OTOH, openSUSE tumbleweed has now been at kernel 4.8 for a while (currently at 4.8.4) so I decided to give things a try again, since many of the patches were supposed to be present in 4.8. The results are mixed, but there is an overall improvement:
    • The bad news: the version of bbswitch provided is still 0.8 (no change) and running bbswitch still results in the fan problem.
    • The good news: unloading nvidia and loading nouveau no longer causes the problem, and in addition, the temperature drops from 39 deg. to 33-34 deg., a change of about -5 degrees. The Gnome panel also now reports the idle battery time as 8.5 hours. This is still a bit worse than Windows, but better than the previous 5.5 hours. I can only assume that nouveau is either turning off the Nvidia GPU, or at least setting it into a low power state. The only downsides at the moment are that with the nouveau driver loaded, the system hangs on shutdown and the driver cannot be unloaded using modprobe -r. I don't know if this is a fundamental problem or an interaction with the Nvidia driver, which was installed from tarball, not from RPM.

Also, the kernel modules have changed. When nouveau is loaded, it is dependent on the following modules:

ttm mxm_wmi video i2c_algo_bit drm_kms_helper drm wmi button

One or two of these modules seem to have something to do with power management, or with new ways of interfacing with the GPU, particularly the mxm_wmi module.
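
A dependency list like the one above can be reproduced from `/proc/modules`, where the fourth column names the modules that use each entry (the same information `lsmod` shows as "Used by"). The sketch below is only an illustration; the function name `module_deps` is hypothetical, and the optional file argument exists so the parsing can be tried on a saved copy.

```shell
#!/bin/sh
# Hypothetical helper: print the modules that list MODULE in their "used by"
# column, i.e. the modules that MODULE depends on.
# Usage: module_deps MODULE [FILE]   (FILE defaults to /proc/modules)
module_deps() {
    mod=$1
    file=${2:-/proc/modules}
    # Field 4 is a comma-separated, comma-terminated list of users, or "-".
    awk -v m="$mod" '{
        n = split($4, users, ",")
        for (i = 1; i <= n; i++)
            if (users[i] == m) { print $1; break }
    }' "$file"
}
```

Running `module_deps nouveau` while nouveau is loaded should give a list comparable to the one quoted above, although the exact set depends on the kernel version and configuration.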

Note: I haven't yet tried updating to the latest binary nvidia driver, if there is a newer one. I'll check and try that. Ideally, the Nvidia driver should shut down the GPU (or at least implement maximum power savings) when it isn't being used, similar to what the nouveau driver now seems to be doing; then the extra layer of bbswitch wouldn't even be needed.

Also, I haven't tried suspend/resume yet.

Lekensteyn commented 8 years ago

@DRosky bbswitch and kernel 4.8 not working well together (in particular with runtime PM enabled via laptop-mode-tools or the equivalent) is a known issue. There is no timeframe yet for a fix; the workaround is to boot with the pcie_port_pm=off option added to your cmdline. See the dozens of other issues in the bbswitch issue tracker.

As for the hang on shutdown, see https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238 for a newly added workaround.

The nvidia blob will give a worse experience with your battery (it supports neither the PM methods from bbswitch nor those from nouveau); avoid it if you can. I recommend nouveau over bbswitch; the only reason to use bbswitch is if the nouveau driver does not support your card or if you plan to use it with the binary blob.

DRosky commented 8 years ago

@Lekensteyn ,

Thanks. I started catching up on the other threads (still catching up :) ) Some additional observations based on those suggestions:

  1. I get the same result with pcie_port_pm=off as without it. Either way, tee /proc/acpi/bbswitch <<<OFF causes the full-speed fan problem.
  2. I tried the workaround acpi_osi=! acpi_osi="Windows 2009". This did eliminate the hang on shutdown, but unfortunately it also caused the nouveau driver to be unable to reduce the GPU power. It seems that in order to power-manage the GPU, the nouveau driver (or one of the other modules on which it depends) needs some functionality that is not available when these kernel parameters are used.
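
For reference when repeating these tests, the state that bbswitch itself reports can be read back from the same `/proc/acpi/bbswitch` file, which holds a single line of the form `<PCI address> <state>`. This is a minimal sketch; the function name `bbswitch_state` is hypothetical and the optional file argument is only there so the parsing can be tried without the module loaded.

```shell
#!/bin/sh
# Hypothetical helper: print the power state (ON or OFF) reported by bbswitch.
# Usage: bbswitch_state [FILE]   (FILE defaults to /proc/acpi/bbswitch)
bbswitch_state() {
    file=${1:-/proc/acpi/bbswitch}
    # The file holds one line such as "0000:01:00.0 ON"; print the 2nd field.
    awk '{ print $2; exit }' "$file"
}
```

As noted earlier in this thread, bbswitch can report OFF while the kernel log still shows the device refusing to leave D0, so the reported state is worth comparing against `dmesg` and the actual power draw.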

DRosky commented 8 years ago

I read through the kernel bug report and verified that lspci also hangs (when the workaround is not being used). Trying to unload the nouveau driver also does not work. Same with suspend/resume. As mentioned previously, the acpi_osi workaround seems to prevent the nouveau driver from powering down the GPU. I haven't read all of the comments yet, so I don't know if that's happening on all affected machines.

DRosky commented 8 years ago

I tried one more experiment. I set pcie_port_pm=off along with loading the nouveau driver. In this case, the nouveau driver reverted to its previous behavior shown in kernel 4.6, whereby it once again caused the full-speed fan issue.

In summary, at the moment with this machine, only the nouveau driver with PCIE port power management and no acpi_osi limitations is able to reliably power down the GPU without causing fan problems, but then it cannot be powered back on, resulting in hangs.

EDIT: I finished reading through the other threads. It appears that the problems fall into two general categories: 1) Newer laptops where bbswitch worked fine prior to kernel 4.8 but where bbswitch is now broken with 4.8, and 2) newer laptops where bbswitch already had issues (such as the fan speed issue) on older kernels, and now there are different/additional issues on 4.8. The Asus UX501 seems to be in the second category.

Machines in the first category can be helped by the pcie_port_pm=off workaround, whereas machines in the second category can't, since that just reverts to the original problems (e.g, fan speed).

Lekensteyn commented 8 years ago

Right, for your first problem (fan control), you must use the new method in 4.8 with nouveau; the old method (DSM, forced via pcie_port_pm=off in 4.8) will definitely not work in your case.

The second problem (hang on suspend) would occur in any case where you use bbswitch or nouveau without the acpi_osi workaround. (Note: this problem is device-dependent, workarounds are described in https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238).

DRosky commented 8 years ago

Yes. The only problem with the last part is that the acpi_osi workaround, at least on this machine, while preventing the hangs, causes nouveau to no longer be able to turn off the GPU, so it defeats the purpose. Hopefully, the root cause of the inability to turn the GPU back on will be found to avoid needing this workaround.

Ultimately, I suppose, bbswitch will want to incorporate the new PM method for newer laptops, otherwise it will not be possible to use the Nvidia blob with power management in bumblebee on machines like mine and others with similar issues.

As an aside, I did have a weird quirk happen this morning. The laptop mysteriously booted up with the GPU off and no nvidia driver loaded. I'm guessing that nouveau's inability to turn the GPU back on, and my subsequent needing to force the hung machine off with the power button, left the GPU in an off state that survived a restart! There are some scary things in these UEFI firmware...

Lekensteyn commented 8 years ago

Even if bbswitch adopts the new PM method, you would still need to solve the hang problem that prevents good power off/resume/etc. Hopefully some progress can be made on the PCI bug; until then you can try to override the ACPI method to remove the If (OSYS == 0x07D9) from the PGON method in SSDT4.

Doing that is an exercise for the reader, see https://www.kernel.org/doc/Documentation/acpi/method-customizing.txt

DRosky commented 8 years ago

@Lekensteyn , your notes (https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt#L94) show the code as If ((OSYS != 0x07DF)) (that is, !=, not ==, and 0x07DF rather than 0x07D9)? Either way, I assume the objective is to modify the code so that the Windows10-specific code segment does not get executed, correct?

Lekensteyn commented 8 years ago

@DRosky The code is model-specific: in my case the condition is OSYS != 0x07DF, but in your case it is OSYS == 0x07D9, which is why you need a different acpi_osi workaround. The objective is to replace this condition with If (One), which ensures that the code is always executed.
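
The two OSYS constants are easier to relate to the acpi_osi strings once converted to decimal. The sketch below does only that conversion; reading the result as the year in a "Windows <year>" _OSI string is an assumption about how these firmwares set OSYS, and the function name `osys_to_decimal` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert an OSYS constant from the ACPI tables to decimal.
# Usage: osys_to_decimal 0x07D9
osys_to_decimal() {
    printf '%d\n' "$1"
}

osys_to_decimal 0x07D9   # prints 2009
osys_to_decimal 0x07DF   # prints 2015
```

Under that assumption, 0x07D9 lines up with the "Windows 2009" string used in the acpi_osi workaround, and 0x07DF with "Windows 2015".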

An automated tool should be possible:

If ( OSYS  == ... ) {  // or != instead of ==
  ...
} Else {
   LKEN (...)
}

DefinitionBlock(...) {
    // TODO need some External(...) references here?
    Method (\_SB.PCI0...PGON) {
        ...
    }
}

Here is an example of an SSDT which I used to patch the battery method: https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-B7130/BatteryFix.dsl
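
The first step of such a tool, locating the OSYS comparison, could be sketched in shell as follows. It assumes the SSDT has already been decompiled to a .dsl file (for example with `iasl -d`) and that the decompiler emitted C-style operators; older output using LEqual/LNotEqual would not match. The function name `find_osys_conditions` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list lines in a decompiled ACPI table (.dsl) that
# compare OSYS against a hex constant, prefixed with their line numbers.
# Usage: find_osys_conditions FILE.dsl
find_osys_conditions() {
    grep -nE 'If[[:space:]]*\(+[[:space:]]*OSYS[[:space:]]*(==|!=)[[:space:]]*0x[0-9A-Fa-f]+' "$1"
}
```

This only finds candidate lines. Deciding which match belongs to the PGON method, and writing the replacement method, still has to be done by hand for each model.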

DRosky commented 8 years ago

@Lekensteyn , Thank you very much for that info. An automated method would be nice since it could be easily applied to any machine. I think with that and the last link, there is enough information to try it.

I actually have also found a simpler workaround, for this machine at least. As I mentioned previously, turning off PCIE port PM caused bbswitch to revert to the old problem with the fan speed, whereas the acpi_osi workaround prevents nouveau from power-managing the GPU. It then occurred to me that there is a possibility that telling the firmware you are "Windows 2009" might modify the behaviour of the _DSM methods in the firmware, so I added the acpi_osi parameters in addition to pcie_port_pm=off, and it looks like the hunch was correct, now bbswitch works well, with no fan speed problems.

I did some initial testing, and on this machine, it doesn't seem that reporting as Windows 2009 causes any hardware to become less functional, with the exception that screen brightness steps are more coarse. Bbswitch, optirun, and bumblebeed are all working well with the Nvidia binary blob.

I might still play around with modifying the PGON method, especially if the kernel PCI bug is not found soon, since that is a more forward-looking solution. Having that would pave the way for using the Nvidia blob in bumblebee once bbswitch incorporates the new PM method.

I'm going to go ahead and close this thread since those things are being tracked in more topic-appropriate threads. Thanks again for all of the help and for all the software you've created.

Lekensteyn commented 8 years ago

Ha, nice find with combining the two workarounds into a newer one that works for you :-)

Hopefully the cause of the new PM issue can be found, but at the moment it is not really going fast.

dexterlb commented 8 years ago

Is there a way to work around this problem with the binary nvidia driver?

zkanda commented 8 years ago

@DRosky Can you show the final kernel parameter that you added? Is it something like this?

acpi_osi=! acpi_osi='Windows 2009' pcie_port_pm=off 

I had to add acpi_backlight=native to get my screen brightness keyboard working.

guhjys commented 7 years ago

Hi all. I can't solve a problem with the fan speed on an Asus N551VW (Skylake 6700, GTX 960, kernel 4.8.12, Debian). The problem is the same as everyone else's: after bbswitch >> off, the reported fan speed rises to 25500 rpm, although the maximum is 4300 rpm. acpi_osi=! acpi_osi='Windows 2009' pcie_port_pm=off does not help. I am keeping fancontrol as a last resort. If anyone knows how to fix this, please answer!

DRosky commented 7 years ago

I apologize; I hadn't noticed the updates to this thread until now. For me the following combination did work:

acpi_osi=! acpi_osi='Windows 2009' pcie_port_pm=off

This allowed the existing bbswitch (0.8) to disable the GPU without causing the fan problem. A caveat here is that I haven't had time to do updates for a while, so I'm still using kernel 4.8.7. If there's been a regression in this area, it might break for me when I upgrade to 4.8.12 (I've just been too busy to upgrade recently). In case that's part of the problem, you might want to try 4.8.7. If there's been a regression, it would be good to know. I'll also report any changes when I update.

As for the screen brightness keys, yes those have been broken from the beginning. I'm not in the habit of using the keyboard for that (I usually use the GUI controls which do work), but it's good to know the acpi_backlight=native option does work.

Lekensteyn commented 7 years ago

Backlight keys could be fixed with 4.10 via https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-next&id=8e1b56a4b1deb3d25674c49255388902901f2c45

DRosky commented 7 years ago

@raidhon , I just updated my N501VW to 4.8.12 and everything is still working fine. I just noticed that your post is regarding an N551VW, which is a different model. You might want to check the link below (provided previously by Lekensteyn) to see if there is a specific work-around that is known to work with that model. If there isn't, perhaps some experimenting with the Nouveau driver might be worth trying for your machine (it didn't work for this one).

https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238

guhjys commented 7 years ago

@DRosky thanks for the reply. I tried everything from this post, and nothing works. Nouveau does not suit me, as I use the GPU for simple experiments with neural networks, and for this I need the Nvidia driver. Everything works for me except that the fans go crazy with noise after bbswitch >> off. I have to turn the laptop off and on again (I do not have the bumblebeed service in the startup), and then the fan begins to operate normally. It does not interfere with my experiments, it is just annoying ))

If it is not too difficult, could you describe in detail what you did to get it working? Maybe I missed something, or maybe the version of one of the packages is not the same. I would be so grateful!!

DRosky commented 7 years ago

@raidhon , Everything I did is captured in this thread, but in summary, the way the firmware handles some ACPI calls has changed in newer machines, which causes various problems in some cases. On some machines, disabling the newer PCIe port power management and telling the firmware you are an earlier version of windows ("Windows 2009" = Windows 7) causes the ACPI calls to be handled differently and the older method works, but this is not guaranteed and it varies from one machine to the next (for all I know, these machines might even have trouble running actual Windows 7). Until there is a solution where Linux can reliably use the new power management scheme, there will be problems on some machines.

I also use the Nvidia binary driver to access the GPU for image processing purposes, so even if the Nouveau driver could properly turn the GPU off and on without hangs, it doesn't help me that much either. Before I found this work-around, I just accepted that the GPU was powered on all the time. The battery operating time is reduced, but other than that everything worked fine.

One last thing is to make sure you have updated to the most recent UEFI firmware for your machine, in case you haven't already checked that.

sohrabi924 commented 7 years ago

Hi, I have exactly the same problem: UX501VW with a GTX 960M, and the fan is always on. Could you solve this problem?

DRosky commented 7 years ago

@sohrabi924 , I haven't been following this for a while, so I don't yet know if the fundamental problem with newer UEFIs designed for Windows 10 has been solved, but for the ux501vw, the workaround shown above has worked for me to enable bumblebee without the fan problem. For reference, the workaround is:

acpi_osi=! acpi_osi='Windows 2009' pcie_port_pm=off

Note, this is with OpenSuSE Tumbleweed and has worked all the way through kernel 4.11.x so far.

Note, I haven't updated the UEFI (BIOS) since discovering this workaround, so if you have a more updated UEFI and the workaround doesn't work, then possibly there's a problem there.

-D

Peuczynski commented 6 years ago

FYI, I don't think this is a bbswitch-specific problem, although I have experienced this too on a previous OS. Following the Arch wiki tutorial to disable the dGPU completely (modprobe acpi_call, then /usr/share/acpi_call/examples/turn_off_gpu.sh) gives the same max-fan result.

And I can confirm that for ASUS UX550VE the combination GRUB_CMDLINE_LINUX="acpi_osi=! acpi_osi=\"Windows 2009\" pcie_port_pm=off acpi_backlight=native" calms down the fans on GPU disable

clouedoc commented 6 years ago

Hey ! For users who are trying to fix this problem on a faulty laptop, there is a temporary solution to calm down your fans.

The solution is to trick bbswitch into enabling the GPU before the fans go crazy. To do that, ensure bbswitch is loaded (modprobe bbswitch), then run glxgears on your nvidia card:

sudo optirun glxgears

update: I can't find any clean way to add these options, and I can't find any proper grub config file in /etc/. Any help ?

second update: found it: /etc/default/grub. Don't forget to back up your old command line first.
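
After editing /etc/default/grub and rebooting, it can be worth confirming that the options actually reached the kernel. This is a minimal sketch; the function name `cmdline_has_all` is hypothetical, and it does a plain substring match, so it cannot tell a parameter from a longer one that contains it.

```shell
#!/bin/sh
# Hypothetical helper: check that every given string appears on a kernel
# command line. Prints the first missing one and returns non-zero.
# Usage: cmdline_has_all FILE PARAM...   (FILE is usually /proc/cmdline)
cmdline_has_all() {
    file=$1
    shift
    for param in "$@"; do
        # Fixed-string search so characters such as "!" are taken literally.
        grep -qF -- "$param" "$file" || { echo "missing: $param"; return 1; }
    done
}
```

For the combination discussed in this thread, the call would look like `cmdline_has_all /proc/cmdline 'acpi_osi=!' 'Windows 2009' 'pcie_port_pm=off'`. Note that the GRUB configuration has to be regenerated after the edit, and the command for that is distro-specific (for example `update-grub` on Debian and Ubuntu, or `grub2-mkconfig -o /boot/grub2/grub.cfg` on openSUSE).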

julientaq commented 5 years ago

@jesuiscamille where did you put the optirun glxgears? i can't find a way to make it work :man_shrugging:

thanks!

clouedoc commented 5 years ago

@julientaq hey there ! You may run the command in a shell/terminal.

julientaq commented 5 years ago

Are you saying that running optirun glxgears in your terminal stops the fans at any time?

As I understood it, this needs to happen before X is started. Where would you add a script to make this happen?

Thanks for replying :D

clouedoc commented 5 years ago

@julientaq hey ! This command needs to be run before the fan goes crazy. It needs X to be running.

Just close your laptop. Open it, then quickly run this command. It should do the trick.

julientaq commented 5 years ago

got it. Will try out! Thanks again!