Bumblebee-Project / bbswitch

Disable discrete graphics (currently nvidia only)
GNU General Public License v2.0

bbswitch is broken with kernel 4.8 pcie port power management #140

Open nathanielwarner opened 8 years ago

nathanielwarner commented 8 years ago

I just upgraded to kernel 4.8, and bbswitch 0.8-1 no longer works properly. When I try to run something with primusrun, it fails with "bumblebee could not enable discrete graphics card" or something, and I get this in dmesg:

bbswitch: enabling discrete graphics
pci 0000:01:00.0: Refused to change power state, currently in D3
pci 0000:01:00.0: Refused to change power state, currently in D3

When I use the kernel command line option pcie_port_pm=off primusrun works again, and I get this in dmesg upon using primusrun:

bbswitch: enabling discrete graphics
nvidia-nvlink: Nvlink Core is being initialized, major device number 242
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  370.28  Thu Sep  1 19:45:04 PDT 2016
vgaarb: this pci device is not a vga device
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  370.28  Thu Sep  1 19:18:48 PDT 2016
nvidia-modeset: Allocated GPU:0 (GPU-33c835cf-d564-600a-037b-c7ecb9188d7c) @ PCI:0000:01:00.0
nvidia-modeset: Freed GPU:0 (GPU-33c835cf-d564-600a-037b-c7ecb9188d7c) @ PCI:0000:01:00.0
vgaarb: this pci device is not a vga device
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
nvidia-modeset: Unloading
nvidia-nvlink: Unregistered the Nvlink Core, major device number 242
bbswitch: disabling discrete graphics
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
pci 0000:01:00.0: Refused to change power state, currently in D0

Is anyone else seeing bbswitch fail under the default configuration of kernel 4.8? I'm running Manjaro with kernel 4.8.1-1, Nvidia driver 370.28, bbswitch 0.8-1.
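The "Refused to change power state" lines quoted above are the main symptom in this thread, so it can help to pull them out of a long kernel log. The following is a minimal Python sketch, assuming the message format shown in the logs above; the function name find_refused_transitions is illustrative and not part of bbswitch or Bumblebee.

```python
import re

# Matches the kernel message quoted in this report, for example:
# "pci 0000:01:00.0: Refused to change power state, currently in D3"
REFUSED_RE = re.compile(
    r"pci (?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Refused to change power state, currently in (?P<state>D\d\w*)"
)


def find_refused_transitions(dmesg_text):
    """Return (device, state) pairs for every refused power-state change."""
    return [
        (match.group("device"), match.group("state"))
        for match in REFUSED_RE.finditer(dmesg_text)
    ]
```

Feeding it the saved output of dmesg gives one pair per refusal, which makes it easy to see whether the card is stuck in D3 (cannot power on) or in D0 (cannot power off).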

mirh commented 7 years ago

Different systems have different DSDT tables and thus different ACPI behavior (not to mention possibly different hardware internals).

lrafa commented 7 years ago

@mirh naturally, but it was working before, and after a system update it stopped working. Hardware is definitely not the problem. I did change the kernel version, but I tried rolling back and still couldn't get bbswitch to work.

I am quite used to debugging C code, but unfortunately I have zero experience with kernel modules.

mirh commented 7 years ago

Could it be you also got mesa/nouveau/forceware updated in the meantime?

lrafa commented 7 years ago

I actually forgot to update mesa. I've never had nouveau installed (I tried it a few days ago, it didn't work, so I removed it). I don't know what you mean by forceware.

mirh commented 7 years ago

For the record, it's the name of NVIDIA's closed-source driver.

Lekensteyn commented 7 years ago

@lrafa If you use nouveau (or bbswitch) and experience freezing, have a look at https://github.com/Bumblebee-Project/Bumblebee/issues/764

Unfortunately, I observed that on my sibling's KBL laptop (i7-7700HQ with a GTX 1050), even such an acpi_osi workaround did not fully solve the lockup. For now he'll be using nouveau.runpm=0 to have a stable machine at the cost of much higher power consumption... this needs some more debugging, but I will be abroad for the next four months (cannot work on this). YMMV

postadelmaga commented 7 years ago

@chadfurman I have the same problem with powertop and PM for Nvidia breaking bbswitch. I opened an issue here: https://github.com/Bumblebee-Project/bbswitch/issues/159

qdel commented 7 years ago

Hi,

On all recent kernel versions, I can't use my graphics card after suspend.

I tried the kernel parameters pcie_port_pm=off and rcutree.rcu_idle_gp_delay=1, but they change nothing. The following is true both with and without these parameters.

At boot: no problem, I can use my graphics card. It switches on/off correctly. (Sometimes I get 'stuck in D0', but I have a script which detects this and does a simple ON/OFF a second later.)

If I suspend my laptop, the card is then stuck in D3. I tried unloading bbswitch when entering sleep, with unload_state set to both 0 and 1, but it changes nothing.

I opened an issue here a long time ago (sorry for not giving updates, I haven't seen the answers since): https://github.com/Bumblebee-Project/Bumblebee/issues/625

I was using my hack to avoid the pm callback from bbswitch to make things happen "more slowly"... but it doesn't work anymore.

How can we be sure pcie_port_pm=off is active?

lrafa commented 7 years ago

@qdel Would you mind sharing your script for turning it off? What do you do? echo OFF into the appropriate file does nothing for me.

Try cat /proc/cmdline to see if the currently booted kernel was loaded with pcie_port_pm=off.
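The check suggested above can also be scripted. This is a minimal Python sketch, assuming the command line is a plain whitespace-separated list of tokens (quoted values that contain spaces are not handled); kernel_param and read_cmdline are illustrative names, not part of any tool mentioned in this thread.

```python
def kernel_param(cmdline, name):
    """Return the value of `name` on a kernel command line, or None if absent.

    A bare flag without "=value" is reported as an empty string.
    """
    value = None
    for token in cmdline.split():
        key, _, val = token.partition("=")
        if key == name:
            value = val  # keep the last occurrence if the option is repeated
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the currently booted kernel."""
    with open(path) as handle:
        return handle.read()
```

Calling kernel_param(read_cmdline(), "pcie_port_pm") returns "off" when the option was passed at boot and None otherwise. Because the match is on the exact option name, a misspelled option on the command line shows up as None, which is a quick way to catch typos in the bootloader configuration.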

qdel commented 7 years ago

@lrafa It's dirtier than a script. I wrote a little Qt program at the corner of a table after too much alcohol, so don't look too closely at the code quality; it was made only for my own laptop.

https://github.com/qdel/optimus_watcher

But:

Powered off: (image) Powered on: (image)

Unknown is orange.

Important file is: bbswitchchecker.cpp.

By the way: sorry for the French commit logs and comments :/

lrafa commented 7 years ago

@qdel Not really, the important file is bbswitcher.cpp

And you are turning it on/off by writing "ON" or "OFF" into /proc/acpi/bbswitch, which doesn't work here.

Thanks for sharing anyway! (and yeah, I'm also a fan of Qt :P)
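For readers unfamiliar with the interface being discussed: bbswitch exposes a single proc file. Below is a minimal Python sketch, assuming the file reads as one line of the form "0000:01:00.0 ON" (PCI address, then state) as shown in the bbswitch README; the function names are illustrative.

```python
BBSWITCH_PATH = "/proc/acpi/bbswitch"


def parse_bbswitch_state(text):
    """Split a line such as "0000:01:00.0 ON" into (pci_address, state)."""
    address, state = text.split()
    return address, state


def read_bbswitch(path=BBSWITCH_PATH):
    """Read the current state; requires the bbswitch module to be loaded."""
    with open(path) as handle:
        return parse_bbswitch_state(handle.read())


def write_bbswitch(state, path=BBSWITCH_PATH):
    """Request "ON" or "OFF"; requires root and the bbswitch module."""
    if state not in ("ON", "OFF"):
        raise ValueError(f"unsupported state: {state!r}")
    with open(path, "w") as handle:
        handle.write(state)
```

A write that appears to do nothing, as reported above, can be confirmed by calling read_bbswitch() afterwards and comparing the returned state with the one requested.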

qdel commented 7 years ago

@lrafa On my computer, the problem is the 'speed' of the process. If I run primusrun glxgears and press the Escape key to quit glxgears, there is a 100% chance that I get: pci 0000:01:00.0: Refused to change power state, currently in D0

It was also the case for my suspend problem. Using the pm handler, everything was too fast and I lost the card. Using my scripts/programs it was slower, and it worked. Until now.
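The "do it more slowly and retry" idea described above can be sketched in a few lines. This is not the code from optimus_watcher, only a minimal Python illustration; request and read_state are hypothetical callables supplied by the caller (for example, thin wrappers around writing to and reading from /proc/acpi/bbswitch).

```python
import time


def switch_with_retry(request, read_state, target, attempts=3, delay=1.0,
                      sleep=time.sleep):
    """Request `target` and re-check the state, pausing between attempts.

    `request(target)` asks for the state change and `read_state()` returns
    the current state. Returns the number of attempts that were needed, or
    0 if the card never reached `target`.
    """
    for attempt in range(1, attempts + 1):
        request(target)
        sleep(delay)  # give the hardware time to settle before checking
        if read_state() == target:
            return attempt
    return 0
```

The one-second default mirrors the "a second later" retry mentioned earlier in the thread; whether a delay helps at all seems to depend on the machine.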

Fincer commented 7 years ago

pcie_port_pm=off is not working for me. Kernel 4.13. Arch Linux x86_64.

P.S. Some power-management-related commits for kernel version 4.8: "Linux 4.8 - ACPI, EFI, cpufreq, thermal, Power Management"

senepa commented 7 years ago

tee /proc/acpi/bbswitch <<<ON, with or without pcie_port_pm=off, is not working for me either. Kernel 4.13. Fedora Linux x86_64.

Lekensteyn commented 7 years ago

@Fincer @senepa The laptop model is relevant. Have you tried using nouveau instead of bbswitch? (You won't be able to use Bumblebee then, but at least you'll save power and have working external monitors.)

zx2c4 commented 7 years ago

@Lekensteyn any plans for updating bbswitch?

archenroot commented 7 years ago

@qdel - I also do my best work while programming with a bottle of vodka 💃

Lekensteyn commented 7 years ago

@zx2c4 Probably not until at least the end of this year. Progress has also stalled since I was trying to solve https://github.com/Bumblebee-Project/Bumblebee/issues/764 at the same time (without luck). Does nouveau not work for you?

eddynetweb commented 6 years ago

Having the exact same issue as @qdel - the graphics card won't start after entering a suspended state.

TungstenOxide commented 6 years ago

pcie_port_pm=off does not fix it for me either. I always freeze before I see the login screen. It gets to "Started User Manager for UID 42" and that's it. Fedora 27 with kernel 4.14.11, XPS 15 9560, Core i7-7700HQ, GTX 1050.

chenxiaolong commented 6 years ago

@TungstenOxide For the XPS 9560, you'll need to boot with modprobe.blacklist=nouveau. The nouveau driver causes the system to hang.

For bbswitch to work, you'll also need a kernel that supports CONFIG_ACPI_REV_OVERRIDE_POSSIBLE and boot with acpi_rev_override=5. Fedora's kernel is not compiled with this option, so you'll need a custom kernel. I've made a custom kernel for Fedora 27 here: https://copr.fedorainfracloud.org/coprs/chenxiaolong/kernel-acpi-rev-override/
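Whether a given kernel was built with that option can be checked from its build configuration. Here is a minimal Python sketch, assuming the distribution installs the config as /boot/config-<kernel release> (many do, but this is an assumption; some only expose /proc/config.gz); the function names are illustrative.

```python
import platform


def kernel_config_value(config_text, option):
    """Return "y", "m" or a literal value, or None if `option` is not set."""
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as "# CONFIG_FOO is not set" and never
        # start with "CONFIG_FOO=", so they fall through to None.
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


def read_boot_config():
    """Read the build config of the running kernel (assumed location)."""
    with open(f"/boot/config-{platform.release()}") as handle:
        return handle.read()
```

kernel_config_value(read_boot_config(), "CONFIG_ACPI_REV_OVERRIDE_POSSIBLE") returning None would mean that acpi_rev_override=5 has no effect on that kernel.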

TungstenOxide commented 6 years ago

@chenxiaolong I am booting with nouveau blacklisted. Are you sure that Fedora doesn't have that support?

chenxiaolong commented 6 years ago

How are you blacklisting nouveau? The rd.blacklist option doesn't seem to work properly anymore.

TungstenOxide commented 6 years ago

I'm pretty sure that I'm using modprobe

Lekensteyn commented 6 years ago

@TungstenOxide Do you have the latest BIOS version (1.6 or 1.7 IIRC)? The rev workaround did not work with certain BIOS versions (1.5?) for the XPS 9560.

TungstenOxide commented 6 years ago

They have a new BIOS? I'm on 1.5, so that might be it.

TungstenOxide commented 6 years ago

@Lekensteyn No good. I'm on 1.6.2 already.

Lekensteyn commented 6 years ago

@joshu256 Looks like you solved your problem already in https://github.com/Bumblebee-Project/Bumblebee/issues/946, nvidia-smi triggers loading the nvidia module.

joshu256 commented 6 years ago

@Lekensteyn Yeah sorry for wasting your time, I'll delete the comment

real-or-random commented 6 years ago

I've been using pcie_port_pm=off for a long time and it worked. Since I've upgraded from 4.16.12 to 4.16.13, my tray icon indicates that the NVIDIA card is enabled after boot, which should not be the case. There's nothing in the bumblebee logs but when I try to restart bumblebee I get

[  490.555643] bbswitch: disabling discrete graphics
[  490.556229] ------------[ cut here ]------------
[  490.556234] pci 0000:02:00.0: disabling already-disabled device
[  490.556266] WARNING: CPU: 2 PID: 539 at drivers/pci/pci.c:1646 pci_disable_device+0x8a/0xa0
[  490.556268] Modules linked in: uinput cmac rfcomm fuse snd_hda_codec_hdmi snd_hda_codec_realtek bnep snd_hda_codec_generic qcserial usb_wwan xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack cdc_mbim cdc_wdm usbserial cdc_ncm btusb usbnet mii btrtl btbcm btintel bluetooth uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev ecdh_generic media ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter joydev mousedev bbswitch(O) snd_soc_skl
[  490.556400]  arc4 snd_soc_skl_ipc i915 snd_hda_ext_core snd_soc_sst_dsp snd_soc_sst_ipc snd_soc_acpi snd_soc_core iTCO_wdt iTCO_vendor_support mei_wdt snd_compress ac97_bus snd_pcm_dmaengine iwlmvm wmi_bmof intel_rapl x86_pkg_temp_thermal intel_powerclamp intel_wmi_thunderbolt mac80211 kvm_intel i2c_algo_bit snd_hda_intel nls_iso8859_1 nls_cp437 vfat drm_kms_helper snd_hda_codec fat iwlwifi kvm snd_hda_core drm e1000e snd_hwdep irqbypass cfg80211 intel_cstate snd_pcm intel_uncore intel_rapl_perf i2c_i801 psmouse ptp intel_gtt pps_core input_leds pcspkr thinkpad_acpi snd_timer agpgart syscopyarea sysfillrect sysimgblt ucsi_acpi fb_sys_fops mei_me typec_ucsi nvram rfkill mei intel_pch_thermal typec wmi snd shpchp soundcore led_class i2c_hid evdev ac battery hid rtc_cmos mac_hid vboxnetflt(O) vboxnetadp(O)
[  490.556541]  vboxpci(O) vboxdrv(O) coretemp msr overlay sg crypto_user acpi_call(O) ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto dm_crypt algif_skcipher af_alg sd_mod uas usb_storage scsi_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc serio_raw atkbd libps2 aesni_intel xhci_pci aes_x86_64 crypto_simd glue_helper xhci_hcd cryptd usbcore usb_common i8042 serio dm_mod
[  490.556613] CPU: 2 PID: 539 Comm: bumblebeed Tainted: G           O     4.16.13-1-ARCH #1
[  490.556616] Hardware name: LENOVO 20H9001EGE/20H9001EGE, BIOS N1VET37W (1.27 ) 11/16/2017
[  490.556622] RIP: 0010:pci_disable_device+0x8a/0xa0
[  490.556625] RSP: 0018:ffffb0fc01f53dd0 EFLAGS: 00010286
[  490.556630] RAX: 0000000000000000 RBX: ffff8f1d2caf3000 RCX: 0000000000000001
[  490.556633] RDX: 0000000080000001 RSI: 0000000000000092 RDI: 00000000ffffffff
[  490.556636] RBP: ffff8f1d2ca7d720 R08: 000001557f96467b R09: 00000000000007e5
[  490.556639] R10: ffffffff905dc720 R11: 0000000000000000 R12: 0000561c612d5ac0
[  490.556642] R13: ffffb0fc01f53f00 R14: 0000561c612d5ac0 R15: 0000000000000000
[  490.556646] FS:  00007f3a858a1040(0000) GS:ffff8f1d3f500000(0000) knlGS:0000000000000000
[  490.556649] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  490.556652] CR2: 00007fc10c98b490 CR3: 000000045d946006 CR4: 00000000003606e0
[  490.556655] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  490.556658] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  490.556660] Call Trace:
[  490.556675]  bbswitch_off.cold.4+0xc1/0x1d1 [bbswitch]
[  490.556683]  ? bbswitch_proc_write+0xaf/0xd0 [bbswitch]
[  490.556689]  ? proc_reg_write+0x3c/0x60
[  490.556694]  ? __vfs_write+0x36/0x170
[  490.556701]  ? vfs_write+0xa9/0x190
[  490.556706]  ? SyS_write+0x4f/0xb0
[  490.556714]  ? do_syscall_64+0x74/0x190
[  490.556720]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  490.556725] Code: 01 48 85 ed 75 07 48 8b ab b0 00 00 00 48 8d bb a0 00 00 00 e8 a8 03 13 00 48 89 ea 48 c7 c7 50 65 e9 8f 48 89 c6 e8 e0 3f cc ff <0f> 0b eb 8f 48 89 df e8 ea fe ff ff 80 a3 c1 07 00 00 f7 5b 5d 
[  490.556817] ---[ end trace bb8ef13124112189 ]---
[  490.577225] thinkpad_acpi: EC reports that Thermal Table has changed

Zeben commented 6 years ago

@real-or-random I can confirm your issue because I hit the same thing today. I have a Dell Vostro 5459 laptop with a discrete Nvidia GPU. When I connect the AC adapter, the discrete GPU powers on, the laptop produces a lot of heat and the fan spins endlessly. The sensors utility shows a current GPU temperature of +50...+60°C, which it shouldn't when the GPU is disabled (then it shows +1°C). Rolling back from 4.16.13 to 4.16.12 did the trick. I'm waiting for Linux 4.17 in the hope that the issue is solved there.

Zeben commented 6 years ago

Bump. The same problem is present in Linux 4.17.2. Sadly, I need to roll back to 4.16.3 to make bbswitch work correctly. :/

[14754.883889] ------------[ cut here ]------------
[14754.883891] pci 0000:01:00.0: disabling already-disabled device
[14754.883902] WARNING: CPU: 3 PID: 469 at drivers/pci/pci.c:1650 pci_disable_device+0x8a/0xa0
[14754.883903] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc overlay uas usb_storage ccm uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media cmac rfcomm bnep btusb btrtl btbcm btintel bluetooth ecdh_generic fuse arc4 iwlmvm bbswitch(O) mac80211 joydev mousedev intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp iwlwifi kvm_intel snd_soc_skl iTCO_wdt iTCO_vendor_support hid_multitouch hid_generic snd_soc_skl_ipc dell_wmi wmi_bmof sparse_keymap mxm_wmi snd_hda_ext_core cfg80211 kvm nls_iso8859_1 snd_soc_sst_dsp nls_cp437 snd_soc_sst_ipc
[14754.883946]  vfat fat dell_laptop snd_soc_acpi dell_smbios irqbypass dell_wmi_descriptor dcdbas crct10dif_pclmul crc32_pclmul snd_soc_core ghash_clmulni_intel snd_hda_codec_hdmi pcbc snd_compress dell_smm_hwmon ac97_bus snd_pcm_dmaengine snd_hda_codec_conexant snd_hda_codec_generic aesni_intel r8169 aes_x86_64 crypto_simd cryptd mii glue_helper intel_cstate snd_hda_intel snd_hda_codec snd_hda_core intel_uncore snd_hwdep input_leds intel_rapl_perf psmouse led_class pcspkr idma64 snd_pcm snd_timer snd i2c_i801 mei_me intel_lpss_pci mei shpchp i2c_hid processor_thermal_device soundcore intel_lpss intel_soc_dts_iosf intel_pch_thermal hid int3402_thermal dell_rbtn int3400_thermal battery rtc_cmos ac wmi int340x_thermal_zone acpi_thermal_rel evdev rfkill mac_hid sg crypto_user ip_tables x_tables ext4 crc32c_generic
[14754.883986]  crc16 mbcache jbd2 fscrypto sd_mod serio_raw atkbd libps2 ahci xhci_pci libahci xhci_hcd libata crc32c_intel usbcore scsi_mod usb_common i8042 serio i915 intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart
[14754.884005] CPU: 3 PID: 469 Comm: bumblebeed Tainted: G           O      4.17.2-1-ARCH #1
[14754.884006] Hardware name: Dell Inc. Vostro 14-5459/080W31, BIOS 1.1.1 09/22/2017
[14754.884009] RIP: 0010:pci_disable_device+0x8a/0xa0
[14754.884010] RSP: 0018:ffffbbe6426ffdd8 EFLAGS: 00010286
[14754.884012] RAX: 0000000000000000 RBX: ffffa344b9ab0000 RCX: 0000000000000001
[14754.884013] RDX: 0000000080000001 RSI: 0000000000000082 RDI: 00000000ffffffff
[14754.884014] RBP: ffffa344b9a34630 R08: 00002037dd036bf4 R09: 00000000000004ab
[14754.884015] R10: ffffffffa85ef6e0 R11: 0000000000000000 R12: 000055985a2d9ac0
[14754.884016] R13: ffffbbe6426fff08 R14: 000055985a2d9ac0 R15: 0000000000000000
[14754.884017] FS:  00007fd236f31040(0000) GS:ffffa344c3d80000(0000) knlGS:0000000000000000
[14754.884018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14754.884019] CR2: 00007fff7a920b98 CR3: 000000026cd76005 CR4: 00000000003606e0
[14754.884020] Call Trace:
[14754.884028]  bbswitch_off.cold.4+0xc1/0x1d1 [bbswitch]
[14754.884030]  bbswitch_proc_write+0xaf/0xd0 [bbswitch]
[14754.884034]  proc_reg_write+0x3c/0x60
[14754.884036]  __vfs_write+0x36/0x170
[14754.884039]  vfs_write+0xa9/0x190
[14754.884041]  ksys_write+0x4f/0xb0
[14754.884045]  do_syscall_64+0x5b/0x170
[14754.884047]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[14754.884049] RIP: 0033:0x7fd23622e9d4
[14754.884050] RSP: 002b:00007fff7a923448 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[14754.884052] RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007fd23622e9d4
[14754.884053] RDX: 0000000000000028 RSI: 000055985a2d9ac0 RDI: 0000000000000004
[14754.884054] RBP: 000055985a2d9ac0 R08: 00007fd236f31040 R09: 000000000000002c
[14754.884055] R10: 00000000000001b6 R11: 0000000000000246 R12: 000055985a2d6600
[14754.884056] R13: 0000000000000028 R14: 00007fd2364f75c0 R15: 0000000000000028
[14754.884058] Code: 01 48 85 ed 75 07 48 8b ab b0 00 00 00 48 8d bb a0 00 00 00 e8 48 19 13 00 48 89 ea 48 c7 c7 00 8a ea a7 48 89 c6 e8 70 32 cb ff <0f> 0b eb 8f 48 89 df e8 ea fe ff ff 80 a3 c1 07 00 00 f7 5b 5d 
[14754.884090] ---[ end trace 48b439d307758e63 ]---
[14754.943731] pci 0000:01:00.0: Refused to change power state, currently in D0

Lekensteyn commented 6 years ago

Thanks @real-or-random for narrowing down the version range. I suspect that torvalds/linux@abf92f86361b is causing this issue. bbswitch does some ugly things which might interfere with that commit.

You could try the pm-rework branch (possibly without the last vga_switcheroo patch) and see if it improves the situation for you. It is architecturally quite different.

liskin commented 6 years ago

But what's the point of bbswitch with https://github.com/torvalds/linux/commit/abf92f86361b? At least on ThinkPad T25 it seems that simply disabling bbswitch and enabling runtime-pm in laptop-mode-tools is enough for the dGPU to power off when the nvidia module is unloaded and power back on when it gets loaded.

Without https://github.com/torvalds/linux/commit/abf92f86361b, this doesn't work since as soon as runtime-pm is enabled for the card, the PCI state is lost and there's no way to enable it without a reboot, and this is true for both nvidia and nouveau. But with the patch, everything seems to work just fine. (Except bumblebeed which insists on having a PMMethod to unload the module, so I now need to rmmod it manually. I guess that's worth a separate bug report.)

Now I just need to check whether dropping bbswitch and using port pm fixes the battery drain during system suspend. :-)

Zeben commented 6 years ago

@liskin Can you describe in more detail how to set up power saving on Linux properly? Here is my issue: I use Arch Linux with the latest updates. I usually play some games on the dGPU using the proprietary Nvidia driver + bbswitch + bumblebee. I also use laptop-mode-tools. I haven't edited any config files, so they're all in their default state. When the AC adapter is connected to my laptop, everything works fine. When I disconnect it, bumblebee stops working with these messages:

[  205.460191] [INFO]Response: No - error: Could not load GPU driver
[  205.460214] [ERROR]Cannot access secondary GPU - error: Could not load GPU driver
[  205.460222] [DEBUG]Socket closed.

... and dmesg response:

[   71.717617] bbswitch: enabling discrete graphics
[   75.875371] pci 0000:01:00.0: enabling device (0000 -> 0003)
[   75.982305] ipmi message handler version 39.2
[   75.983821] ipmi device interface
[   76.100767] nvidia: module license 'NVIDIA' taints kernel.
[   76.100768] Disabling lock debugging due to kernel taint
[   76.116347] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[   76.216157] NVRM: The NVIDIA GPU 0000:01:00.0
               NVRM: (PCI ID: 10de:1346) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
**[   76.266331] nvidia: probe of 0000:01:00.0 failed with error -1**
[   76.266363] NVRM: The NVIDIA probe routine failed for 1 device(s).
[   76.266364] NVRM: None of the NVIDIA graphics adapters were initialized!

After this the dGPU stays powered on: acpi reports half the remaining battery time. To disable the dGPU I need to restart the bumblebeed service, but I still can't use the dGPU when the AC adapter is unplugged. On Linux 4.16.12 this worked without problems, with pcie_port_pm=off.

Second: when I watch videos in VLC, my dGPU powers on and stays on even after VLC is closed. When I try to restart bumblebeed, I get a stack trace in dmesg:

...
[  829.838962] CPU: 0 PID: 18843 Comm: bumblebeed Tainted: P           O      4.17.4-1-ARCH #1
[  829.838962] Hardware name: Dell Inc. Vostro 14-5459/080W31, BIOS 1.1.4 05/14/2018
[  829.838965] RIP: 0010:pci_disable_device+0x8a/0xa0
[  829.838966] RSP: 0018:ffffb274812d7dd8 EFLAGS: 00010286
[  829.838967] RAX: 0000000000000000 RBX: ffffa048f9aa7000 RCX: 0000000000000001
[  829.838968] RDX: 0000000080000001 RSI: 0000000000000082 RDI: 00000000ffffffff
[  829.838969] RBP: ffffa048f9a2a390 R08: 000001d20c68992c R09: 0000000000000390
[  829.838970] R10: ffffffff895ef6e0 R11: 0000000000000000 R12: 0000559091bb8ac0
[  829.838970] R13: ffffb274812d7f08 R14: 0000559091bb8ac0 R15: 0000000000000000
[  829.838972] FS:  00007fe4675b3040(0000) GS:ffffa04903c00000(0000) knlGS:0000000000000000
[  829.838973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  829.838973] CR2: 00007ffdb1dc7dd0 CR3: 000000018414a001 CR4: 00000000003606f0
[  829.838974] Call Trace:
[  829.838992]  bbswitch_off.cold.4+0xc1/0x1d1 [bbswitch]
[  829.838994]  bbswitch_proc_write+0xaf/0xd0 [bbswitch]
[  829.839002]  proc_reg_write+0x3c/0x60
[  829.839013]  __vfs_write+0x36/0x170
[  829.839018]  vfs_write+0xa9/0x190
[  829.839020]  ksys_write+0x4f/0xb0
[  829.839022]  do_syscall_64+0x5b/0x170
[  829.839025]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

...

My assumptions are:

  1. In kernels newer than 4.16.12, power management received some major changes, which interfere with or break the standard bbswitch behaviour.
  2. My laptop-mode-tools configuration also interferes with the new kernel and its new PM mechanisms.

Please help me solve the issue. The one thing I do know is that staying on Linux < 4.16.13 isn't a good idea, but as a temporary solution I can try linux-lts.

liskin commented 6 years ago

@Zeben You need to disable/drop bbswitch. If you still get "fallen off the bus and is not responding to commands" afterwards, that means https://github.com/torvalds/linux/commit/abf92f86361b isn't working with your hardware and you might need to revert to pcie_port_pm=off and/or blacklist the device in laptop-mode-tools' runtime-pm.

Zeben commented 6 years ago

@liskin Hello. Thank you very much for the fast response. I've been experimenting with different setups and ended up in a "one thing solved, another thing broken" state:

  1. I've removed pcie_runtime_pm=off from a kernel command-line options;
  2. I've enabled runtime-pm in LMT via lmt_gui;
  3. I've uninstalled bbswitch package.

Results:

  1. optirun and the applications run through it work without any problems, both with the AC adapter plugged in and unplugged;
  2. the nvidia kernel modules don't unload after I exit the application launched by optirun;
  3. my dGPU stays powered on, always, even when unused. The battery discharges 2x faster and the fan spins more often. I tested this via the acpi command and its estimated battery time.

With bbswitch installed the dGPU is disabled properly, but then all the symptoms from my previous message return.

I'm really confused. :(

liskin commented 6 years ago

No need to be confused, this is expected, and I already mentioned it:

(Except bumblebeed which insists on having a PMMethod to unload the module, so I now need to rmmod it manually. I guess that's worth a separate bug report.)

Try rmmod nvidia-modeset && rmmod nvidia and see if power usage is okay. Next optirun should load these modules again and power up the card.
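Before checking power usage it is worth confirming that the modules are really gone. This is a minimal Python sketch that filters /proc/modules, assuming the usual layout where the module name is the first field of each line; loaded_modules is an illustrative name.

```python
def loaded_modules(proc_modules_text, prefix="nvidia"):
    """Return the names of loaded modules that start with `prefix`."""
    names = [
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    ]
    return [name for name in names if name.startswith(prefix)]


def read_proc_modules(path="/proc/modules"):
    """Read the list of currently loaded kernel modules."""
    with open(path) as handle:
        return handle.read()
```

An empty list from loaded_modules(read_proc_modules()) means both rmmod calls succeeded. The kernel lists the module as nvidia_modeset (with an underscore) even when it is removed as nvidia-modeset.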

Zeben commented 6 years ago

@liskin Noted. I tried to rmmod the modules, but it seems the dGPU stays powered on, even after a laptop restart. I'm not sure how to tell exactly whether the dGPU is powered or not, so I'm only guessing from the output of the acpi and powertop commands. Here are the results after restarting the laptop with all nvidia-related modules unloaded:

skovo@devliner ~ % lsmod | grep nv
skovo@devliner ~ % acpi
Battery 0: Discharging, 75%, 02:43:06 remaining

Powertop reports:

The battery reports a discharge rate of 12.2 W

But here are the results right after a "hacky trick": I installed the bbswitch module again and restarted bumblebeed:

skovo@devliner ~ % acpi
Battery 0: Discharging, 70%, 04:09:52 remaining

... and Powertop:

The battery reports a discharge rate of 5.58 W

liskin commented 6 years ago

@Zeben Well, that's strange, in my case rmmod and laptop-mode-tools is enough to power the dGPU down, which on this ThinkPad means going from 7 watts to 5 watts. And I never restart.

Also, the runtime_status of the PCI device shows suspended:

$ cat /sys/bus/pci/devices/0000\:02\:00.0/power/runtime_status 
suspended
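The same sysfs check can be done programmatically, together with the power/control attribute that decides whether runtime PM is allowed for the device ("auto") or not ("on"). Below is a minimal Python sketch; the sysfs layout is the standard one used in the command above, while the function names and the sysfs_root parameter are illustrative.

```python
from pathlib import Path


def runtime_pm_paths(pci_address, sysfs_root="/sys"):
    """Return the (runtime_status, control) sysfs paths for a PCI device."""
    power = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_address / "power"
    return power / "runtime_status", power / "control"


def read_runtime_pm(pci_address, sysfs_root="/sys"):
    """Return e.g. {"runtime_status": "suspended", "control": "auto"}."""
    status, control = runtime_pm_paths(pci_address, sysfs_root)
    return {
        "runtime_status": status.read_text().strip(),
        "control": control.read_text().strip(),
    }
```

If read_runtime_pm("0000:01:00.0") reports control as "on", the device will stay active no matter which modules are unloaded, because runtime PM has not been enabled for it.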

Zeben commented 6 years ago

@liskin Thank you again for the command; now I can see exactly whether the dGPU is powered or not. Here are the results after a laptop restart:

$ lspci | grep GeForce
01:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 930M] (rev a2)
$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
active

No nvidia modules are loaded.

And after installing bbswitch again and restarting bumblebeed.service...

$ dmesg | tail
[  372.669685] bbswitch: loading out-of-tree module taints kernel.
[  372.669940] bbswitch: version 0.8
[  372.669946] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.GFX0
[  372.669952] bbswitch: Found discrete VGA device 0000:01:00.0: \_SB_.PCI0.RP01.PEGP
[  372.669962] ACPI Warning: \_SB.PCI0.RP01.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20180313/nsarguments-66)
[  372.670105] bbswitch: detected an Optimus _DSM function
[  372.792842] pci 0000:01:00.0: enabling device (0006 -> 0007)
[  372.793172] bbswitch: Succesfully loaded. Discrete card 0000:01:00.0 is on
[  372.800773] bbswitch: disabling discrete graphics
$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
suspended

liskin commented 6 years ago

Hm, but laptop-mode-tools/tlp should really power it off by itself. bbswitch and runtime PM don't interact cleanly (I've heard about memory corruption).

Zeben commented 6 years ago

@liskin After a dozen experiments I've made a little progress. I completely uninstalled laptop-mode-tools and installed the tlp package, which I didn't know about before. I read parts of its documentation, updated Linux back to 4.17.4, enabled pcie_port_pm=on, uninstalled bbswitch and ran some tests. Results:

  1. tlp doesn't enable runtime PM for my dGPU:

    >> Bad           Enable SATA link power management for host0
    Bad           Enable SATA link power management for host1
    Bad           VM writeback timeout
    Bad           Runtime PM for PCI Device NVIDIA Corporation GM108M [GeForce 930M]
    ...
    ...

    ... but if I enable it manually via powertop, I get amazing power-saving results!

    $ acpi
    Battery 0: Discharging, 84%, **07:12:10** remaining

  2. optirun works as expected, just as you described. To disable the dGPU, I need to remove the nvidia_modeset and nvidia modules. I haven't found any other issues.
      
      $ optirun -b primus glxgears -info 
      GL_RENDERER   = GeForce 930M/PCIe/SSE2
      GL_VERSION    = 4.6.0 NVIDIA 396.24
      GL_VENDOR     = NVIDIA Corporation
      ...
      $ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
      active

After closing the optirun-related application:

$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
active

After removing the modules:

$ sudo rmmod nvidia_modeset
$ sudo rmmod nvidia
$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
suspended

So, the remaining things are:
1. "teach" bumblebee to unload the needed modules properly;
2. "teach" `tlp` to enable runtime PM for the dGPU (this one is not solved yet);
3. the dGPU powers on when I plug in the AC adapter, and the `cat` command gives me wrong results:

$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
suspended

$ sensors
iwlwifi-virtual-0
Adapter: Virtual device
temp1:        +51.0°C

dell_smm-virtual-0
Adapter: Virtual device
Processor Fan: 3041 RPM
fan2:             N/A
CPU:            +41.0°C
**GPU:            +53.0°C**

... and the fan keeps spinning.
:/

Zeben commented 6 years ago

Bump:

dGPU is powering on when I plug AC adapter; cat command gives me wrong results:

The issue was solved by changing RUNTIME_PM_ON_AC from on to auto, which enables runtime PM for all PCI devices even when the AC adapter is plugged in. But I'm not sure if that's right...

liskin commented 6 years ago

@Zeben The whole point of disabling pm on AC is that you get rid of those waits of (possibly) tens or hundreds of milliseconds for devices to power up. Try doing lspci on battery: it's not instantaneous but takes almost a second. Try plugging in headphones: there's a somewhat annoying click when the soundcard powers down. On the other hand, with pm enabled, you'll hear your fan a lot less often. It's your decision to make. :-)

Zeben commented 6 years ago

So, after more experiments, I've reached some conclusions.

Everything works without problems in three types of configuration.

  1. Installed packages: linux 4.17.4, bbswitch, bumblebee, tlp, tlp-ui, nvidia. Blacklisted the 01:00.0 Nvidia card in the RUNTIME_PM_BLACKLIST variable. Using pcie_port_pm=on in kernel command-line options. Works when plugging/unplugging the AC adapter. Works when enabling/disabling dGPU power. No errors.

  2. Installed packages: linux 4.17.4, bumblebee, tlp, tlp-ui, nvidia. bbswitch removed. Using pcie_port_pm=on in kernel command-line options. Tip-1: to let the dGPU power off, we need to unload the nvidia and nvidia_modeset kernel modules manually. Tip-2: when the AC adapter is plugged in or unplugged, the dGPU stays powered on. We need to find our dGPU vendor/device via lspci -nn, add the device to the udev rules, and always set it to auto instead of on. I guess this will be the default configuration in future versions of Linux-based distributions, after some fixes.

  3. Legacy configuration. Installed packages: linux 4.16.12, bumblebee, bbswitch-dkms, nvidia-dkms, laptop-mode-tools. Using pcie_port_pm=off in kernel command-line options. No additional changes needed.
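Tip-2 in the second configuration could be sketched roughly as below. The helper prints a udev rule that sets `power/control` to `auto` for one PCI device when it is added. The function name `make_runtime_pm_rule` and the rules file name in the usage line are hypothetical, and the IDs must come from your own `lspci -nn` output. Whether such a rule holds up against tlp's own changes on AC events is not confirmed in this thread.

```shell
#!/bin/sh
# Print a udev rule enabling runtime PM ("auto") for one PCI device.
# $1: vendor ID, $2: device ID, both as 4-digit hex without "0x",
#     i.e. the two halves of the [vvvv:dddd] pair shown by `lspci -nn`.
make_runtime_pm_rule() {
    vendor="$1"
    device="$2"
    printf 'ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x%s", ATTR{device}=="0x%s", ATTR{power/control}="auto"\n' \
        "$vendor" "$device"
}
```

Usage, with 10de being NVIDIA's vendor ID and the device ID left for you to fill in: `make_runtime_pm_rule 10de <device-id> | sudo tee /etc/udev/rules.d/80-dgpu-runtime-pm.rules`.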

Many thanks to @liskin for the suggestions and tips. Maybe our conversation will be helpful for those who have the same issues. Waiting for a complete implementation of dynamic switchable graphics, out of the box, without bbswitch.

real-or-random commented 6 years ago

Hm, for me removing pcie_port_pm=off does not help. Without that, I cannot load the nvidia driver.

However, the problem with 4.16.13 went away in a later kernel version (actually already some weeks ago, I just forgot to report it here). So for me, pcie_port_pm=off is still the way to go...

Zeben commented 6 years ago

@real-or-random I've combined two technologies to make switchable graphics usable: runtime PM for all devices (by keeping pcie_port_pm=on or removing the option completely) and blacklisting the dGPU in tlp. As a result, bbswitch doesn't interfere with Linux's new runtime PM, and the bbswitch-related tracebacks in dmesg are gone. Now bbswitch completely controls the dGPU device, and the device isn't controlled by runtime PM.

real-or-random commented 6 years ago

I tried that but it didn't work. But I'm not convinced that the blacklisting in tlp worked, because powertop still showed PM as enabled for the NVIDIA card. Is that the right place to check? (Where can I check manually?)
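One way to check manually is to read the device's `power/control` and `power/runtime_status` files in sysfs. In `power/control`, `auto` means the kernel may runtime-suspend the device and `on` means it is kept powered. The sketch below lists both values for every PCI device of a given vendor. `list_vendor_pm` and `SYSFS_ROOT` are hypothetical names used for illustration. The expectation that a successfully blacklisted device reads `on` is an assumption about tlp, not something confirmed in this thread.

```shell
#!/bin/sh
# List power/control and power/runtime_status for all PCI devices of a vendor.
# $1: vendor ID as sysfs prints it (default 0x10de, NVIDIA's PCI vendor ID).
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
list_vendor_pm() {
    vendor="${1:-0x10de}"
    root="${SYSFS_ROOT:-/sys}"
    for d in "$root"/bus/pci/devices/*; do
        [ -r "$d/vendor" ] || continue
        [ "$(cat "$d/vendor")" = "$vendor" ] || continue
        control=$(cat "$d/power/control" 2>/dev/null || echo "?")
        status=$(cat "$d/power/runtime_status" 2>/dev/null || echo "?")
        # ${d##*/} strips the directory part, leaving the PCI address.
        echo "${d##*/} control=$control runtime_status=$status"
    done
}
```

Running `list_vendor_pm` with no arguments prints one line per NVIDIA device.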

IngeniousDox commented 6 years ago

@liskin I have a Dell XPS 15 9570, and I have reached the same point as you. I'm using runtime PM without bbswitch, because using bbswitch (normal or pm-rework branch) results in a dGPU you cannot power back on. Optirun / Primusrun both work, they load the nvidia module, but unloading afterwards does not work. I tried normal bumblebee and bumblebee-git with the development branch with libkmod2. So I have to remove it with modprobe -r.

You said something about making a bug report about it, but I can't find anything. Did I miss it, or were you still planning to make it? I guess we need a new PMMethod=modules_only that only unloads the modules? You seem to know what the issue is; I'm rather new to laptops with nvidia / bumblebee.

@Lekensteyn We talked on IRC briefly about your pm-rework branch. I thought it was working, since compared to the normal bbswitch, it turned the dGPU on / off. However I could not load the nvidia module due to:

[ 1604.981868] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 1604.982029] NVRM: This is a 64-bit BAR mapped above 4GB by the system
               NVRM: BIOS or the Linux kernel, but the PCI bridge
               NVRM: immediately upstream of this GPU does not define
               NVRM: a matching prefetchable memory window.

This probably has to do with the torvalds/linux@abf92f8 commit you already pointed out before. As you suggested, I switched to using runtime-pm (where Bumblebee is not unloading the nvidia module, as I wrote at the start of this post). But I figured I'd follow up with you on my attempt to use pm-rework.