Bumblebee-Project / Bumblebee

Bumblebee daemon and client rewritten in C
http://www.bumblebee-project.org/
GNU General Public License v3.0

[Debian] Nvidia Card randomly turning on #144

Closed Hoverbear closed 11 years ago

Hoverbear commented 12 years ago

Hi all, I'm having some issues with Bumblebee on Debian Wheezy. Upon starting the daemon, power management seems to be working all right, but after a while the card seems to just randomly turn on. At first I thought it was Flash accessing the nvidia 32-bit glx or something, but I don't think that's the case.

Using Bumblebee with the Nvidia Binary driver (Though the problem exists with Nouveau as well)

# cat /etc/modprobe.d/nvidia.conf 
blacklist nvidia

Here's a grab from /var/log/messages:

May  2 08:12:22 turing kernel: [14433.970674] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:12:22 turing kernel: [14433.970697] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:12:22 turing kernel: [14433.970709] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
May  2 08:12:22 turing kernel: [14433.970724] nvidia 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  2 08:12:22 turing kernel: [14433.970731] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:12:22 turing kernel: [14433.970775] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
May  2 08:12:22 turing kernel: [14433.971486] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  295.40  Thu Apr  5 21:37:00 PDT 2012
May  2 08:23:14 turing kernel: [15085.014467] bbswitch: disabling discrete graphics
May  2 08:23:14 turing kernel: [15085.030200] pci 0000:01:00.0: Refused to change power state, currently in D0
May  2 08:23:14 turing kernel: [15085.031875] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:23:14 turing kernel: [15085.142589] pci 0000:01:00.0: power state changed by ACPI to D3
May  2 08:23:25 turing kernel: [15095.819263] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:23:25 turing kernel: [15095.819273] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:23:25 turing kernel: [15095.819339] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:23:25 turing kernel: [15095.819350] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
May  2 08:23:25 turing kernel: [15095.819366] nvidia 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  2 08:23:25 turing kernel: [15095.819401] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
May  2 08:23:25 turing kernel: [15095.820118] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  295.40  Thu Apr  5 21:37:00 PDT 2012
May  2 08:23:30 turing kernel: [15101.269372] bbswitch: disabling discrete graphics
May  2 08:23:30 turing kernel: [15101.285279] pci 0000:01:00.0: Refused to change power state, currently in D0
May  2 08:23:30 turing kernel: [15101.286816] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:23:31 turing kernel: [15101.397524] pci 0000:01:00.0: power state changed by ACPI to D3
May  2 08:24:30 turing kernel: [15160.564168] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:24:30 turing kernel: [15160.564178] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:24:30 turing kernel: [15160.564183] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
May  2 08:24:30 turing kernel: [15160.564191] nvidia 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  2 08:24:30 turing kernel: [15160.564209] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
May  2 08:24:30 turing kernel: [15160.564527] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  295.40  Thu Apr  5 21:37:00 PDT 2012
May  2 08:24:30 turing kernel: [15160.565381] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:27:57 turing kernel: [15366.801038] bbswitch: disabling discrete graphics
May  2 08:27:57 turing kernel: [15366.814651] pci 0000:01:00.0: Refused to change power state, currently in D0
May  2 08:27:57 turing kernel: [15366.815807] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:27:57 turing kernel: [15366.927034] pci 0000:01:00.0: power state changed by ACPI to D3
May  2 08:28:25 turing kernel: [15395.063147] bbswitch: enabling discrete graphics
May  2 08:28:25 turing kernel: [15395.304803] pci 0000:01:00.0: power state changed by ACPI to D0
May  2 08:28:25 turing kernel: [15395.304826] pci 0000:01:00.0: power state changed by ACPI to D0
May  2 08:28:25 turing kernel: [15395.304855] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:28:25 turing kernel: [15395.304943] pci 0000:01:00.0: power state changed by ACPI to D0
May  2 08:28:25 turing kernel: [15395.304951] pci 0000:01:00.0: power state changed by ACPI to D0
May  2 08:28:25 turing kernel: [15395.304968] pci 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  2 08:28:25 turing kernel: [15395.339725] bbswitch: disabling discrete graphics
May  2 08:28:25 turing kernel: [15395.355479] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:28:25 turing kernel: [15395.466181] pci 0000:01:00.0: power state changed by ACPI to D3

Hoverbear commented 12 years ago

Managed to get a clip of /var/log/messages right after (I think) the card turned on


/var/log# tail messages
May  2 08:35:47 turing kernel: [15836.349142] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf392b000-0xf392bfff] (PCI address [0xf392b000-0xf392bfff])
May  2 08:35:47 turing kernel: [15836.349152] e1000e 0000:00:19.0: BAR 2: set to [io  0x6080-0x609f] (PCI address [0x6080-0x609f])
May  2 08:35:59 turing kernel: [15847.909345] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:35:59 turing kernel: [15847.909353] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:35:59 turing kernel: [15847.909386] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:35:59 turing kernel: [15847.909402] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
May  2 08:35:59 turing kernel: [15847.909423] nvidia 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  2 08:35:59 turing kernel: [15847.909469] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
May  2 08:35:59 turing kernel: [15847.910192] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  295.40  Thu Apr  5 21:37:00 PDT 2012
May  2 08:36:15 turing kernel: [15863.962150] [Hardware Error]: Machine check events logged

Lekensteyn commented 12 years ago

What is that "Hardware Error"? After blacklisting, did you run update-initramfs -u? Always check whether /dev/nvidia{ctl,0} exist. If they do, then you can expect your card to power on at random.
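The device-node check can be scripted along these lines. This is only a minimal sketch: the function name `list_nvidia_nodes` and its optional directory argument are made up for illustration and are not part of Bumblebee. It looks for just the two nodes named in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print any NVIDIA device nodes found in a directory.
# The directory defaults to /dev; the argument exists only to ease testing.
list_nvidia_nodes() {
    dir="${1:-/dev}"
    for node in nvidiactl nvidia0; do
        # -e matches any file type, so a scratch directory works as well.
        if [ -e "$dir/$node" ]; then
            printf '%s\n' "$dir/$node"
        fi
    done
}
```

Calling `list_nvidia_nodes` with no argument inspects /dev and prints nothing when neither node is present.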

Hoverbear commented 12 years ago

Hi Lekensteyn, I just updated Bumblebee from http://suwako.nomanga.net/ .... Going to do some things and see if the problem is still present. I had not run update-initramfs; I've run it now. However, /dev/nvidia{ctl,0} both exist.

Lekensteyn commented 12 years ago

update-initramfs just makes sure that the blacklist applies at next boot. When the /dev/nvidia* files exist, your card will be turned on at random. So, by blacklisting, you prevent that from happening on the next boot until you use the nvidia card through optirun.

Hoverbear commented 12 years ago

After doing update-initramfs -u and rebooting, as well as updating to the newest bumblebee from said repo.... The problem seems to be solved. I will re-open the issue if I encounter the issue again.

Thank you Lekensteyn for your help.

Hoverbear commented 12 years ago

It appears I'm still having this issue. I have the nvidia module blacklisted in my /etc/modprobe.d/nvidia.conf, have run update-initramfs -u.

Lekensteyn commented 12 years ago

Does /dev/nvidia* exist after such a failure? Note that you can just purge the nvidia driver if you do not need to use the nvidia card through optirun or CUDA.

Hoverbear commented 12 years ago

I'm noting the appearance of /dev/nvidia{ctl,0} and /dev/nvram even when the card is off. I would like to keep the ability to use the Nvidia card in the future, as I am interested in CUDA.

$  lspci -v -d 10de:
01:00.0 VGA compatible controller: NVIDIA Corporation GF119 [Quadro NVS 4200M] (rev ff) (prog-if ff)
    !!! Unknown header type 7f
$ ls /dev/nv*
/dev/nvidia0  /dev/nvidiactl  /dev/nvram

Lekensteyn commented 12 years ago

Whenever the card is enabled, you can disable it by triggering a start/stop action: optirun true. This is not ideal, but at least it disables the card. I've noticed that some users who have CUDA installed experience this issue more often.

Hoverbear commented 12 years ago

Hi Lekensteyn, I guess that's a suitable workaround. Thank you. Closing this issue.

Lekensteyn commented 12 years ago

@Hoverbear I've noticed that the NVreg_ModifyDeviceFiles=0 module option prevents the /dev/nvidia* files from being created. This does indeed do what its description suggests, rendering the module unusable. However, it also seems that these files can be removed safely when the module is unloaded. Can you try that and see if you have any issues when manually removing those files?
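To try the manual removal only while the module is really unloaded, a guard like the one below may help. It is a sketch under assumptions: `module_loaded` is a hypothetical helper, and its second argument (a stand-in for /proc/modules) is there only so the function can be exercised without the driver.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel module is listed as loaded.
# Each line of /proc/modules starts with the module name followed by a space,
# so "nvidia " will not match a differently named module such as "nvidia_drm".
module_loaded() {
    name="$1"
    modules_file="${2:-/proc/modules}"
    grep -q "^${name} " "$modules_file"
}
```

As root, `module_loaded nvidia || rm -f /dev/nvidia0 /dev/nvidiactl` would then delete the two nodes only when the driver is not loaded. On Ubuntu the module name would differ (see the nvidia-current remark further down the thread).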

Hoverbear commented 12 years ago

Hi Lekensteyn, I actually ended up erasing that installation and trying F17, however I'm back to debian and no longer seem to be having the issue.

Lekensteyn commented 12 years ago

Ah, with the proprietary nvidia driver?

Hoverbear commented 12 years ago

Yes, in fact.

hni commented 12 years ago

Hi Lekensteyn, I have the same issue on Arch Linux. So what is the procedure if I want to use optirun but do not want the nvidia card to turn on at random? Will deleting /dev/nvidia* or using NVreg_ModifyDeviceFiles=0 solve the problem permanently without interfering with bumblebee?

EDIT: manually running 'optirun true' is not really helping because often when I am not at my machine, nvidia will decide to activate the card resulting in an increase in temperature between 5 and 20 degrees

Lekensteyn commented 12 years ago

Do you happen to use CUDA?

hni commented 12 years ago

Not that I know. I haven't installed any CUDA related packages consciously. I use the nvidia proprietary driver, my gpu is NVS 4200M inside a T420s Thinkpad. Anecdotally, the only other person I have seen with this problem also uses a T420 with Arch Linux (reference: https://bbs.archlinux.org/viewtopic.php?pid=1112688#p1112688).

Lekensteyn commented 12 years ago

libvdpau is mentioned, do you have that installed? Can you try modifying the C source to delete /dev/nvidia0 and /dev/nvidiactl after disabling the video card? Have a look at src/switchers/switchers.c (iirc)

hni commented 12 years ago

yes, I have libvdpau installed. Working on modifying the C source at the moment. One thing I noticed is that when I do cat /dev/nvidia0, the nvidia card will switch on. Just mentioning it in case it is significant.

Lekensteyn commented 12 years ago

Yeah, that is exactly the problem here. libvdpau probes for the device, which in turn loads the driver and enables the card. That is really, really bad and totally undesirable. I'm curious whether removing the /dev/nvidia{ctl,0} stuff helps.

hni commented 12 years ago

running my own version of bumblebee now with this change: https://github.com/hni/Bumblebee/commit/26f23f29441045bdaac7f0e4a8e46d3f67247e65

Will report back how it works, looks good so far.

hni commented 12 years ago

So far I have no issues. It might be worth adding something similar to the main branch to expose it to more users.

Lekensteyn commented 12 years ago

It may be worth filing this issue on nvnews.net. The char devices are supposed to get unregistered when the module is unloaded.

hni commented 12 years ago

Sorry, I am not familiar with that website. Is that an official NVIDIA forum? I was not able to find a bug tracker there. Do you mean I should start a thread here: http://www.nvnews.net/vbulletin/forumdisplay.php?f=14?

Lekensteyn commented 12 years ago

Yup, that one.

hni commented 12 years ago

posted on the nvidia forums: http://www.nvnews.net/vbulletin/showthread.php?t=184442

powertomato commented 12 years ago

@hni Your changes didn't solve the issue for me. Another device file named "/dev/nvram" was appearing, so I tried removing that one too, but that didn't solve it either. After some searching I noticed that, despite blacklisting the nvidia module, it was still present in the initramfs. This is actually a distribution bug, but in case someone runs into this problem again I'd suggest checking the initramfs before reopening the issue.

Edit: As it didn't solve the issue I kept the nvram device. I only listed what I tried in case it had something to do with this; thanks for the clarification. What I meant is that even though I rebuilt the initramfs and blacklisted the correct module, it was still present in there (I extracted the contents and nvidia.ko was inside).
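Checking the initramfs for a leftover nvidia module can be done without extracting it. A rough sketch, assuming a file listing is piped in on stdin: `find_nvidia_modules` is a hypothetical name, and the match is deliberately loose, so it also catches names such as nvidia-current.ko or compressed variants.

```shell
#!/bin/sh
# Hypothetical filter: read an initramfs file listing on stdin and print the
# entries whose file name starts with "nvidia" and ends in .ko (optionally
# followed by a compression suffix). Exit status is 0 if anything matched.
find_nvidia_modules() {
    grep -E '/nvidia[^/]*\.ko(\.[a-z0-9]+)?$'
}
```

On a Debian-style system with initramfs-tools, something like `lsinitramfs "/boot/initrd.img-$(uname -r)" | find_nvidia_modules` would feed it. The image path is an assumption and may differ on other distros.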

Lekensteyn commented 12 years ago

NO! Do not remove /dev/nvram! That device has nothing to do with nvidia, it is a device for "Non-Volatile RAM". See also http://cateee.net/lkddb/web-lkddb/NVRAM.html After blacklisting the nvidia driver you might indeed need to rebuild your initramfs. It is important that the blacklist line has the form blacklist <module name without .ko* suffix>. On Ubuntu the module is named "nvidia-current.ko", therefore you must use "blacklist nvidia-current" instead of just "blacklist nvidia".
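Whether the exact module name is blacklisted can be verified with a small check. This is a sketch, not anything Bumblebee ships: `is_blacklisted` is a hypothetical helper, and it only scans *.conf files in one directory (/etc/modprobe.d by default) for a whole-line match.

```shell
#!/bin/sh
# Hypothetical helper: succeed if "blacklist <module>" appears as a complete
# line in any *.conf file of the given directory. The match is exact, so
# "blacklist nvidia" does not count as blacklisting "nvidia-current".
is_blacklisted() {
    module="$1"
    dir="${2:-/etc/modprobe.d}"
    # -s keeps grep quiet when the directory has no *.conf files.
    grep -qsE "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*$" "$dir"/*.conf
}
```

For example, `is_blacklisted nvidia-current || echo "not blacklisted"` would flag the Ubuntu case described above.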

gnetwork-git commented 12 years ago

Hi guys. Lekensteyn you may remember me from the early days testing Bumblebee. On Mint 13 Bumblebee runs beautifully, best ever. I have recently been assisting with new distro SolusOS (Debian) and experiencing exactly same problem described here, and would like to implement the fix as described above, for myself and help the other users. The install was pretty seamless from repo at http://suwako.nomanga.net/ As for this current fix, is there any way to put the code to work in my existing install? or do i have to build from source from https://github.com/hni/Bumblebee/tarball/master Thankyou, G

nxdefiant commented 12 years ago

Probably not the preferred solution, but you can always add a udev rule that deletes the devices on unload:

/etc/udev/rules.d/99_nvidia.rules:
DEVPATH=="/module/nvidia", ACTION=="remove", RUN+="/bin/rm /dev/nvidia0 /dev/nvidiactl"

Lekensteyn commented 12 years ago

Building from source would be the cleanest option, but nxdefiant also has a nice idea if that works.

gnetwork-git commented 12 years ago

ok thankyou. I was just hoping to make it easier for other users. might try that hack above too. I guess it will be included somehow in future, there are lots of Debian Wheezy users that may get stumped by this...

gnetwork-git commented 12 years ago

that hack is just that, thanks for effort nxdefiant. card stayed OFF ok thru tests flash etc. but optirun glxspheres would not work, optirun glxgears came up but with same warning as glxspheres:

optirun glxgears
NVIDIA: could not open the device file /dev/nvidiactl (Permission denied).
[VGL] WARNING: The OpenGL rendering context obtained on X display
[VGL] :8 is indirect, which may cause performance to suffer.
[VGL] If :8 is a local X display, then the framebuffer device
[VGL] permissions may be set incorrectly.

nxdefiant commented 12 years ago

oh, I'm on Debian, there might be a difference here. My Debian has an udev rule which does give me permissions (I'm in group video):

/lib/udev/rules.d/91-permissions.rules:
...
SUBSYSTEM=="nvidia", GROUP="video"
...

maybe we could use group bumblebee instead

Lekensteyn commented 12 years ago

You might need to change the device permissions http://us.download.nvidia.com/XFree86/Linux-x86_64/304.22/README/faq.html#devicenodes

gnetwork-git commented 12 years ago

no thats all fine, this is my /etc/modprobe.d/nvidia-kernel-common.conf

alias char-major-195* nvidia
options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=44 NVreg_DeviceFileMode=0660
# To enable FastWrites and Sidebus addressing, uncomment these lines
# options nvidia NVreg_EnableAGPSBA=1
# options nvidia NVreg_EnableAGPFW=1

# see #580894
blacklist nouveau

nxdefiant commented 12 years ago

Are you a member of group video?

gnetwork-git commented 12 years ago

$ getent group | grep video
video:x:44:

nxdefiant commented 12 years ago

ok with id 44 as group video: Your user needs to be in the group video. Add your user to the group and relogin.
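To see whether a user is in the group, the group list can be tested directly. A minimal sketch: `has_group` is a hypothetical helper that takes the wanted group followed by a list of group names.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the first argument appears among the
# remaining arguments (a list of group names).
has_group() {
    wanted="$1"
    shift
    for g in "$@"; do
        if [ "$g" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}
```

`has_group video $(id -nG)` checks the current session. A freshly added group only shows up there after logging in again, which is why the relogin is needed.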

gnetwork-git commented 12 years ago

i dont think this is the way to go. the Permission denied message is probably because the file wasn't there when it tried to access it. i am more interested in a good solid solution to be integrated that should work for all.

Lekensteyn commented 12 years ago

If xorg runs as root, the device files should be created. You can create them manually though. https://wiki.archlinux.org/index.php/Lenovo_IdeaPad_Y580#Configurations

gnetwork-git commented 12 years ago

thanks to you both. I am settling in after a large meal and watching Batman Begins (2005). will try these hacks tomorrow, and also the "clean" way from hni's package. ultimately, i want to make a debianized package version available thru repository, for Wheezy and SolusOS. maybe I should contact the guy with the current debian repo too, he must know about this. i know this can be fixed on my machine, but feel its more useful to make it easily available to all (without them going thru 100 steps only to find it fails). will post then, thankyou.

gnetwork-git commented 12 years ago

@nxdefiant I'm on Debian Wheezy too, the only difference on SolusOS is Gnome 3 desktop is modified to look like Gnome 2. My /lib/udev/rules.d/91-permissions.rules also contains SUBSYSTEM=="nvidia", GROUP="video" I avoided your suggestion to add myself to group "video" as have no idea on implications of doing this (stability, security, etc), but after doing so all seems to work fine now, thankyou.

@Lekensteyn as the hni was incomplete, i merged with the standard bumblebee source, then built, the final outcome was a mess, enough said. The code you mentioned, added to /etc/rc.local didn't help (not sure if i did it right anyway - limited instructions).

You say "If xorg runs as root, the device files should be created". xorg is indeed running as root, and the files were not recreated.

nxdefiant's suggestion of creating /etc/udev/rules.d/99_nvidia.rules: DEVPATH=="/module/nvidia", ACTION=="remove", RUN+="/bin/rm /dev/nvidia0 /dev/nvidiactl" did work, but only after adding myself to group "video". From your knowledge is this a potential problem or security issue from doing this? or is there a better way, like patch or something? If ok, for now this is an easy enough fix for the problem in Debian Wheezy.

Lekensteyn commented 12 years ago

@gnetwork-git Hopefully you did not add the acpi-handle-hack, that was something machines-specific ;)

Adding the 99_nvidia.rules thing does not compromise security. Adding a user to the video group allows you to restrict access to a select group. It should be relatively safe, though it also allows members to access other /dev/dri/card* devices. I am not sure what the exact implications are, other than having a direct line with the kernel video driver.

hni commented 12 years ago

@gnetwork-git note that the patch is in the 'develop' branch, not master. Cloning and building that branch should work, I have been running it ever since I forked and there have been no changes

gnetwork-git commented 12 years ago

@hni i used the one from https://github.com/hni/Bumblebee/tarball/master and merged it with the standard.

@Lekensteyn Great. So we now have just a 2 line fix for the problem of cards turning on unnecessarily (usually by Flash or Mplayer, and not under optirun) in Debian Wheezy and possibly other distros.

Run as root and insert your username where appears $USER:

# echo 'DEVPATH=="/module/nvidia", ACTION=="remove", RUN+="/bin/rm /dev/nvidia0 /dev/nvidiactl"' >> /etc/udev/rules.d/99-nvidiactrl.rules
# usermod -a -G video $USER

Will you be doing much more on Bumblebee, or winding down due to coming Prime release, maybe 6 months away? - though I will believe it when I see it!

nxdefiant commented 12 years ago

I'm wondering. I always thought being a member of the video group was required for dri to work?

gnetwork-git commented 12 years ago

@hni if your fork works well, i'm happy to try it, and make available as repo. just post download link and basic instructions. thanks.

hni commented 12 years ago

@gnetwork-git the download link is https://github.com/hni/Bumblebee/tarball/develop. Instructions are the same as for the unpatched bumblebee.

As written further above, I have tried to raise this issue in the nvidia forums so that the fix can be included in the nvidia proprietary driver, but there has been no answer for roughly a month. Hence the question remains whether this patch should be included in mainline. I personally think udev is not the right place to handle this, but I have no strong feelings either way. If it is decided this patch should not be merged into mainline, I might merge it into my forked master and create an Arch Linux package for convenience.

Lekensteyn commented 12 years ago

The devices are supposed to get removed on module unload. Why that isn't happening, I don't know.